
Defending Against Backdoor Attacks in Black-Box LLMs Using Defensive Demonstrations at Test Time
This research introduces defensive demonstrations, a test-time approach to mitigating backdoor attacks in large language models (LLMs) that requires no access to model internals.
- Proposes defensive demonstrations as a test-time protection mechanism against maliciously triggered inputs to black-box LLMs
- Retrieves task-relevant demonstrations from a clean corpus and combines them with incoming queries to guide the model toward correct outputs (see the sketch after this list)
- Demonstrates effectiveness without any model modification or retraining
- Works with commercial LLMs deployed as web services, where only API access is available
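A minimal sketch of the test-time pipeline is shown below. It assumes a small clean example pool and uses TF-IDF cosine similarity as a stand-in retriever; the example data, the `retrieve_demonstrations` and `build_defended_prompt` helpers, and the choice of retriever are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch: defensive demonstrations at test time for a black-box LLM.
# Assumption: the model is only reachable through an API, so the defense
# happens entirely in prompt construction on the client side.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Clean, trusted (input, label) pairs -- illustrative examples only.
CLEAN_POOL = [
    ("The film was a delight from start to finish.", "positive"),
    ("The plot dragged and the acting felt wooden.", "negative"),
    ("A heartfelt story with memorable characters.", "positive"),
    ("I walked out halfway through; a total mess.", "negative"),
]

def retrieve_demonstrations(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Pick the k clean examples most similar to the incoming query."""
    texts = [text for text, _ in CLEAN_POOL]
    vectorizer = TfidfVectorizer().fit(texts + [query])
    pool_vecs = vectorizer.transform(texts)
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, pool_vecs)[0]
    top = sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)[:k]
    return [CLEAN_POOL[i] for i in top]

def build_defended_prompt(query: str) -> str:
    """Prepend retrieved clean demonstrations to the (possibly poisoned) query."""
    demos = retrieve_demonstrations(query)
    demo_block = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in demos)
    return f"{demo_block}\nInput: {query}\nLabel:"

if __name__ == "__main__":
    # The defended prompt, not the raw user input, is what gets sent to the
    # black-box API endpoint.
    print(build_defended_prompt("The movie was great, honestly."))
```

In practice the clean pool would be a larger task-relevant corpus and the retriever could be any similarity model; the key design point is that everything happens in the prompt, so the defense composes with any API-only deployment.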
This matters for security teams managing deployed AI systems: it offers a practical defense for models where traditional training-phase defenses cannot be applied.
Paper: Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations