Defending Against Backdoor Attacks in Black-Box LLMs

Using Defensive Demonstrations at Test-Time

This research introduces defensive demonstrations, a test-time approach to mitigating backdoor vulnerabilities in LLMs without requiring access to model internals.

  • Proposes defensive demonstrations as a test-time protection mechanism against malicious triggers in black-box LLMs
  • Retrieves task-relevant demonstrations from a clean, trusted corpus and prepends them to each incoming query to guide the model's output (see the sketch after this list)
  • Demonstrates effectiveness without requiring any model modification or retraining
  • Works with commercial LLMs deployed as web services, where only API access is available
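
To make the mechanism concrete, here is a minimal sketch of the test-time pipeline, assuming a simple bag-of-words retriever and a generic text-completion API. All names here (retrieve_demonstrations, build_prompt, call_llm) are illustrative; the paper's actual retriever, prompt template, and scoring may differ.

```python
# Hypothetical sketch of test-time defensive demonstrations; not the
# paper's reference implementation.
from collections import Counter
import math

def _cosine(a: Counter, b: Counter) -> float:
    # Bag-of-words cosine similarity; a stand-in for a real sentence encoder.
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_demonstrations(query: str,
                            clean_pool: list[tuple[str, str]],
                            k: int = 4) -> list[tuple[str, str]]:
    # Rank trusted (input, label) pairs by similarity to the query and
    # keep the top k as in-context demonstrations.
    q_vec = Counter(query.lower().split())
    ranked = sorted(
        clean_pool,
        key=lambda ex: _cosine(Counter(ex[0].lower().split()), q_vec),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query: str, demos: list[tuple[str, str]]) -> str:
    # Prepend the clean demonstrations so that in-context examples, rather
    # than any trigger embedded in the query, steer the prediction.
    shots = "\n\n".join(f"Input: {x}\nLabel: {y}" for x, y in demos)
    return f"{shots}\n\nInput: {query}\nLabel:"

def defended_query(query: str,
                   clean_pool: list[tuple[str, str]],
                   call_llm) -> str:
    # call_llm is any black-box completion function: text in, text out.
    demos = retrieve_demonstrations(query, clean_pool)
    return call_llm(build_prompt(query, demos))
```

In deployment, call_llm would wrap whatever hosted completion endpoint is in use; the defense only rewrites the prompt, which is why it applies even when nothing beyond API access is available.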

This advancement is crucial for security teams managing AI systems, as it offers a practical defense strategy for deployed models where traditional training-phase defenses cannot be applied.

Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations
