Evolving Jailbreak Attacks on LLMs

A more efficient approach through pattern and behavior learning

This research introduces Self-Instruct Few-Shot Jailbreaking (SI-FSJ), a novel attack that improves efficiency by decomposing the jailbreak into two phases: pattern learning, which acquires the structural form of successful attack demonstrations, and behavior learning, which targets the specific harmful behavior.

  • Reduces required context length by up to 80% compared to previous methods
  • Achieves higher attack success rates against advanced models like Meta-Llama-3-8B
  • Employs self-instruction to generate effective attack patterns without human intervention
  • Works across multiple LLM providers, highlighting widespread vulnerabilities
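The core mechanic behind the approach can be illustrated with a minimal sketch. This is not the authors' code; the function name, the placeholder demos, and the prompt format are all illustrative assumptions. It shows only the generic few-shot structure: demonstration pairs carry the learned pattern, and the final unanswered query carries the target behavior the model is induced to complete in the same style.

```python
# Illustrative sketch (assumed, not from the paper): assembling a few-shot
# prompt from demonstration pairs plus a final target query.
def build_few_shot_prompt(demos, target_query):
    """Concatenate (query, response) demo pairs, then append the target.

    The demos supply the learned *pattern*; the trailing query names the
    *behavior*, left unanswered so the model completes it in kind.
    """
    parts = []
    for query, response in demos:
        parts.append(f"User: {query}\nAssistant: {response}")
    # Target query reuses the demonstrated pattern with an empty response.
    parts.append(f"User: {target_query}\nAssistant:")
    return "\n\n".join(parts)

# Placeholder demos standing in for self-instructed attack patterns.
demos = [
    ("<placeholder query 1>", "<placeholder response 1>"),
    ("<placeholder query 2>", "<placeholder response 2>"),
]
prompt = build_few_shot_prompt(demos, "<target behavior query>")
```

Because each demonstration is generated by self-instruction rather than written by hand, the demo list can be kept short, which is what drives the reduced context-length requirement noted above.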

For security professionals, this research reveals critical weaknesses in existing LLM safety measures. It demonstrates how attackers can efficiently circumvent content filters with only a handful of examples, underscoring the need for more robust defense mechanisms.

Self-Instruct Few-Shot Jailbreaking: Decompose the Attack into Pattern and Behavior Learning
