The Trojan Horse Technique in LLM Security

How harmless story endings can bypass security safeguards

This research presents the Happy Ending Attack, a simple yet effective jailbreak that tricks LLMs by appending a seemingly harmless, positive story conclusion to a malicious prompt.

  • Achieves high attack success rates (up to 95%) against leading LLMs, including GPT-4 and Claude
  • Requires no complex optimization or multi-turn interactions
  • Transfers strongly across different models and remains effective even against defensive system prompts
  • Exploits a fundamental weakness in how LLMs prioritize narrative coherence over safety constraints

This research matters for security because the flaw it exposes in current defense mechanisms could be exploited at scale with minimal effort, underscoring the need for more robust safety measures in LLM deployments.

Dagger Behind Smile: Fool LLMs with a Happy Ending Story