
The Trojan Horse Technique in LLM Security
How harmless story endings can bypass safety safeguards
This research introduces a simple yet effective jailbreak, the Happy Ending Attack, which tricks LLMs by appending a seemingly harmless, happy conclusion to an otherwise malicious prompt.
- Achieves high success rates (up to 95%) against leading LLMs, including GPT-4 and Claude
- Requires no complex optimization or multi-turn interactions
- Shows strong transferability across different models and remains effective even when defensive system prompts are in place
- Exploits a fundamental vulnerability in how LLMs prioritize narrative coherence over safety
This research matters for security because it exposes a critical flaw in current defense mechanisms, one that could be exploited at scale with minimal effort, underscoring the need for more robust safety measures in LLM deployments.