Exploiting Safety Reasoning in LLMs

Exploiting Safety Reasoning in LLMs

How chain-of-thought safety mechanisms can be bypassed

This research reveals critical security vulnerabilities in advanced AI reasoning models by demonstrating how safety reasoning mechanisms can be exploited.

  • Introduces H-CoT, a novel attack method that hijacks chain-of-thought safety reasoning
  • Successfully jailbreaks state-of-the-art models including OpenAI's o1/o3, DeepSeek-R1, and Gemini 2.0
  • Creates the Malicious-Educator benchmark to disguise dangerous requests as legitimate educational prompts
  • Demonstrates that current safety mechanisms in reasoning models have significant weaknesses

This research highlights urgent security concerns for AI deployment in sensitive contexts, showing that even advanced safety reasoning can be compromised with sophisticated attacks.

H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models

95 | 157