Exploiting Safety Reasoning in LLMs

This research reveals critical security vulnerabilities in advanced AI reasoning models by demonstrating how safety reasoning mechanisms can be exploited.

Introduces H-CoT, a novel attack method that hijacks chain-of-thought safety reasoning
Successfully jailbreaks state-of-the-art models including OpenAI's o1/o3, DeepSeek-R1, and Gemini 2.0
Creates the Malicious-Educator benchmark to disguise dangerous requests as legitimate educational prompts
Demonstrates that current safety mechanisms in reasoning models have significant weaknesses

This research highlights urgent security concerns for AI deployment in sensitive contexts, showing that even advanced safety reasoning can be compromised with sophisticated attacks.

H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models