
Breaking the Logic Chains in LLMs
A Mathematical Framework for Understanding Rule Subversion
This research introduces Logicbreaks, a formal framework that reveals how malicious prompts can subvert rule-following in large language models by exploiting vulnerabilities in their logical reasoning processes.
- Formalizes rule-following as inference in propositional Horn logic and proves that even carefully designed models remain vulnerable to subversion (see the sketch after this list)
- Identifies specific attack patterns that can mislead LLMs into making incorrect logical inferences
- Demonstrates that common defensive measures may be insufficient against sophisticated logical manipulation
- Provides a theoretical foundation for understanding jailbreak attacks in real-world AI systems
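To make the Horn-logic framing concrete, the sketch below shows forward chaining over propositional Horn rules ("if all antecedents hold, conclude the consequent"), which is the style of inference the framework reasons about. The rule set and proposition names (e.g. user_request_is_harmful, refuse) are illustrative assumptions, not the paper's actual encoding.

```python
# Minimal sketch: forward chaining over propositional Horn rules.
# A Horn rule is (antecedents, consequent); a fact is a rule with no antecedents.
# Rule names here are hypothetical and chosen only to mirror a guardrail scenario.

from typing import FrozenSet, List, Set, Tuple

Rule = Tuple[FrozenSet[str], str]  # (set of antecedent propositions, consequent)


def forward_chain(rules: List[Rule], facts: Set[str]) -> Set[str]:
    """Repeatedly fire rules whose antecedents are all known, until a fixpoint."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in rules:
            if antecedents <= known and consequent not in known:
                known.add(consequent)
                changed = True
    return known


if __name__ == "__main__":
    rules = [
        (frozenset({"user_request_is_harmful"}), "refuse"),
        (frozenset({"refuse"}), "respond_with_refusal"),
    ]
    facts = {"user_request_is_harmful"}

    # Derives {"user_request_is_harmful", "refuse", "respond_with_refusal"}.
    print(forward_chain(rules, facts))

    # A "rule suppression" style attack corresponds to the model behaving as if
    # the first rule were deleted, so "refuse" (and everything downstream of it)
    # is never derived even though the triggering fact is present.
    print(forward_chain(rules[1:], facts))
```

The point of the toy example is only to show the target behavior: a faithful rule-follower computes the full closure of its rules, while a subverted one effectively drops rules or facts mid-derivation.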
For security professionals, this work offers critical insight into fundamental vulnerabilities in AI guardrails, helping them anticipate and mitigate exploits that bypass logic-based safety mechanisms.
Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference