
Bypassing LLM Safety Guardrails
How structure transformation attacks can compromise even the most secure LLMs
This research reveals critical vulnerabilities in safety-aligned Large Language Models by showing that harmful requests, when encoded in alternative syntax formats, can slip past their guardrails.
- Achieves an attack success rate of ~90% against strongly aligned LLMs such as Claude 3.5 Sonnet
- Draws on diverse syntax spaces, ranging from SQL queries to LLM-generated custom syntaxes
- Demonstrates that models can be induced to produce malicious content despite their safety mechanisms
- Exposes significant gaps in current alignment approaches
These findings highlight the urgent need for more robust security measures in LLM deployment: current safety guardrails can be circumvented through structure transformations alone, without relying on traditional prompt engineering.
Original Paper: StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models