Jailbreaking Attacks and Defense Mechanisms

Research exploring vulnerabilities in LLMs through jailbreaking attacks and developing effective defense strategies

Research on jailbreaking attacks against large language models and the defense mechanisms that counter them

Certified Defense Against LLM Attacks

The first framework to provide certified safety guarantees against adversarial prompts

Defending Against Backdoor Attacks in Black-Box LLMs

Using Defensive Demonstrations at Test-Time

JailGuard: Defending Against Prompt-Based Attacks

A Universal Framework to Detect Jailbreaking and Hijacking Attempts

Cracking the Safeguards: Understanding LLM Security Vulnerabilities

A representation engineering approach to jailbreaking attacks

Evaluating Jailbreak Attacks on LLMs

A new framework to assess attack effectiveness rather than just model robustness

Hidden Attacks: The New Threat to AI Safety

Exploiting embedding spaces to bypass safety measures in open-source LLMs

Efficient Attacks on LLM Defenses

Making adversarial attacks 1000x more computationally efficient

The Security Paradox in Advanced LLMs

How stronger reasoning abilities create new vulnerabilities

Securing LLMs Against Jailbreak Attacks

A Novel Defense Strategy Without Fine-Tuning

ImgTrojan: The Visual Backdoor Threat

How a single poisoned image can bypass VLM safety barriers

LLM Judge Systems Under Attack

How JudgeDeceiver Successfully Manipulates AI Evaluation Systems

The Crescendo Attack

A New Multi-Turn Strategy for Bypassing LLM Safety Guardrails

The Myth of Trigger Transferability

Challenging assumptions about adversarial attacks across language models

Enhancing AI Safety: The MoTE Framework

Combining reasoning chains with expert mixtures for better LLM alignment

Accelerating Jailbreak Attacks

How Momentum Optimization Makes LLM Security Vulnerabilities More Exploitable

Breaking Through LLM Defenses

New Optimization Method Reveals Security Vulnerabilities in AI Systems

Supercharging LLM Security Testing

A novel approach for discovering diverse attack vectors

SelfDefend: A Practical Shield for LLMs

Empowering LLMs to protect themselves against diverse jailbreak attacks

Structural Vulnerabilities in LLMs

How uncommon text structures enable powerful jailbreak attacks

Combating Jailbreak Attacks on LLMs

A standardized toolkit for evaluating harmful content generation

Rethinking LLM Jailbreaks

Distinguishing True Security Breaches from Hallucinations

Jailbreaking LLMs: Exploiting Alignment Vulnerabilities

A novel attack method that bypasses security measures in language models

Breaking the Logic Chains in LLMs

A Mathematical Framework for Understanding Rule Subversion

Smart Query Refinement for Safer LLMs

Using Reinforcement Learning to Improve Prompts and Prevent Jailbreaks

Breaking Through LLM Guardrails with SeqAR

Uncovering security vulnerabilities in large language models through sequential character prompting

Exposing Biases in LLMs Through Adversarial Testing

How jailbreak prompts reveal hidden biases in seemingly safe models

Surgical Precision for Safer LLMs

Enhancing AI safety through targeted parameter editing

The Analytical Jailbreak Threat

How LLMs' Reasoning Capabilities Create Security Vulnerabilities

Fortifying LLMs Against Tampering

Developing tamper-resistant safeguards for open-weight language models

Breaking Through LLM Defenses

A new framework for systematically testing AI safety filters

Hidden Threats: The 'Carrier Article' Attack

How sophisticated jailbreak attacks can bypass LLM safety guardrails

Testing the Guardians: LLM Security Coverage

New metrics for detecting jailbreak vulnerabilities in LLMs

Unlocking LLM Security Architecture

Identifying critical 'safety layers' that protect aligned language models

Defending LLMs Against Unsafe Feedback

Securing RLHF systems from harmful manipulation

Breaking the Guardians: LLM Jailbreak Attacks

A new efficient method to test LLM security defenses

Defending LLMs Against Adversarial Attacks

Refusal Feature Adversarial Training (ReFAT) for Enhanced Safety
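
ReFAT builds on the idea of a "refusal feature": a direction in activation space, commonly estimated as the difference in mean activations between harmful and harmless prompts, whose removal makes a model far more likely to comply with harmful requests. The sketch below shows that generic construction and the corresponding ablation step; tensor names, shapes, and the single-layer setup are illustrative assumptions rather than the paper's exact procedure.

```python
import torch

# Illustrative activations collected at one residual-stream layer for a batch
# of harmful and harmless prompts (shape: [n_prompts, d_model]). In practice
# these would come from forward passes over curated prompt sets.
harmful_acts = torch.randn(128, 4096)
harmless_acts = torch.randn(128, 4096)

# Difference-in-means estimate of the refusal direction, normalized to unit length.
refusal_dir = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
refusal_dir = refusal_dir / refusal_dir.norm()

def ablate_refusal(hidden: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of hidden states, simulating the
    'refusal feature removal' that the adversarial training defends against."""
    coeff = hidden @ refusal_dir                  # projection coefficient per position
    return hidden - coeff.unsqueeze(-1) * refusal_dir
```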

Fortifying LLMs Against Jailbreaking Attacks

A data curation approach to enhance model security during customization

Jailbreak Antidote: Smart Defense for LLMs

Balancing safety and utility with minimal performance impact

The Hidden Danger in LLM Safety Mechanisms

How attackers can weaponize false positives in AI safeguards

Breaking the Guards: Advancing Jailbreak Attacks on LLMs

Novel optimization method achieves 20-30% improvement in bypassing safety measures

Securing LLM Fine-Tuning

A Novel Approach to Mitigate Security Risks in Instruction-Tuned Models

Securing LLMs at the Root Level

A Novel Decoding-Level Defense Strategy Against Harmful Outputs

T-Vaccine: Safeguarding LLMs Against Harmful Fine-Tuning

A targeted layer-wise defense approach for enhanced safety alignment

Identifying Safety-Critical Neurons in LLMs

Mapping the attention heads responsible for safety guardrails

Smarter LLM Safety Guardrails

Balancing security and utility in domain-specific contexts

Enhancing LLM Security Testing

Self-tuning models for more effective jailbreak detection

Stealth Attacks on AI Guardrails

New jailbreak vulnerability using benign data mirroring

Self-Hacking VLMs: The IDEATOR Approach

Using AI to discover its own security vulnerabilities

Emoji Attack: Undermining AI Safety Guards

How emojis can bypass safety detection systems in Large Language Models
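
The core trick is to scatter emoji characters through the text that a Judge-style safety classifier must score, fragmenting its token segmentation while the content stays readable. Below is a minimal sketch of that insertion step; the choice of delimiter and spacing is an assumption for illustration, not the paper's exact placement strategy.

```python
def insert_delimiters(text: str, delimiter: str = "😊", every: int = 3) -> str:
    """Insert an emoji delimiter after every `every` characters of each word,
    perturbing a judge model's tokenization without much affecting readability."""
    out_words = []
    for word in text.split(" "):
        chunks = [word[i:i + every] for i in range(0, len(word), every)]
        out_words.append(delimiter.join(chunks))
    return " ".join(out_words)

print(insert_delimiters("this response would normally be flagged"))
```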

SQL Injection Jailbreak: A New LLM Security Threat

Exploiting structural vulnerabilities in language models

SequentialBreak: Large Language Models Can be Fooled by Embe...

By Bijoy Ahmed Saiem, MD Sadik Hossain Shanto...

Bypassing AI Guardrails with Moralized Deception

Testing how ethical-sounding prompts can trick advanced LLMs into harmful outputs

When Safety Training Falls Short

Testing LLM safety against natural, semantically-related harmful prompts

Guarding LLMs Against Jailbreak Attacks

A proactive defense system that identifies harmful queries before they reach the model

Restoring Safety in Fine-Tuned LLMs

A post-hoc approach to recover safety alignment after fine-tuning

Exploiting Metaphors to Bypass AI Safety

How imaginative language can jailbreak language models

Breaking LLM Safety Guards

A Simple Yet Effective Approach to LLM Jailbreaking

Hidden in Plain Sight: LLM Security Threats

How human-readable adversarial prompts bypass security measures

Defending Against Jailbreak Attacks in LLMs

A novel layer-level approach for enhancing AI safety

Evolving Jailbreak Attacks on LLMs

A more efficient approach through pattern and behavior learning

The Infinite Jailbreak Problem

How advanced LLMs become easier to manipulate through paraphrasing

The Trojan Horse Technique in LLM Security

How harmless story endings can bypass security safeguards

Humor as a Security Shield

Strengthening LLM Defenses Against Injection Attacks

The Scientific Trojan Horse

How scientific language can bypass LLM safety guardrails

Unveiling LLM Safety Mechanisms

Extracting and analyzing safety classifiers to combat jailbreak attacks

Exposing LLM Security Vulnerabilities

A novel approach to understanding and preventing LLM jailbreaking

The Virus Attack: A Critical Security Vulnerability in LLMs

Bypassing Safety Guardrails Through Strategic Fine-tuning

Guarding the Gates: LLM Security Red-Teaming

Detecting and preventing jailbreaking in conversational AI systems

Defending LLMs Against Harmful Fine-tuning

Simple random perturbations outperform complex defenses

Breaking LLM Safeguards with Universal Magic Words

How embedding-based security can be bypassed with simple text patterns

Exploiting LLM Vulnerabilities: The Indiana Jones Method

How inter-model dialogues create nearly perfect jailbreaks

Fortifying AI Defense Systems

Constitutional Classifiers: A New Shield Against Universal Jailbreaks

Proactive Defense Against LLM Jailbreaks

Using Safety Chain-of-Thought (SCoT) to strengthen model security

Blocking LLM Jailbreaks with Smart Defense

Nearly 100% effective method to detect and prevent prompt manipulation attacks

Jailbreaking LLMs at Scale

How Universal Multi-Prompts Improve Attack Efficiency

Exposing the Guardian Shield of AI

A novel technique to detect guardrails in conversational AI systems

Adversarial Reasoning vs. LLM Safeguards

New methodologies to identify AI security vulnerabilities and strengthen defenses against them

Countering LLM Jailbreak Attacks

A three-pronged defense strategy against many-shot jailbreaking

Rethinking AI Safety with Introspective Reasoning

Moving beyond refusals to safer, more resilient language models

Breaking the Jailbreakers

Enhancing Security Through Attack Transferability Analysis

Defending LLMs Against Jailbreak Attacks

Why shorter adversarial training is surprisingly effective against complex attacks

Everyday Jailbreaks: The Unexpected Security Gap

How simple conversations can bypass LLM safety guardrails

Defending LLMs from Jailbreak Attacks

A novel defense framework based on concept manipulation

When AI Deception Bypasses Safety Guards

How language models can be manipulated to override safety mechanisms

The Jailbreak Paradox

How LLMs Can Become Their Own Security Threats

Exploiting LLM Security Weaknesses

A novel approach to jailbreaking aligned language models

Smarter Jailbreak Attacks on LLMs

Boosting Attack Efficiency with Compliance Refusal Initialization

Reasoning-Enhanced Attacks on LLMs

A novel framework for detecting security vulnerabilities in conversational AI

Hidden Vulnerabilities: The R2J Jailbreak Threat

How rewritten harmful requests evade LLM safety guardrails

Detecting LLM Safety Vulnerabilities

A fine-grained benchmark for multi-turn dialogue safety

Exploiting LLM Vulnerabilities

A new context-coherent jailbreak attack method bypasses safety guardrails

Strengthening LLM Security Against Jailbreaks

A Dynamic Defense Approach with Minimal Performance Impact

Bypassing LLM Safety Guardrails

How structure transformation attacks can compromise even the most secure LLMs

Exploiting Safety Reasoning in LLMs

How chain-of-thought safety mechanisms can be bypassed

Strengthening LLM Defense Against Jailbreaking

Using reasoning abilities to enhance AI safety

Defending Against LLM Jailbreaks

ShieldLearner: A Human-Inspired Defense Strategy

Exploiting LLM Security Vulnerabilities in Structured Outputs

How prefix-tree mechanisms can be manipulated to bypass safety filters

Defending LLMs Against Jailbreaking

Efficient safety retrofitting using Direct Preference Optimization

The Anchored Safety Problem in LLMs

Why safety mechanisms fail in the template region

Breaking the Jailbreakers

Understanding defense mechanisms against harmful AI prompts

Bypassing LLM Safety Measures

New attention manipulation technique creates effective jailbreak attacks

Fortifying LLM Defenses

Systematic evaluation of guardrails against prompt attacks

Real-Time Jailbreak Detection for LLMs

Preventing harmful outputs with single-pass efficiency

Defending LLMs Against Jailbreak Attacks

A Novel Safety-Aware Representation Intervention Approach

Exposing the Mousetrap: Security Risks in AI Reasoning

How advanced reasoning models can be more vulnerable to targeted attacks

Evolving Prompts to Uncover LLM Vulnerabilities

An automated framework for scalable red teaming of language models

The Safety Paradox in LLMs

When Models Recognize Danger But Respond Anyway

Exploiting LLM Vulnerabilities

How psychological priming techniques can bypass AI safety measures

Improving Jailbreak Detection in LLMs

A guideline-based framework for evaluating AI security vulnerabilities

Breaking the Guardrails: LLM Security Testing

How TurboFuzzLLM efficiently discovers vulnerabilities in AI safety systems

Essence-Driven Defense Against LLM Jailbreaks

Moving beyond surface patterns to protect AI systems

The FITD Jailbreak Attack

How psychological principles enable new LLM security vulnerabilities

The Vulnerable Depths of AI Models

How lower layers in LLMs create security vulnerabilities

Flowchart-Based Security Exploit in Vision-Language Models

Novel attack vectors bypass safety guardrails in leading LVLMs

Bypassing LLM Safety Guardrails

How Adversarial Metaphors Create New Security Vulnerabilities

Defending LLMs Against Multi-turn Attacks

A Control Theory Approach to LLM Security

The Safety-Reasoning Tradeoff

How Safety Alignment Impacts Large Reasoning Models

Breaking the Safety Guardrails

How Language Models Can Bypass Security in Text-to-Image Systems

Cross-Model Jailbreak Attacks

Improving attack transferability through constraint removal

The Hidden Power of 'Gibberish'

Why LLMs understanding unnatural language is a feature, not a bug

Detecting LLM Jailbreaks through Geometry

A novel defense framework against adversarial prompts

Fortifying LLM Defenses

A dual-objective approach to prevent jailbreak attacks

Fortifying Multimodal LLMs Against Attacks

First-of-its-kind adversarial training to defend against jailbreak attempts

From Complex to One-Shot: Streamlining LLM Attacks

Converting multi-turn jailbreak attacks into efficient single prompts

Beyond Refusal: A Smarter Approach to LLM Safety

Teaching AI to explain safety decisions, not just say no

Securing Multimodal AI Systems

A Probabilistic Approach to Detecting and Preventing Jailbreak Attacks

Exploiting Dialogue History for LLM Attacks

How attackers can manipulate conversation context to bypass safety measures

Strengthening LLM Safety Through Backtracking

A novel safety mechanism that intercepts harmful content after generation begins

Jailbreaking LLMs Through Fuzzing

A new efficient approach to detecting AI security vulnerabilities

Breaking Through LLM Defenses

How Tree Search Creates Sophisticated Multi-Turn Attacks

Certified Defense for Vision-Language Models

A new framework to protect AI models from visual jailbreak attacks

Progressive Defense Against Jailbreak Attacks

A novel approach to dynamically detoxify LLM responses

Defending AI Vision Systems Against Attacks

A tit-for-tat approach to protect multimodal AI from visual jailbreak attempts

Breaking Through AI Guardrails

New Method Efficiently Bypasses LVLM Safety Mechanisms

MirrorGuard: Adaptive Defense Against LLM Jailbreaks

Using entropy-guided mirror prompts to protect language models

Sentinel Shield for LLM Security

Real-time jailbreak detection with a single-token approach

Bypassing LLM Code Safety Guardrails

How implicit malicious prompts can trick AI code generators

Systematic Jailbreaking of LLMs

How iterative prompting can bypass AI safety guardrails

The JOOD Attack: Fooling AI Guardrails

How Out-of-Distribution Strategies Can Bypass LLM Security

Bypassing LLM Safety Guardrails

How structured output constraints can be weaponized as attack vectors

The Security-Usability Dilemma in LLM Guardrails

Evaluating the tradeoffs between protection and functionality

Breaking Through Linguistic Barriers

How Multilingual and Accent Variations Compromise Audio LLM Security

PiCo: Breaking Through MLLM Security Barriers

A progressive approach to bypassing defenses in multimodal AI systems

Defending LLMs Against Jailbreak Attacks

A Lightweight Token Distribution Approach for Enhanced Security

Securing LLMs Against Harmful Outputs

A Novel Representation Bending Approach for Enhanced Safety

Securing the Guardians: The LLM Security Evolution

Analyzing jailbreak vulnerabilities and defense strategies in LLMs

Mapping LLM Vulnerabilities

A Systematic Classification of Jailbreak Attack Vectors

The Sugar-Coated Poison Attack

How benign content can unlock dangerous LLM behaviors

Humor as a Security Threat

How jokes can bypass LLM safety guardrails

Strengthening LLM Safety Guardrails

A reasoning-based approach to balance safety and usability

AdaSteer: Adaptive Defense Against LLM Jailbreaks

Dynamic activation steering for stronger LLM security with fewer false positives
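
AdaSteer sits in the activation-steering family of defenses: at inference time, a refusal-aligned direction is added to a layer's hidden states, and the steering strength is adapted to the input so benign queries are left mostly untouched. The sketch below shows only the generic steering hook on a Llama-style decoder layer; the adaptive coefficient rule, layer index, and attribute paths are assumptions, not the paper's implementation.

```python
import torch

def attach_steering_hook(decoder_layer: torch.nn.Module,
                         direction: torch.Tensor,
                         alpha: float):
    """Add `alpha * direction` to the hidden states produced by one decoder layer.
    `direction` is a (d_model,) unit vector, e.g. an estimated refusal direction."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(device=hidden.device, dtype=hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return decoder_layer.register_forward_hook(hook)

# Hypothetical usage with a HuggingFace Llama-style model:
#   handle = attach_steering_hook(model.model.layers[14], refusal_dir, alpha=4.0)
#   outputs = model.generate(**inputs)
#   handle.remove()
```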

Securing Reasoning Models Without Sacrificing Intelligence

Balancing safety and reasoning capability in DeepSeek-R1

The Cost of Bypassing AI Guardrails

Measuring the 'Jailbreak Tax' on Large Language Models

Vulnerabilities in LLM Security Guardrails

New techniques bypass leading prompt injection protections

Graph-Based Jailbreak Attacks on LLMs

A systematic approach to identifying security vulnerabilities in AI safeguards

Key Takeaways

Summary of Research on Jailbreaking Attacks and Defense Mechanisms