Jailbreaking Attacks and Defense Mechanisms
Research exploring vulnerabilities in LLMs through jailbreaking attacks and developing effective defense strategies

Certified Defense Against LLM Attacks
First framework providing guaranteed safety against adversarial prompts

Defending Against Backdoor Attacks in Black-Box LLMs
Using Defensive Demonstrations at Test-Time

JailGuard: Defending Against Prompt-Based Attacks
A Universal Framework to Detect Jailbreaking and Hijacking Attempts

Cracking the Safeguards: Understanding LLM Security Vulnerabilities
A representation engineering approach to jailbreaking attacks

Evaluating Jailbreak Attacks on LLMs
A new framework to assess attack effectiveness rather than just model robustness

Hidden Attacks: The New Threat to AI Safety
Exploiting embedding spaces to bypass safety measures in open-source LLMs

Efficient Attacks on LLM Defenses
Making adversarial attacks 1000x more computationally efficient

The Security Paradox in Advanced LLMs
How stronger reasoning abilities create new vulnerabilities

Securing LLMs Against Jailbreak Attacks
A Novel Defense Strategy Without Fine-Tuning

ImgTrojan: The Visual Backdoor Threat
How a single poisoned image can bypass VLM safety barriers

LLM Judge Systems Under Attack
How JudgeDeceiver Successfully Manipulates AI Evaluation Systems

The Crescendo Attack
A New Multi-Turn Strategy for Bypassing LLM Safety Guardrails

The Myth of Trigger Transferability
Challenging assumptions about adversarial attacks across language models

Enhancing AI Safety: The MoTE Framework
Combining reasoning chains with expert mixtures for better LLM alignment

Accelerating Jailbreak Attacks
How Momentum Optimization Makes LLM Security Vulnerabilities More Exploitable
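
The headline above points at a general phenomenon: momentum-style updates make gradient-guided adversarial searches converge faster and escape flat regions. As a hedged illustration of that idea only, and not the paper's token-level method, the sketch below runs a momentum-iterative gradient attack against a toy differentiable classifier; the model, step sizes, and the continuous "prompt embedding" stand-in are all assumptions for demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy differentiable "safety score": higher output means "looks harmful, block it".
classifier = nn.Sequential(nn.Linear(32, 16), nn.Tanh(), nn.Linear(16, 1))

x = torch.randn(1, 32)        # stand-in for a continuous prompt embedding
eps, alpha, mu, steps = 0.5, 0.05, 0.9, 20
delta = torch.zeros_like(x)   # adversarial perturbation, kept in an L-inf ball
g = torch.zeros_like(x)       # momentum buffer

for _ in range(steps):
    delta.requires_grad_(True)
    loss = classifier(x + delta).sum()      # objective: drive the block score down
    grad, = torch.autograd.grad(loss, delta)
    # Momentum accumulates normalized gradients, so the search keeps moving in a
    # consistent direction instead of oscillating between local gradient signs.
    g = mu * g + grad / grad.abs().mean().clamp_min(1e-12)
    delta = (delta.detach() - alpha * g.sign()).clamp(-eps, eps)

print("score before attack:", classifier(x).item())
print("score after attack :", classifier(x + delta).item())
```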

Breaking Through LLM Defenses
New Optimization Method Reveals Security Vulnerabilities in AI Systems

Supercharging LLM Security Testing
A novel approach for discovering diverse attack vectors

SelfDefend: A Practical Shield for LLMs
Empowering LLMs to protect themselves against diverse jailbreak attacks

Structural Vulnerabilities in LLMs
How uncommon text structures enable powerful jailbreak attacks

Combating Jailbreak Attacks on LLMs
A standardized toolkit for evaluating harmful content generation

Rethinking LLM Jailbreaks
Distinguishing True Security Breaches from Hallucinations

Jailbreaking LLMs: Exploiting Alignment Vulnerabilities
A novel attack method that bypasses security measures in language models

Breaking the Logic Chains in LLMs
A Mathematical Framework for Understanding Rule Subversion

Smart Query Refinement for Safer LLMs
Using Reinforcement Learning to Improve Prompts and Prevent Jailbreaks

Breaking Through LLM Guardrails with SeqAR
Uncovering security vulnerabilities in large language models through sequentially auto-generated jailbreak characters

Exposing Biases in LLMs Through Adversarial Testing
How jailbreak prompts reveal hidden biases in seemingly safe models

Surgical Precision for Safer LLMs
Enhancing AI safety through targeted parameter editing

The Analytical Jailbreak Threat
How LLMs' Reasoning Capabilities Create Security Vulnerabilities

Fortifying LLMs Against Tampering
Developing tamper-resistant safeguards for open-weight language models

Breaking Through LLM Defenses
A new framework for systematically testing AI safety filters

Hidden Threats: The 'Carrier Article' Attack
How sophisticated jailbreak attacks can bypass LLM safety guardrails

Testing the Guardians: LLM Security Coverage
New metrics for detecting jailbreak vulnerabilities in LLMs

Unlocking LLM Security Architecture
Identifying critical 'safety layers' that protect aligned language models

Defending LLMs Against Unsafe Feedback
Securing RLHF systems from harmful manipulation

Breaking the Guardians: LLM Jailbreak Attacks
A new efficient method to test LLM security defenses

Defending LLMs Against Adversarial Attacks
Refusal Feature Adversarial Training (ReFAT) for Enhanced Safety
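
The subtitle above refers to the "refusal feature": a direction in activation space along which refused and answered prompts separate. The sketch below shows the standard difference-of-means construction of such a direction and how it can be ablated from a hidden state, using placeholder NumPy arrays in place of real model activations; the shapes and data are assumptions, and this is not ReFAT's training procedure itself.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden size (placeholder)

# Stand-ins for hidden states collected at one layer of an aligned model:
# rows are prompts, columns are activation dimensions.
H_harmful = rng.normal(0.0, 1.0, size=(200, d)) + 2.0 * rng.normal(size=d)
H_harmless = rng.normal(0.0, 1.0, size=(200, d))

# Difference-of-means "refusal direction".
r = H_harmful.mean(axis=0) - H_harmless.mean(axis=0)
r_hat = r / np.linalg.norm(r)

def ablate_refusal(h: np.ndarray) -> np.ndarray:
    """Remove the component of a hidden state along the refusal direction."""
    return h - (h @ r_hat) * r_hat

h = H_harmful[0]
print("component along refusal dir before:", float(h @ r_hat))
print("component along refusal dir after :", float(ablate_refusal(h) @ r_hat))
```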

Fortifying LLMs Against Jailbreaking Attacks
A data curation approach to enhance model security during customization

Jailbreak Antidote: Smart Defense for LLMs
Balancing safety and utility with minimal performance impact

The Hidden Danger in LLM Safety Mechanisms
How attackers can weaponize false positives in AI safeguards

Breaking the Guards: Advancing Jailbreak Attacks on LLMs
Novel optimization method achieves 20-30% improvement in bypassing safety measures

Securing LLM Fine-Tuning
A Novel Approach to Mitigate Security Risks in Instruction-Tuned Models

Securing LLMs at the Root Level
A Novel Decoding-Level Defense Strategy Against Harmful Outputs

T-Vaccine: Safeguarding LLMs Against Harmful Fine-Tuning
A targeted layer-wise defense approach for enhanced safety alignment

Identifying Safety-Critical Neurons in LLMs
Mapping the attention heads responsible for safety guardrails

Smarter LLM Safety Guardrails
Balancing security and utility in domain-specific contexts

Enhancing LLM Security Testing
Self-tuning models for more effective jailbreak detection

Stealth Attacks on AI Guardrails
New jailbreak vulnerability using benign data mirroring

Self-Hacking VLMs: The IDEATOR Approach
Using AI to discover its own security vulnerabilities

Emoji Attack: Undermining AI Safety Guards
How emojis can bypass safety detection systems in Large Language Models
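
The mechanism described here is character-level obfuscation: inserting emojis inside flagged words so that naive tokenization or keyword matching no longer fires while the request stays readable. The blocklist filter below is a deliberately simplified stand-in for a learned safety classifier, used only to make the failure mode concrete.

```python
# Toy demonstration of why character-level insertions can blind a naive filter.
BLOCKLIST = {"explosive", "malware"}  # placeholder flagged terms

def naive_filter(text: str) -> bool:
    """Return True if the text trips the keyword-based safety check."""
    return any(word in text.lower() for word in BLOCKLIST)

def emoji_obfuscate(word: str, emoji: str = "😊") -> str:
    """Insert an emoji between every character of a word."""
    return emoji.join(word)

prompt = "How do I build an explosive device?"
obfuscated = prompt.replace("explosive", emoji_obfuscate("explosive"))

print(naive_filter(prompt))      # True  -> blocked
print(naive_filter(obfuscated))  # False -> slips past the keyword check
print(obfuscated)
```

Real moderation models are far more robust than a keyword list, but the same insertion trick can still shift their token boundaries, which is the weakness the entry above exploits.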

SQL Injection Jailbreak: A New LLM Security Threat
Exploiting structural vulnerabilities in language models

SequentialBreak: Jailbreaks Hidden in Prompt Sequences
How embedding a harmful request within a chain of benign prompts can fool large language models

Bypassing AI Guardrails with Moralized Deception
Testing how ethical-sounding prompts can trick advanced LLMs into harmful outputs

When Safety Training Falls Short
Testing LLM safety against natural, semantically-related harmful prompts

Guarding LLMs Against Jailbreak Attacks
A proactive defense system that identifies harmful queries before they reach the model

Restoring Safety in Fine-Tuned LLMs
A post-hoc approach to recover safety alignment after fine-tuning

Exploiting Metaphors to Bypass AI Safety
How imaginative language can jailbreak language models

Breaking LLM Safety Guards
A Simple Yet Effective Approach to LLM Jailbreaking

Hidden in Plain Sight: LLM Security Threats
How human-readable adversarial prompts bypass security measures

Defending Against Jailbreak Attacks in LLMs
A novel layer-level approach for enhancing AI safety

Evolving Jailbreak Attacks on LLMs
A more efficient approach through pattern and behavior learning

The Infinite Jailbreak Problem
How advanced LLMs become easier to manipulate through paraphrasing

The Trojan Horse Technique in LLM Security
How harmless story endings can bypass security safeguards

Humor as a Security Shield
Strengthening LLM Defenses Against Injection Attacks

The Scientific Trojan Horse
How scientific language can bypass LLM safety guardrails

Unveiling LLM Safety Mechanisms
Extracting and analyzing safety classifiers to combat jailbreak attacks

Exposing LLM Security Vulnerabilities
A novel approach to understanding and preventing LLM jailbreaking

The Virus Attack: A Critical Security Vulnerability in LLMs
Bypassing Safety Guardrails Through Strategic Fine-tuning

Guarding the Gates: LLM Security Red-Teaming
Detecting and preventing jailbreaking in conversational AI systems

Defending LLMs Against Harmful Fine-tuning
Simple random perturbations outperform complex defenses

Breaking LLM Safeguards with Universal Magic Words
How embedding-based security can be bypassed with simple text patterns

Exploiting LLM Vulnerabilities: The Indiana Jones Method
How inter-model dialogues create nearly perfect jailbreaks

Fortifying AI Defense Systems
Constitutional Classifiers: A New Shield Against Universal Jailbreaks

Proactive Defense Against LLM Jailbreaks
Using Safety Chain-of-Thought (SCoT) to strengthen model security

Blocking LLM Jailbreaks with Smart Defense
Nearly 100% effective method to detect and prevent prompt manipulation attacks

Jailbreaking LLMs at Scale
How Universal Multi-Prompts Improve Attack Efficiency

Exposing the Guardian Shield of AI
A novel technique to detect guardrails in conversational AI systems

Adversarial Reasoning vs. LLM Safeguards
New methodologies to identify and strengthen AI security vulnerabilities

Countering LLM Jailbreak Attacks
A three-pronged defense strategy against many-shot jailbreaking

Rethinking AI Safety with Introspective Reasoning
Moving beyond refusals to safer, more resilient language models

Breaking the Jailbreakers
Enhancing Security Through Attack Transferability Analysis

Defending LLMs Against Jailbreak Attacks
Why shorter adversarial training is surprisingly effective against complex attacks

Everyday Jailbreaks: The Unexpected Security Gap
How simple conversations can bypass LLM safety guardrails

Defending LLMs from Jailbreak Attacks
A novel defense framework based on concept manipulation

When AI Deception Bypasses Safety Guards
How language models can be manipulated to override safety mechanisms

The Jailbreak Paradox
How LLMs Can Become Their Own Security Threats

Exploiting LLM Security Weaknesses
A novel approach to jailbreaking aligned language models

Smarter Jailbreak Attacks on LLMs
Boosting Attack Efficiency with Compliance Refusal Initialization

Reasoning-Enhanced Attacks on LLMs
A novel framework for detecting security vulnerabilities in conversational AI

Hidden Vulnerabilities: The R2J Jailbreak Threat
How rewritten harmful requests evade LLM safety guardrails

Detecting LLM Safety Vulnerabilities
A fine-grained benchmark for multi-turn dialogue safety

Exploiting LLM Vulnerabilities
A new context-coherent jailbreak attack method bypasses safety guardrails

Strengthening LLM Security Against Jailbreaks
A Dynamic Defense Approach with Minimal Performance Impact

Bypassing LLM Safety Guardrails
How structure transformation attacks can compromise even the most secure LLMs

Exploiting Safety Reasoning in LLMs
How chain-of-thought safety mechanisms can be bypassed

Strengthening LLM Defense Against Jailbreaking
Using reasoning abilities to enhance AI safety

Defending Against LLM Jailbreaks
ShieldLearner: A Human-Inspired Defense Strategy

Exploiting LLM Security Vulnerabilities in Structured Outputs
How prefix-tree mechanisms can be manipulated to bypass safety filters

Defending LLMs Against Jailbreaking
Efficient safety retrofitting using Direct Preference Optimization

The Anchored Safety Problem in LLMs
Why safety mechanisms fail in the template region

Breaking the Jailbreakers
Understanding defense mechanisms against harmful AI prompts

Bypassing LLM Safety Measures
New attention manipulation technique creates effective jailbreak attacks

Fortifying LLM Defenses
Systematic evaluation of guardrails against prompt attacks

Real-Time Jailbreak Detection for LLMs
Preventing harmful outputs with single-pass efficiency

Defending LLMs Against Jailbreak Attacks
A Novel Safety-Aware Representation Intervention Approach

Exposing the Mousetrap: Security Risks in AI Reasoning
How advanced reasoning models can be more vulnerable to targeted attacks

Evolving Prompts to Uncover LLM Vulnerabilities
An automated framework for scalable red teaming of language models

The Safety Paradox in LLMs
When Models Recognize Danger But Respond Anyway

Exploiting LLM Vulnerabilities
How psychological priming techniques can bypass AI safety measures

Improving Jailbreak Detection in LLMs
A guideline-based framework for evaluating AI security vulnerabilities

Breaking the Guardrails: LLM Security Testing
How TurboFuzzLLM efficiently discovers vulnerabilities in AI safety systems

Essence-Driven Defense Against LLM Jailbreaks
Moving beyond surface patterns to protect AI systems

The FITD Jailbreak Attack
How psychological principles enable new LLM security vulnerabilities

The Vulnerable Depths of AI Models
How lower layers in LLMs create security vulnerabilities

Flowchart-Based Security Exploit in Vision-Language Models
Novel attack vectors bypass safety guardrails in leading LVLMs

Bypassing LLM Safety Guardrails
How Adversarial Metaphors Create New Security Vulnerabilities

Defending LLMs Against Multi-turn Attacks
A Control Theory Approach to LLM Security

The Safety-Reasoning Tradeoff
How Safety Alignment Impacts Large Reasoning Models

Breaking the Safety Guardrails
How Language Models Can Bypass Security in Text-to-Image Systems

Cross-Model Jailbreak Attacks
Improving attack transferability through constraint removal

The Hidden Power of 'Gibberish'
Why LLMs' ability to understand unnatural language is a feature, not a bug

Detecting LLM Jailbreaks through Geometry
A novel defense framework against adversarial prompts
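
Geometric detection treats jailbreak prompts as outliers in the model's representation space. A common baseline for that framing, sketched below on placeholder embeddings, fits a Gaussian to hidden states of benign prompts and flags inputs whose Mahalanobis distance is unusually large; the embeddings, dimensions, and threshold are assumptions, not the framework's actual detector.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32  # embedding size (placeholder)

# Stand-ins for hidden-state embeddings of known-benign prompts.
benign = rng.normal(0.0, 1.0, size=(500, d))

mu = benign.mean(axis=0)
cov = np.cov(benign, rowvar=False) + 1e-6 * np.eye(d)  # regularized covariance
cov_inv = np.linalg.inv(cov)

def mahalanobis(x: np.ndarray) -> float:
    """Distance of an embedding from the benign distribution."""
    diff = x - mu
    return float(np.sqrt(diff @ cov_inv @ diff))

# Threshold taken from the benign distribution itself (95th percentile).
threshold = np.percentile([mahalanobis(x) for x in benign], 95)

suspicious = rng.normal(3.0, 1.0, size=d)  # embedding far from the benign cluster
print("benign example  :", mahalanobis(benign[0]), "<", threshold)
print("suspicious input:", mahalanobis(suspicious),
      "flagged:", mahalanobis(suspicious) > threshold)
```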

Fortifying LLM Defenses
A dual-objective approach to prevent jailbreak attacks

Fortifying Multimodal LLMs Against Attacks
First-of-its-kind adversarial training to defend against jailbreak attempts

From Complex to One-Shot: Streamlining LLM Attacks
Converting multi-turn jailbreak attacks into efficient single prompts

Beyond Refusal: A Smarter Approach to LLM Safety
Teaching AI to explain safety decisions, not just say no

Securing Multimodal AI Systems
A Probabilistic Approach to Detecting and Preventing Jailbreak Attacks

Exploiting Dialogue History for LLM Attacks
How attackers can manipulate conversation context to bypass safety measures

Strengthening LLM Safety Through Backtracking
A novel safety mechanism that intercepts harmful content after generation begins

Jailbreaking LLMs Through Fuzzing
A new efficient approach to detecting AI security vulnerabilities

Breaking Through LLM Defenses
How Tree Search Creates Sophisticated Multi-Turn Attacks

Certified Defense for Vision-Language Models
A new framework to protect AI models from visual jailbreak attacks

Progressive Defense Against Jailbreak Attacks
A novel approach to dynamically detoxify LLM responses

Defending AI Vision Systems Against Attacks
A tit-for-tat approach to protect multimodal AI from visual jailbreak attempts

Breaking Through AI Guardrails
New Method Efficiently Bypasses LVLM Safety Mechanisms

MirrorGuard: Adaptive Defense Against LLM Jailbreaks
Using entropy-guided mirror prompts to protect language models

Sentinel Shield for LLM Security
Real-time jailbreak detection with a single-token approach

Bypassing LLM Code Safety Guardrails
How implicit malicious prompts can trick AI code generators

Systematic Jailbreaking of LLMs
How iterative prompting can bypass AI safety guardrails

The JOOD Attack: Fooling AI Guardrails
How Out-of-Distribution Strategies Can Bypass LLM Security

Bypassing LLM Safety Guardrails
How structured output constraints can be weaponized as attack vectors

The Security-Usability Dilemma in LLM Guardrails
Evaluating the tradeoffs between protection and functionality

Breaking Through Linguistic Barriers
How Multilingual and Accent Variations Compromise Audio LLM Security

PiCo: Breaking Through MLLM Security Barriers
A progressive approach to bypassing defenses in multimodal AI systems

Defending LLMs Against Jailbreak Attacks
A Lightweight Token Distribution Approach for Enhanced Security

Securing LLMs Against Harmful Outputs
A Novel Representation Bending Approach for Enhanced Safety

Securing the Guardians: The LLM Security Evolution
Analyzing jailbreak vulnerabilities and defense strategies in LLMs

Mapping LLM Vulnerabilities
A Systematic Classification of Jailbreak Attack Vectors

The Sugar-Coated Poison Attack
How benign content can unlock dangerous LLM behaviors

Humor as a Security Threat
How jokes can bypass LLM safety guardrails

Strengthening LLM Safety Guardrails
A reasoning-based approach to balance safety and usability

AdaSteer: Adaptive Defense Against LLM Jailbreaks
Dynamic activation steering for stronger LLM security with fewer false positives
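
Activation steering adds a chosen direction to a layer's hidden states at inference time; making the coefficient depend on the input is what keeps benign prompts largely unaffected. The sketch below shows the mechanics with a toy PyTorch model, a random steering vector, and a crude input-adaptive rule, all of which are placeholders rather than AdaSteer's calibrated coefficients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one block of a language model.
model = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))

# Placeholder steering direction (unit norm).
steer = torch.randn(64)
steer = steer / steer.norm()

def adaptive_steering_hook(module, inputs, output):
    # Strength of the intervention scales with how strongly the activation
    # already aligns with the steering direction (crude input-adaptive rule).
    coeff = (output @ steer).clamp(min=0.0).unsqueeze(-1)
    return output + coeff * steer

handle = model[0].register_forward_hook(adaptive_steering_hook)

x = torch.randn(4, 64)  # stand-in for token activations
print(model(x).shape)   # forward pass runs with steering applied at layer 0
handle.remove()
```

In a real deployment the hook would sit on a transformer layer of the served model and the coefficient would be calibrated so that refusal behavior strengthens only on inputs that look adversarial.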

Securing Reasoning Models Without Sacrificing Intelligence
Balancing safety and reasoning capability in DeepSeek-R1

The Cost of Bypassing AI Guardrails
Measuring the 'Jailbreak Tax' on Large Language Models

Vulnerabilities in LLM Security Guardrails
New techniques bypass leading prompt injection protections
