Detecting and Mitigating Harmful Content

Research on identifying and preventing harmful or malicious content generated by or input to LLMs

Combating Hate Speech with Multi-Task Learning

Improving generalization for detecting harmful content targeting political figures

Fighting Bias in AI Language Models

A toolkit to detect and mitigate prediction biases in LLMs

Profiling Hate Speech Spreaders Online

Understanding who reshares hate content on social media

Meme Safety for AI Systems

Evaluating how multimodal models respond to harmful meme content

ChatGPT vs. Spam: New Frontiers in Email Security

Leveraging LLMs to enhance spam detection accuracy

Securing AI Art Generation

A robust approach to filtering harmful concepts in text-to-image models

Combat Fake News with LLMs

Using the Defense Among Competing Wisdom framework to explain detection results

ProFS: A Better Way to Reduce LLM Toxicity

Making LLMs safer through targeted model editing

S-Eval: Securing the Future of LLMs

A systematic framework for comprehensive LLM safety evaluation

AI Threats to Election Integrity

Mapping GenAI's dangerous potential in electoral manipulation

The Evolution of Deepfake Detection

From Single-Modal to Multi-Modal Approaches for Enhanced Security

Detecting Mixed Reality Fakes

First benchmark for multimodal misinformation from multiple sources

Fact-Checking Our AI Visionaries

A new benchmark for evaluating factuality in multimodal AI systems

Combating Fake News with AI Intelligence

A Knowledge-guided Framework for Few-shot Detection

Harnessing LVLMs for Fake News Detection

How Vision-Language Models Outperform in Multimodal Misinformation Classification

Smart Danmaku Moderation for Video Platforms

Using LLMs and Impact Captions to Reduce Toxic Comments

Combating Fake News with Adaptive AI

Using dynamic analysis to detect misinformation across platforms

Bypassing LLM Safety Filters

Critical vulnerability found in safety evaluation metrics

Measuring LLM Safety Through Offensive Content Progression

A new approach to benchmarking model sensitivity to harmful content

Human-Guided Data Exploration

Collaborative AI Safety Through Interactive Data Augmentation

Lightweight Security for LLMs

Making AI safety models smaller without sacrificing performance

Proactive LLM Safety Auditing

A novel approach for detecting catastrophic AI responses before they cause harm

Reliable Guardrails for LLMs

Improving calibration of content moderation systems

Small Models, Big Security Impact

How fine-tuned SLMs outperform larger models in content moderation

Beyond Binary: Fine-Grained LLM Content Detection

Recognizing the spectrum of human-AI collaboration in text

Fighting Misinformation with Smarter Detection

A Knowledge-Based Approach Using Annotator Reliability

Detecting Out-of-Scope Questions in LLMs

New resources to prevent AI hallucinations on questions that seem relevant but fall out of scope

Fooling LLM Detectors: The Proxy Attack Strategy

How simple attacks can make AI-generated text pass as human-written

AI-Generated Fake News in the LLM Era

Assessing human and AI detection capabilities against LLM-crafted misinformation

Real-Time LLM Detection

A New Betting-Based Approach to Identify AI-Generated Content as it Arrives

The Reality Gap in LLM Text Detection

Why current detectors fail in real-world scenarios

Crossing Cultural Boundaries in Hate Speech Detection

First multimodal, multilingual hate speech dataset with multicultural annotations

Combating Online Hate with AI

Leveraging GPT-3.5 Turbo to detect and mitigate hate speech on X (Twitter)

Smarter Toxicity Detection in Memes

Using Knowledge Distillation and Infusion to Combat Online Toxicity

Developing Smarter LLM Guardrails

A flexible methodology for detecting off-topic prompts without real-world data

Fighting Rumors with Network Science

A Novel Epidemic-Inspired Approach to Rumor Detection

Smarter Security Testing for AI Image Generators

Using LLMs to systematically find vulnerabilities in text-to-image models

Securing LLMs Against Harmful Content

A Dynamic Filtering Approach Without Retraining

Uncovering Hidden Messages Online

Novel methodology for detecting coded 'dog whistles' in digital content

Combating Evolving Toxic Content

Adaptable Detection Systems for Security in LLMs

Combating Organized Disinformation

Network-informed prompt engineering for detecting astroturfing in social media

Unmasking Deceptive UI Designs

Automated Detection of Manipulation in User Interfaces

Harnessing LLMs to Combat Climate Misinformation

Using AI with human oversight to detect false climate claims

HateBench: Evaluating Hate Speech Detection Against LLM Threats

First comprehensive benchmark for testing detectors against LLM-generated hate content

Modeling Self-Destructive Reasoning in LLMs

A mathematical framework for tracking toxicity amplification in language models

Emotion-Aware Cyberbullying Detection

Beyond Binary Classification: Detecting Harassment and Defamation

Securing AI: Advanced Safety Testing for LLMs

Automating comprehensive safety evaluation with ASTRAL

The Dark Side of Persuasion

How LLMs Use Personalization and False Statistics to Change Minds

Automating Counterspeech Evaluation

A novel framework for measuring effectiveness in combating hate speech

Safeguarding AI: Pre-Deployment Testing of LLMs

Insights from External Safety Evaluation of OpenAI's o3-mini Model

Divergent Emotional Patterns in Disinformation on Social Med...

By Iván Arcos, Paolo Rosso...

Detecting LLM-Laundered Fake News

How AI-paraphrased misinformation evades current detection systems

Challenges and Innovations in LLM-Powered Fake News Detectio...

By Jingyuan Yi, Zeqiu Xu...

SafeSwitch: Smarter AI Safety Controls

Using internal activation patterns to regulate LLM behavior without sacrificing capability

Almost Surely Safe Alignment of Large Language Models at Inf...

By Xiaotong Ji, Shyam Sundhar Ramesh...

Evaluating the Blind Spots of LLM Safety

Can larger models accurately detect harmful outputs from smaller ones?

Media Bias Detector: Designing and Implementing a Tool for R...

By Jenny S Wang, Samar Haider...

LLMs and Annotation Disagreement

How AI handles ambiguity in offensive language detection

Adaptive Prompting: Ad-hoc Prompt Composition for Social Bia...

By Maximilian Spliethöver, Tim Knebler...

Balancing AI Safety and Scientific Freedom

A benchmark for evaluating LLM safety mechanisms against dual-use risks

Intent-Aware Repair for Safer LLMs

Precision-targeting toxic behaviors without compromising model capabilities

Securing LLM Interactions: The Guardrail Approach

A comprehensive safety pipeline for trustworthy AI interactions

Beyond Binary: Tackling Hate Speech Detection Challenges

Innovative approaches to handle annotator disagreement in content moderation

FLAME: Flexible LLM-Assisted Moderation Engine

By Ivan Bakulin, Ilia Kopanichuk...

Demystifying Hateful Content: Leveraging Large Multimodal Mo...

By Ming Shan Hee, Roy Ka-Wei Lee

VLDBench: Vision Language Models Disinformation Detection Be...

By Shaina Raza, Ashmal Vayani...

Unlocking LLM Transparency with Sparse Autoencoders

Optimizing interpretable features for critical classifications

Combating Misinformation with AI

Introducing HintsOfTruth: A Multimodal Dataset for Checkworthiness Detection

Securing LLMs Against Unsafe Prompts

A Novel Gradient Analysis Approach with Minimal Reference Data

AI-Powered Deception Detection in Negotiations

Using Game Theory to Unmask 'Too Good to Be True' Offers

Smart Safety Guardrails for LLMs

Optimizing security without compromising performance

Combating Hateful Memes with Advanced AI

Breakthrough in fine-tuning multimodal models for online safety

ThinkGuard: Deliberative Safety for LLMs

Enhancing AI guardrails through slow, deliberative thinking

Advancing Extreme Speech Detection

Comparing Open-Source and Proprietary LLMs for Content Moderation

Red Flag Tokens: A New Approach to LLM Safety

Enhancing harmfulness detection without compromising model capabilities

Explainable Propaganda Detection

Using LLMs to detect and explain propagandistic content

MemeIntel: Smart Detection of Harmful Memes

AI-powered explainable detection system for propaganda and hate speech

Detecting Disguised Toxic Content with AI

Using LLMs to Extract Effective Search Queries

Geographically-Aware Hate Speech Detection

Evaluating LLMs for culturally contextualized content moderation

Controlling Toxic AI Outputs

A statistical approach to safer large language models

Balancing Safety and Utility in AI Role-Playing

New frameworks for managing dangerous content in character simulations

Safety Risks in LLM Role-Play

Uncovering & addressing security vulnerabilities when LLMs assume character roles

Fighting Fake News with AI

Comparing LLM-based strategies for misinformation detection

Uncovering Hidden Toxicity in Language

A novel approach to detecting implicit harmful content in LLMs

Filtering Harm in LLM Training Data

Evaluating safety strategies and their implications for vulnerable groups

SafeSpeech: Detecting Toxicity Across Conversations

Beyond message-level analysis to context-aware toxic language detection

Smarter Content Moderation for LLMs

Risk-level assessment for safer AI platforms

AI-Powered Violence Detection in Historical Texts

Large Language Models Automate Analysis of Ancient Violence

Evaluating AI Models for Inclusive Computing Language

Benchmarking LLMs' ability to detect harmful technical terminology

Protecting Children in the Age of AI

New benchmark for evaluating LLM content risks for minors

Simulating Moderation at Scale

Using LLMs to evaluate online content moderation strategies

Securing Code LLMs Against Harmful Content

A novel automated framework for robust content moderation in code generation

Fixing False Refusals in AI Safety

Making LLMs smarter about when to say 'no'

Detecting Coded Islamophobia Online

Using LLMs to Identify and Analyze Extremist Language

Harnessing LLMs for Bug Report Analysis

Using AI to extract failure-inducing inputs from natural language bug reports

Breaking Language Barriers in Content Moderation

Adapting LLMs for Low-Resource Languages: The Sinhala Case Study

Enhanced Harmful Content Detection

Combining LLMs with Knowledge Graphs for Safer AI Systems

Multi-label Hate Speech Detection

Advancing beyond binary classification for more effective content moderation

Protecting AI Across Languages

A Multilingual Approach to LLM Content Moderation

Defending Against LLM Jailbreaks

A Feature-Aware Approach to Detecting and Mitigating Harmful Outputs

Detecting Sexism Across Modalities

First multimodal Spanish dataset for sexism detection in social media videos

Smarter Hate Speech Detection

Using Selective Examples to Identify Subtle Harmful Content

Key Takeaways

Summary of Research on Detecting and Mitigating Harmful Content