Detecting and Mitigating Harmful Content
Research on identifying and preventing harmful or malicious content generated by or input to LLMs

Combating Hate Speech with Multi-Task Learning
Improving generalization for detecting harmful content targeting political figures

Fighting Bias in AI Language Models
A toolkit to detect and mitigate prediction biases in LLMs

Profiling Hate Speech Spreaders Online
Understanding who reshares hate content on social media

Meme Safety for AI Systems
Evaluating how multimodal models respond to harmful meme content

ChatGPT vs. Spam: New Frontiers in Email Security
Leveraging LLMs to enhance spam detection accuracy

Securing AI Art Generation
A robust approach to filtering harmful concepts in text-to-image models

Combating Fake News with LLMs
Using the Defense Among Competing Wisdom framework to explain detection results

ProFS: A Better Way to Reduce LLM Toxicity
Making LLMs safer through targeted model editing

S-Eval: Securing the Future of LLMs
A systematic framework for comprehensive LLM safety evaluation

AI Threats to Election Integrity
Mapping GenAI's dangerous potential for electoral manipulation

The Evolution of Deepfake Detection
From Single-Modal to Multi-Modal Approaches for Enhanced Security

Detecting Mixed Reality Fakes
First benchmark for multimodal misinformation from multiple sources

Fact-Checking Our AI Visionaries
A new benchmark for evaluating factuality in multimodal AI systems

Combating Fake News with AI
A Knowledge-guided Framework for Few-shot Detection

Harnessing LVLMs for Fake News Detection
How Large Vision-Language Models Excel at Multimodal Misinformation Classification

Smart Danmaku Moderation for Video Platforms
Using LLMs and Impact Captions to Reduce Toxic Comments

Combating Fake News with Adaptive AI
Using dynamic analysis to detect misinformation across platforms

Bypassing LLM Safety Filters
Critical vulnerability found in safety evaluation metrics

Measuring LLM Safety Through Offensive Content Progression
A new approach to benchmarking model sensitivity to harmful content

Human-Guided Data Exploration
Collaborative AI Safety Through Interactive Data Augmentation

Lightweight Security for LLMs
Making AI safety models smaller without sacrificing performance

Proactive LLM Safety Auditing
A novel approach for detecting catastrophic AI responses before they cause harm

Reliable Guardrails for LLMs
Improving calibration of content moderation systems

Small Models, Big Security Impact
How fine-tuned SLMs outperform larger models in content moderation

Beyond Binary: Fine-Grained LLM Content Detection
Recognizing the spectrum of human-AI collaboration in text

Fighting Misinformation with Smarter Detection
A Knowledge-Based Approach Using Annotator Reliability

Detecting Out-of-Scope Questions in LLMs
New resources to prevent hallucinations on questions that seem relevant but fall outside the model's scope

Fooling LLM Detectors: The Proxy Attack Strategy
How simple attacks can make AI-generated text pass as human-written

AI-Generated Fake News in the LLM Era
Assessing human and AI detection capabilities against LLM-crafted misinformation

Real-Time LLM Detection
A New Betting-Based Approach to Identify AI-Generated Content as it Arrives

The Reality Gap in LLM Text Detection
Why current detectors fail in real-world scenarios

Crossing Cultural Boundaries in Hate Speech Detection
First multimodal, multilingual hate speech dataset with multicultural annotations

Combating Online Hate with AI
Leveraging GPT-3.5 Turbo to detect and mitigate hate speech on X (Twitter)

Smarter Toxicity Detection in Memes
Using Knowledge Distillation and Infusion to Combat Online Toxicity

Developing Smarter LLM Guardrails
A flexible methodology for detecting off-topic prompts without real-world data

Fighting Rumors with Network Science
A Novel Epidemic-Inspired Approach to Rumor Detection

Smarter Security Testing for AI Image Generators
Using LLMs to systematically find vulnerabilities in text-to-image models

Securing LLMs Against Harmful Content
A Dynamic Filtering Approach Without Retraining

Uncovering Hidden Messages Online
Novel methodology for detecting coded 'dog whistles' in digital content

Combating Evolving Toxic Content
Adaptable Detection Systems for Security in LLMs

Combating Organized Disinformation
Network-informed prompt engineering for detecting astroturfing in social media

Unmasking Deceptive UI Designs
Automated Detection of Manipulation in User Interfaces

Harnessing LLMs to Combat Climate Misinformation
Using AI with human oversight to detect false climate claims

HateBench: Evaluating Hate Speech Detection Against LLM Threats
First comprehensive benchmark for testing detectors against LLM-generated hate content

Modeling Self-Destructive Reasoning in LLMs
A mathematical framework for tracking toxicity amplification in language models

Emotion-Aware Cyberbullying Detection
Beyond Binary Classification: Detecting Harassment and Defamation

Securing AI: Advanced Safety Testing for LLMs
Automating comprehensive safety evaluation with ASTRAL

The Dark Side of Persuasion
How LLMs Use Personalization and False Statistics to Change Minds

Automating Counterspeech Evaluation
A novel framework for measuring effectiveness in combating hate speech

Safeguarding AI: Pre-Deployment Testing of LLMs
Insights from External Safety Evaluation of OpenAI's o3-mini Model

Divergent Emotional Patterns in Disinformation on Social Med...
By Iván Arcos, Paolo Rosso...

Detecting LLM-Laundered Fake News
How AI-paraphrased misinformation evades current detection systems

Challenges and Innovations in LLM-Powered Fake News Detectio...
By Jingyuan Yi, Zeqiu Xu...

SafeSwitch: Smarter AI Safety Controls
Using internal activation patterns to regulate LLM behavior without sacrificing capability

Almost Surely Safe Alignment of Large Language Models at Inf...
By Xiaotong Ji, Shyam Sundhar Ramesh...

Evaluating the Blind Spots of LLM Safety
Can larger models reliably detect harmful outputs from smaller ones?

Media Bias Detector: Designing and Implementing a Tool for R...
By Jenny S Wang, Samar Haider...

LLMs and Annotation Disagreement
How AI handles ambiguity in offensive language detection

Adaptive Prompting: Ad-hoc Prompt Composition for Social Bia...
By Maximilian Spliethöver, Tim Knebler...

Balancing AI Safety and Scientific Freedom
A benchmark for evaluating LLM safety mechanisms against dual-use risks

Intent-Aware Repair for Safer LLMs
Precision-targeting toxic behaviors without compromising model capabilities

Securing LLM Interactions: The Guardrail Approach
A comprehensive safety pipeline for trustworthy AI interactions

Beyond Binary: Tackling Hate Speech Detection Challenges
Innovative approaches to handle annotator disagreement in content moderation

FLAME: Flexible LLM-Assisted Moderation Engine
By Ivan Bakulin, Ilia Kopanichuk...

Demystifying Hateful Content: Leveraging Large Multimodal Mo...
By Ming Shan Hee, Roy Ka-Wei Lee

VLDBench: Vision Language Models Disinformation Detection Be...
By Shaina Raza, Ashmal Vayani...

Unlocking LLM Transparency with Sparse Autoencoders
Optimizing interpretable features for critical classifications

Combating Misinformation with AI
Introducing HintsOfTruth: A Multimodal Dataset for Checkworthiness Detection

Securing LLMs Against Unsafe Prompts
A Novel Gradient Analysis Approach with Minimal Reference Data

AI-Powered Deception Detection in Negotiations
Using Game Theory to Unmask 'Too Good to Be True' Offers

Smart Safety Guardrails for LLMs
Optimizing security without compromising performance

Combating Hateful Memes with Advanced AI
Breakthrough in fine-tuning multimodal models for online safety

ThinkGuard: Deliberative Safety for LLMs
Enhancing AI guardrails through slow, deliberative thinking

Advancing Extreme Speech Detection
Comparing Open-Source vs. Proprietary LLMs for Content Moderation

Red Flag Tokens: A New Approach to LLM Safety
Enhancing harmfulness detection without compromising model capabilities

Explainable Propaganda Detection
Using LLMs to detect and explain propagandistic content

MemeIntel: Smart Detection of Harmful Memes
AI-powered explainable detection system for propaganda and hate speech

Detecting Disguised Toxic Content with AI
Using LLMs to Extract Effective Search Queries

Geographically-Aware Hate Speech Detection
Evaluating LLMs for culturally contextualized content moderation

Controlling Toxic AI Outputs
A statistical approach to safer large language models

Balancing Safety and Utility in AI Role-Playing
New frameworks for managing dangerous content in character simulations

Safety Risks in LLM Role-Play
Uncovering and addressing security vulnerabilities when LLMs assume character roles

Fighting Fake News with AI
Comparing LLM-based strategies for misinformation detection

Uncovering Hidden Toxicity in Language
A novel approach to detecting implicit harmful content in LLMs

Filtering Harm in LLM Training Data
Evaluating safety strategies and their implications for vulnerable groups

SafeSpeech: Detecting Toxicity Across Conversations
Beyond message-level analysis to context-aware toxic language detection

Smarter Content Moderation for LLMs
Risk-level assessment for safer AI platforms

AI-Powered Violence Detection in Historical Texts
Large Language Models Automate the Analysis of Violence in Ancient Texts

Evaluating AI Models for Inclusive Computing Language
Benchmarking LLMs' ability to detect harmful technical terminology

Protecting Children in the Age of AI
New benchmark for evaluating LLM content risks for minors

Simulating Moderation at Scale
Using LLMs to evaluate online content moderation strategies

Securing Code LLMs Against Harmful Content
A novel automated framework for robust content moderation in code generation

Fixing False Refusals in AI Safety
Making LLMs smarter about when to say 'no'

Detecting Coded Islamophobia Online
Using LLMs to Identify and Analyze Extremist Language

Harnessing LLMs for Bug Report Analysis
Using AI to extract failure-inducing inputs from natural language bug reports

Breaking Language Barriers in Content Moderation
Adapting LLMs for Low-Resource Languages: The Sinhala Case Study

Enhanced Harmful Content Detection
Combining LLMs with Knowledge Graphs for Safer AI Systems

Multi-label Hate Speech Detection
Advancing beyond binary classification for more effective content moderation

Protecting AI Across Languages
A Multilingual Approach to LLM Content Moderation

Defending Against LLM Jailbreaks
A Feature-Aware Approach to Detecting and Mitigating Harmful Outputs

Detecting Sexism Across Modalities
First multimodal Spanish dataset for sexism detection in social media videos
