Filtering Harm in LLM Training Data

Evaluating safety strategies and their implications for vulnerable groups

This research evaluates data filtering strategies used to remove harmful content from LLM pretraining datasets, with special attention to impacts on vulnerable communities.

  • Systematically benchmarks current harm reduction filtering approaches
  • Examines unintended consequences for representation of marginalized groups
  • Identifies gaps between filtering intentions and actual effectiveness
  • Recommends more nuanced approaches that balance safety and inclusion

Why it matters: As LLMs become increasingly integrated into critical systems, understanding how safety filters might inadvertently reinforce biases or exclude legitimate content is essential for developing more equitable AI.
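
To make the over-filtering concern concrete, here is a minimal sketch (not taken from the survey; the blocklist terms and sample documents are hypothetical) of a naive term-based filter of the kind often applied to pretraining corpora. It illustrates how surface-level matching can discard benign text about marginalized communities alongside genuinely harmful content.

```python
# Illustrative sketch only: a naive blocklist filter over pretraining documents.
# The blocklist and documents below are hypothetical, chosen to show how
# term matching can remove legitimate content about marginalized groups.

BLOCKLIST = {"slur_a", "slur_b", "queer", "disabled"}  # hypothetical term list


def passes_filter(document: str, blocklist: set[str] = BLOCKLIST) -> bool:
    """Return True if the document contains none of the blocked terms."""
    tokens = {token.strip(".,!?").lower() for token in document.split()}
    return tokens.isdisjoint(blocklist)


documents = [
    "A history of queer community organizing in the 1980s.",  # benign, removed
    "Accessibility guides written by disabled authors.",      # benign, removed
    "The weather was pleasant and the trains ran on time.",   # benign, kept
]

for doc in documents:
    status = "kept" if passes_filter(doc) else "removed"
    print(f"{status}: {doc}")
```

In this sketch, two of the three benign documents are dropped simply because they mention identity terms, which is the kind of gap between filtering intentions and actual effectiveness the research examines.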

What Are They Filtering Out? A Survey of Filtering Strategies for Harm Reduction in Pretraining Datasets
