
Filtering Harm in LLM Training Data
Evaluating safety strategies and their implications for vulnerable groups
This research evaluates the data filtering strategies used to remove harmful content from LLM pretraining datasets, with particular attention to how those filters affect vulnerable communities.
- Systematically benchmarks current harm-reduction filtering approaches
- Examines unintended consequences for the representation of marginalized groups (see the sketch after this list)
- Identifies gaps between the stated goals of filtering and its actual effectiveness
- Recommends more nuanced approaches that balance safety with inclusion
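To make the representation concern concrete, here is a minimal sketch of the kind of audit such a benchmark might run: a toy lexicon-based harm filter applied to a corpus, followed by a comparison of removal rates for documents that do and do not mention identity terms. Everything here is hypothetical and simplified; the blocklist, identity-term list, toy corpus, and function names are illustrative placeholders, and real pipelines typically rely on learned toxicity classifiers rather than keyword matching.

```python
# Hypothetical sketch: a lexicon-based harm filter plus a disparate-removal audit.
# The word lists and corpus below are placeholders, not from the research itself.

BLOCKLIST = {"slur1", "slur2", "attack"}          # stand-in "harm" lexicon
IDENTITY_TERMS = {"gay", "muslim", "disabled"}    # terms marking identity-related content


def is_filtered(doc: str, threshold: int = 1) -> bool:
    """Flag a document for removal if it contains >= `threshold` blocklisted tokens."""
    tokens = doc.lower().split()
    hits = sum(tok in BLOCKLIST for tok in tokens)
    return hits >= threshold


def removal_rates(corpus: list[str]) -> dict[str, float]:
    """Compare removal rates for documents that mention identity terms vs. those that don't."""
    groups: dict[str, list[bool]] = {"mentions_identity": [], "other": []}
    for doc in corpus:
        tokens = set(doc.lower().split())
        key = "mentions_identity" if tokens & IDENTITY_TERMS else "other"
        groups[key].append(is_filtered(doc))
    return {k: (sum(v) / len(v) if v else 0.0) for k, v in groups.items()}


if __name__ == "__main__":
    toy_corpus = [
        "a news report about a disabled athlete",
        "an attack post full of harassment",
        "a recipe blog about sourdough",
        "a personal essay by a gay author about community",
    ]
    print(removal_rates(toy_corpus))
```

A large gap between the two removal rates would indicate the kind of unintended consequence the research examines: content by or about marginalized groups being filtered out at disproportionate rates even when it is not harmful.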
Why it matters: As LLMs become increasingly integrated into critical systems, understanding how safety filters might inadvertently reinforce biases or exclude legitimate content is essential for developing more equitable AI.