Filtering Harm in LLM Training Data

Evaluating safety strategies and their implications for vulnerable groups

This research evaluates data filtering strategies used to remove harmful content from LLM pretraining datasets, with special attention to impacts on vulnerable communities.

  • Systematically benchmarks current harm reduction filtering approaches
  • Examines unintended consequences for representation of marginalized groups
  • Identifies gaps between filtering intentions and actual effectiveness
  • Recommends more nuanced approaches that balance safety and inclusion

Why it matters: As LLMs become increasingly integrated into critical systems, understanding how safety filters might inadvertently reinforce biases or exclude legitimate content is essential for developing more equitable AI.
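
To make the over-filtering concern concrete, here is a minimal sketch (not taken from the survey; the blocklist terms and sample documents are hypothetical) of a naive term-based filter of the kind often applied to pretraining corpora. It illustrates how surface-level matching can discard benign text about marginalized communities alongside genuinely harmful content.

```python
# Illustrative sketch only: a naive blocklist filter over pretraining documents.
# The blocklist and documents below are hypothetical, chosen to show how
# term matching can remove legitimate content about marginalized groups.

BLOCKLIST = {"slur_a", "slur_b", "queer", "disabled"}  # hypothetical term list


def passes_filter(document: str, blocklist: set[str] = BLOCKLIST) -> bool:
    """Return True if the document contains none of the blocked terms."""
    tokens = {token.strip(".,!?").lower() for token in document.split()}
    return tokens.isdisjoint(blocklist)


documents = [
    "A history of queer community organizing in the 1980s.",  # benign, removed
    "Accessibility guides written by disabled authors.",      # benign, removed
    "The weather was pleasant and the trains ran on time.",   # benign, kept
]

for doc in documents:
    status = "kept" if passes_filter(doc) else "removed"
    print(f"{status}: {doc}")
```

In this sketch, two of the three benign documents are dropped simply because they mention identity terms, which is the kind of gap between filtering intentions and actual effectiveness the research examines.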

What Are They Filtering Out? A Survey of Filtering Strategies for Harm Reduction in Pretraining Datasets
