
Securing LLMs from Toxic Training Data
A Data Attribution Approach to Finding & Filtering Unsafe Content
DABUF offers an efficient method to identify and filter unsafe training data without requiring expensive moderation classifiers.
- Proactive safety approach: data attribution traces unsafe model outputs back to the training examples responsible for them (see the sketch after this list)
- 95% effectiveness at detecting safety issues, at 60× the speed of traditional methods
- Taxonomy-free design adapts to evolving safety concerns without predefined unsafe categories
- Minimal resource requirements make it accessible to a wider range of developers
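
To make the attribution idea concrete, the sketch below shows one common gradient-based influence scheme: each training example is scored by the dot product between its loss gradient and the gradient of the loss on an unsafe model output, and the highest-scoring examples are flagged for removal. This is an illustrative approximation, not DABUF's actual implementation; the function names, the PyTorch framing, and the drop fraction are all assumptions.

```python
# Illustrative sketch of gradient-based data attribution for unsafe-data
# filtering. Function names and the drop fraction are hypothetical.
import torch


def flat_grad(loss, model):
    """Flatten the gradient of `loss` w.r.t. all trainable parameters into one vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])


def score_training_examples(model, loss_fn, train_examples, unsafe_example):
    """Score each training example by how strongly its gradient aligns with
    the gradient of the loss on a known unsafe output. High scores indicate
    examples that reinforce the unsafe behavior."""
    model.eval()
    x_u, y_u = unsafe_example
    g_unsafe = flat_grad(loss_fn(model(x_u), y_u), model)

    scores = []
    for x, y in train_examples:
        g_train = flat_grad(loss_fn(model(x), y), model)
        scores.append(torch.dot(g_train, g_unsafe).item())
    return scores


def select_examples_to_drop(scores, drop_fraction=0.05):
    """Return indices of the examples most aligned with the unsafe gradient."""
    k = max(1, int(len(scores) * drop_fraction))
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

In practice, full per-example gradients for an LLM are too large to materialize directly, so attribution methods typically rely on gradient projections or low-rank approximations; the loop above is only meant to show the scoring and filtering logic.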
This research significantly advances LLM security by enabling developers to efficiently purge harmful content from training data, preventing models from learning toxic behaviors in the first place.
Detecting and Filtering Unsafe Training Data via Data Attribution