
ProFS: A Better Way to Reduce LLM Toxicity
Making LLMs safer through targeted model editing
ProFS is a novel model-editing approach that reduces toxicity in Large Language Models without gradient-based retraining and with only a small fraction of the preference data that methods like DPO require.
- Runs roughly 10x faster than DPO while using only about 10% of the preference data
- Produces more robust edits, making models less prone to generating toxic content
- Maintains model performance on safe queries while reducing harmful outputs
- Offers greater transparency and control compared to traditional preference optimization
This research is relevant to security teams looking to deploy safer AI systems without sacrificing efficiency or task performance. ProFS represents a practical way to address toxicity concerns before deployment.
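For readers who want intuition for what "model editing" means here, the sketch below shows one plausible projection-based edit: estimate a low-rank "toxic" subspace from paired embeddings of toxic and non-toxic text, then remove that subspace's component from a chosen weight matrix. The function names, tensor shapes, and choice of edit target are illustrative assumptions, not the paper's exact algorithm; see the paper referenced below for ProFS's actual formulation.

```python
# Illustrative sketch of projection-based toxicity editing (assumptions, not the paper's code).
import torch

def toxic_subspace(toxic_embs: torch.Tensor,
                   nontoxic_embs: torch.Tensor,
                   rank: int = 2) -> torch.Tensor:
    """Estimate a low-rank 'toxic' subspace from paired hidden states.

    toxic_embs, nontoxic_embs: (num_pairs, hidden_dim) embeddings collected from
    a small set of toxic / non-toxic preference pairs (assumed data source).
    """
    diffs = toxic_embs - nontoxic_embs               # directions separating toxic from safe text
    diffs = diffs - diffs.mean(dim=0, keepdim=True)  # center to strip shared context
    # Top right-singular vectors span the dominant toxic directions.
    _, _, vh = torch.linalg.svd(diffs, full_matrices=False)
    return vh[:rank]                                 # (rank, hidden_dim), orthonormal rows

def project_out(weight: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Edit a weight matrix by removing its component along the toxic subspace.

    weight: (hidden_dim, d_in) matrix, e.g. an MLP output projection (assumed target).
    basis:  (rank, hidden_dim) orthonormal rows spanning the toxic subspace.
    """
    projector = basis.T @ basis        # (hidden_dim, hidden_dim) projection onto the subspace
    return weight - projector @ weight # keep only the orthogonal complement

# Usage sketch with toy shapes (hidden_dim=8, 16 pairs), for illustration only.
toxic = torch.randn(16, 8)
nontoxic = torch.randn(16, 8)
basis = toxic_subspace(toxic, nontoxic, rank=2)
w = torch.randn(8, 32)
w_edited = project_out(w, basis)
```

Because the edit is a single closed-form projection applied to existing weights, no gradient updates are needed, which is what makes this style of editing fast and data-efficient compared with preference optimization.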
Paper: Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity