
ProFS: A Better Way to Reduce LLM Toxicity
Making LLMs safer through targeted model editing
ProFS is a novel model-editing approach that reduces toxicity in Large Language Models without gradient-based retraining and with only a small fraction of the preference data that methods like DPO require.
- Runs roughly 10x faster than DPO while using only about 10% of the preference data
- Produces more robust edits, making models less prone to generating toxic content
- Maintains model performance on safe queries while reducing harmful outputs
- Offers greater transparency and control compared to traditional preference optimization
This research is relevant to security teams looking to deploy safer AI systems without sacrificing efficiency or task performance. ProFS represents a practical way to address toxicity concerns before deployment.
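For readers who want intuition for what "model editing" means here, the sketch below shows one plausible projection-based edit: estimate a low-rank "toxic" subspace from paired embeddings of toxic and non-toxic text, then remove that subspace's component from a chosen weight matrix. The function names, tensor shapes, and choice of edit target are illustrative assumptions, not the paper's exact algorithm; see the paper referenced below for ProFS's actual formulation.

```python
# Illustrative sketch of projection-based toxicity editing (assumptions, not the paper's code).
import torch

def toxic_subspace(toxic_embs: torch.Tensor,
                   nontoxic_embs: torch.Tensor,
                   rank: int = 2) -> torch.Tensor:
    """Estimate a low-rank 'toxic' subspace from paired hidden states.

    toxic_embs, nontoxic_embs: (num_pairs, hidden_dim) embeddings collected from
    a small set of toxic / non-toxic preference pairs (assumed data source).
    """
    diffs = toxic_embs - nontoxic_embs               # directions separating toxic from safe text
    diffs = diffs - diffs.mean(dim=0, keepdim=True)  # center to strip shared context
    # Top right-singular vectors span the dominant toxic directions.
    _, _, vh = torch.linalg.svd(diffs, full_matrices=False)
    return vh[:rank]                                 # (rank, hidden_dim), orthonormal rows

def project_out(weight: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Edit a weight matrix by removing its component along the toxic subspace.

    weight: (hidden_dim, d_in) matrix, e.g. an MLP output projection (assumed target).
    basis:  (rank, hidden_dim) orthonormal rows spanning the toxic subspace.
    """
    projector = basis.T @ basis        # (hidden_dim, hidden_dim) projection onto the subspace
    return weight - projector @ weight # keep only the orthogonal complement

# Usage sketch with toy shapes (hidden_dim=8, 16 pairs), for illustration only.
toxic = torch.randn(16, 8)
nontoxic = torch.randn(16, 8)
basis = toxic_subspace(toxic, nontoxic, rank=2)
w = torch.randn(8, 32)
w_edited = project_out(w, basis)
```

Because the edit is a single closed-form projection applied to existing weights, no gradient updates are needed, which is what makes this style of editing fast and data-efficient compared with preference optimization.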
Paper: Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity