ProFS: A Better Way to Reduce LLM Toxicity

Making LLMs safer through targeted model editing

ProFS is a novel approach that uses model editing to reduce toxicity in Large Language Models without extensive retraining or preference data.
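
To make the idea of projection-based model editing concrete, here is a minimal sketch of one plausible way such an edit could work: estimate a "toxic" subspace from hidden-state differences between paired toxic and non-toxic examples, then project that subspace out of a layer's weight matrix. This is an illustrative approximation, not the authors' implementation; the function name, the `rank` parameter, and the input tensors are all hypothetical.

```python
import torch

def project_out_toxic_subspace(W, toxic_hidden, nontoxic_hidden, rank=2):
    """Illustrative projection edit (assumed ProFS-style, not the official code).

    W:               (d_out, d_in) weight matrix of one layer
    toxic_hidden:    (n, d_out) hidden states for toxic continuations
    nontoxic_hidden: (n, d_out) hidden states for paired non-toxic continuations
    """
    # Difference vectors emphasize what separates toxic from non-toxic text.
    diffs = toxic_hidden - nontoxic_hidden            # (n, d_out)
    diffs = diffs - diffs.mean(dim=0, keepdim=True)   # center to suppress shared context signal

    # Top singular directions of the differences approximate the toxic subspace.
    _, _, Vh = torch.linalg.svd(diffs, full_matrices=False)
    toxic_dirs = Vh[:rank]                            # (rank, d_out), orthonormal rows

    # Projection that removes the toxic subspace: P = I - B^T B.
    eye = torch.eye(W.shape[0], dtype=W.dtype, device=W.device)
    P = eye - toxic_dirs.T @ toxic_dirs

    # Edited weights no longer write into the estimated toxic directions.
    return P @ W
```

In practice the hidden states would be collected from a specific layer of the model, and the edited matrix would be written back into that layer, which is what makes the approach a one-shot edit rather than a training run.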

  • Runs 10x faster than DPO while requiring only 10% of the data
  • Creates more robust models that resist generating toxic content
  • Maintains model performance on safe queries while reducing harmful outputs
  • Offers greater transparency and control compared to traditional preference optimization

This research matters for security teams looking to deploy safer AI systems without sacrificing efficiency or performance. ProFS offers a practical way to address toxicity concerns before deployment.

Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity
