
Privacy-Preserving LLM Alignment
Steering language models safely with differential privacy
This research introduces Private Steering for LLM Alignment (PSA), a new algorithm that aligns model behavior through activation editing while protecting the sensitive demonstration data used for steering.
- Combines activation editing with differential privacy to prevent information leakage (see the sketch after this list)
- Reduces harmful outputs while maintaining overall model performance
- Develops Membership Inference Attacks to evaluate privacy risks in alignment techniques
- Achieves 95% of non-private steering performance with strong privacy guarantees
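
To make the first bullet concrete, here is a minimal, hypothetical sketch of the general recipe: compute a steering vector as the mean of clipped per-example activation differences between positive and negative demonstrations, then add Gaussian noise calibrated to the clipping bound. The function name, clipping scheme, and noise calibration below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def private_steering_vector(pos_acts, neg_acts, clip_norm, epsilon, delta, rng=None):
    """Illustrative sketch: a differentially private steering vector.

    pos_acts / neg_acts: per-demonstration activation vectors (e.g. one hidden
    layer's residual stream) for positive and negative behavior examples.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Clip each per-example activation difference so that any single
    # demonstration pair has bounded influence on the final vector.
    diffs = []
    for p, n in zip(pos_acts, neg_acts):
        d = np.asarray(p, dtype=float) - np.asarray(n, dtype=float)
        d *= min(1.0, clip_norm / (np.linalg.norm(d) + 1e-12))
        diffs.append(d)
    diffs = np.stack(diffs)

    # Replacing one demonstration pair changes the mean by at most
    # 2 * clip_norm / m, so that is the L2 sensitivity of the mean.
    m = diffs.shape[0]
    sensitivity = 2.0 * clip_norm / m

    # Gaussian mechanism: noise scale for an (epsilon, delta)-DP release
    # (classical calibration, valid for epsilon <= 1).
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon

    noise = rng.normal(0.0, sigma, size=diffs.shape[1])
    return diffs.mean(axis=0) + noise
```

At inference time, the noisy vector would be added to the model's hidden activations at the chosen layer, just as in non-private activation steering; the clipping step is what makes the Gaussian calibration valid, since it bounds each demonstration's contribution to the released vector.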
Why it matters: As organizations fine-tune LLMs for specific use cases, PSA provides a framework to align models with human values while preventing sensitive demonstration data from being extracted by attackers, a critical advance for deploying LLMs in security-sensitive environments.
Differentially Private Steering for Large Language Model Alignment