
Privacy-Preserving LLM Alignment
Steering language models safely with differential privacy
This research introduces Private Steering for LLM Alignment (PSA), a new algorithm that aligns model behavior through activation editing while protecting the sensitive demonstration data used for steering.
- Combines activation editing with differential privacy to prevent information leakage (see the sketch after this list)
- Reduces harmful outputs while maintaining overall model performance
- Develops Membership Inference Attacks to evaluate privacy risks in alignment techniques
- Achieves 95% of non-private steering performance with strong privacy guarantees
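
To make the first bullet concrete, here is a minimal, hypothetical sketch of the general recipe: compute a steering vector as the mean of clipped per-example activation differences between positive and negative demonstrations, then add Gaussian noise calibrated to the clipping bound. The function name, clipping scheme, and noise calibration below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def private_steering_vector(pos_acts, neg_acts, clip_norm, epsilon, delta, rng=None):
    """Illustrative sketch: a differentially private steering vector.

    pos_acts / neg_acts: per-demonstration activation vectors (e.g. one hidden
    layer's residual stream) for positive and negative behavior examples.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Clip each per-example activation difference so that any single
    # demonstration pair has bounded influence on the final vector.
    diffs = []
    for p, n in zip(pos_acts, neg_acts):
        d = np.asarray(p, dtype=float) - np.asarray(n, dtype=float)
        d *= min(1.0, clip_norm / (np.linalg.norm(d) + 1e-12))
        diffs.append(d)
    diffs = np.stack(diffs)

    # Replacing one demonstration pair changes the mean by at most
    # 2 * clip_norm / m, so that is the L2 sensitivity of the mean.
    m = diffs.shape[0]
    sensitivity = 2.0 * clip_norm / m

    # Gaussian mechanism: noise scale for an (epsilon, delta)-DP release
    # (classical calibration, valid for epsilon <= 1).
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon

    noise = rng.normal(0.0, sigma, size=diffs.shape[1])
    return diffs.mean(axis=0) + noise
```

At inference time, the noisy vector would be added to the model's hidden activations at the chosen layer, just as in non-private activation steering; the clipping step is what makes the Gaussian calibration valid, since it bounds each demonstration's contribution to the released vector.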
Why it matters: As organizations fine-tune LLMs for specific use cases, PSA provides a framework to align models with human values while preventing sensitive demonstration data from being extracted by attackers, a critical advance for deploying LLMs in security-sensitive environments.
Differentially Private Steering for Large Language Model Alignment