Privacy-Preserving LLM Alignment

Steering language models safely with differential privacy

This research introduces Private Steering for LLM Alignment (PSA), a novel algorithm that enables safer model alignment while protecting sensitive data.

  • Combines activation editing with differential privacy to prevent leakage of alignment demonstrations (see the sketch after this list)
  • Demonstrates effectiveness in reducing harmful outputs while maintaining performance
  • Develops Membership Inference Attacks to evaluate privacy risks in alignment techniques
  • Achieves 95% of non-private steering performance with strong privacy guarantees
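To make the first bullet concrete, below is a minimal, hypothetical sketch of how a steering vector could be privatized: per-demonstration activation differences are clipped, averaged, and perturbed with Gaussian-mechanism noise calibrated to (ε, δ) before the vector is used for activation editing. The function name, shapes, and noise calibration here are illustrative assumptions, not the paper's exact PSA implementation.

```python
import numpy as np

def private_steering_vector(pos_acts, neg_acts, clip_norm=1.0,
                            epsilon=1.0, delta=1e-5, rng=None):
    """Hypothetical sketch: privatize a contrastive steering vector.

    pos_acts, neg_acts: arrays of shape (n_demos, hidden_dim) holding
    layer activations for positive / negative demonstrations.
    """
    rng = rng or np.random.default_rng()
    diffs = pos_acts - neg_acts  # per-demonstration contrast
    # Clip each demonstration's contribution to bound its L2 sensitivity.
    norms = np.linalg.norm(diffs, axis=1, keepdims=True)
    diffs = diffs * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    mean_diff = diffs.mean(axis=0)
    # Gaussian-mechanism noise for the clipped mean:
    # replacing one demonstration shifts the mean by at most 2*clip_norm/n.
    n = diffs.shape[0]
    sensitivity = 2.0 * clip_norm / n
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return mean_diff + rng.normal(0.0, sigma, size=mean_diff.shape)

# Toy usage with random arrays standing in for real LLM hidden states.
pos = np.random.randn(64, 4096)
neg = np.random.randn(64, 4096)
steering_vector = private_steering_vector(pos, neg, epsilon=2.0)
```

The noisy vector can then be added to the model's residual-stream activations at inference time, so the released steering direction, rather than the raw demonstrations, is what carries the privacy guarantee.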

Why it matters: As organizations fine-tune LLMs for specific use cases, PSA provides a framework to align models with human values while preventing attackers from extracting sensitive demonstration data, a critical advance for deploying LLMs in security-sensitive environments.

Differentially Private Steering for Large Language Model Alignment
