Making LLMs More Predictable

Representation Engineering offers a systematic way to detect and edit high-level concepts within large language models, making AI behavior more predictable and safer.

Identifies and manipulates internal representations of concepts like honesty and harmfulness
Uses contrasting input samples to reveal how models encode abstract concepts
Provides methods to edit these representations without full retraining
Creates more tractable and controllable AI systems
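
The contrasting-sample idea in the list above can be illustrated with a toy sketch. It assumes one simple variant, a difference of mean activations between two contrasting sets, which is not necessarily the method any particular paper uses. The activations here are synthetic vectors rather than hidden states from a real model, and every name (toy_activation, concept_direction, steer, alpha) is hypothetical.

```python
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def concept_direction(pos_acts, neg_acts):
    """Unit-length difference between the mean activations of two contrasting sets."""
    mu_pos, mu_neg = mean_vector(pos_acts), mean_vector(neg_acts)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def project(act, direction):
    """Detection: scalar score of an activation along the concept direction."""
    return sum(a * d for a, d in zip(act, direction))

def steer(act, direction, alpha):
    """Editing: shift an activation along the concept direction by alpha."""
    return [a + alpha * d for a, d in zip(act, direction)]

# Synthetic stand-in for hidden states (assumption: in this toy setup the
# concept is encoded along dimension 0, plus small Gaussian noise elsewhere).
rng = random.Random(0)
DIM = 8

def toy_activation(offset):
    vec = [rng.gauss(0.0, 0.1) for _ in range(DIM)]
    vec[0] += offset
    return vec

honest_acts = [toy_activation(+1.0) for _ in range(50)]
dishonest_acts = [toy_activation(-1.0) for _ in range(50)]

direction = concept_direction(honest_acts, dishonest_acts)

sample = dishonest_acts[0]
before = project(sample, direction)
after = project(steer(sample, direction, alpha=2.0), direction)
print(f"score before steering: {before:.3f}, after: {after:.3f}")
```

In a real setting, the two activation sets would come from running a model on paired prompts that differ only in the target concept, and the shift would be applied to the model's hidden states at inference time rather than to standalone vectors.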

This research has direct security implications: it lets developers detect and mitigate potentially harmful behaviors before deployment, reducing unforeseen risks in advanced AI systems.

Representation Engineering for Large-Language Models: Survey and Research Challenges