Beyond Greedy Decoding: A Probabilistic Approach to LLMs

Improving security evaluation for unlearning and alignment

This research introduces a probabilistic framework for more accurately evaluating language models, particularly for security-critical applications.

  • Reveals that traditional deterministic evaluations using greedy decoding fail to capture the full output distribution of LLMs
  • Proposes a novel sampling-based evaluation approach that provides more reliable security assessments (see the sketch after this list)
  • Shows this is especially important for unlearning (removing harmful knowledge) and alignment (ensuring model behavior matches human values)
  • Identifies critical gaps in current security evaluation methods that could lead to false confidence in model safety
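
To make the contrast concrete, here is a minimal sketch comparing a single greedy decode against a Monte Carlo estimate over sampled outputs. It is not the paper's implementation: the model name and the `contains_target` check are illustrative placeholders chosen only to show the difference between a one-shot deterministic verdict and an estimated probability over the output distribution.

```python
# Sketch: deterministic (greedy) vs. sampling-based evaluation of one prompt.
# Assumes the Hugging Face `transformers` library; model and check are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

def contains_target(text: str) -> bool:
    """Hypothetical check for information the model was supposed to forget."""
    return "Paris" in text

# Deterministic evaluation: one greedy completion, a single yes/no verdict.
greedy_ids = model.generate(
    **inputs, max_new_tokens=10, do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
greedy_text = tokenizer.decode(greedy_ids[0], skip_special_tokens=True)
print("greedy decode leaks target:", contains_target(greedy_text))

# Probabilistic evaluation: sample many completions and estimate the
# probability that the target still appears anywhere in the output distribution.
num_samples = 100
sampled_ids = model.generate(
    **inputs, max_new_tokens=10, do_sample=True, temperature=1.0,
    num_return_sequences=num_samples,
    pad_token_id=tokenizer.eos_token_id,
)
leak_rate = sum(
    contains_target(tokenizer.decode(seq, skip_special_tokens=True))
    for seq in sampled_ids
) / num_samples
print(f"estimated leak probability over {num_samples} samples: {leak_rate:.2f}")
```

Even when the greedy completion looks safe, the sampled estimate can reveal that the model still produces the removed or harmful content with non-negligible probability, which is exactly the gap the probabilistic evaluation is meant to expose.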

This matters because accurate security evaluation is essential for responsible LLM deployment: it helps identify vulnerabilities that deterministic methods miss and provides a more robust foundation for trust in AI systems.

A Probabilistic Perspective on Unlearning and Alignment for Large Language Models
