Beyond Greedy Decoding: A Probabilistic Approach to LLMs

Improving security evaluation for unlearning and alignment

This research introduces a probabilistic framework for more accurately evaluating language models, particularly for security-critical applications.

  • Reveals that traditional deterministic evaluations using greedy decoding fail to capture the full output distribution of LLMs
  • Proposes a novel sampling-based evaluation approach that provides more reliable security assessments (see the sketch after this list)
  • Shows this is especially important for unlearning (removing harmful knowledge) and alignment (ensuring model behavior matches human values)
  • Identifies critical gaps in current security evaluation methods that could lead to false confidence in model safety
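
To make the contrast concrete, here is a minimal sketch comparing a single greedy decode against a Monte Carlo estimate over sampled outputs. It is not the paper's implementation: the model name and the `contains_target` check are illustrative placeholders chosen only to show the difference between a one-shot deterministic verdict and an estimated probability over the output distribution.

```python
# Sketch: deterministic (greedy) vs. sampling-based evaluation of one prompt.
# Assumes the Hugging Face `transformers` library; model and check are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

def contains_target(text: str) -> bool:
    """Hypothetical check for information the model was supposed to forget."""
    return "Paris" in text

# Deterministic evaluation: one greedy completion, a single yes/no verdict.
greedy_ids = model.generate(
    **inputs, max_new_tokens=10, do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
greedy_text = tokenizer.decode(greedy_ids[0], skip_special_tokens=True)
print("greedy decode leaks target:", contains_target(greedy_text))

# Probabilistic evaluation: sample many completions and estimate the
# probability that the target still appears anywhere in the output distribution.
num_samples = 100
sampled_ids = model.generate(
    **inputs, max_new_tokens=10, do_sample=True, temperature=1.0,
    num_return_sequences=num_samples,
    pad_token_id=tokenizer.eos_token_id,
)
leak_rate = sum(
    contains_target(tokenizer.decode(seq, skip_special_tokens=True))
    for seq in sampled_ids
) / num_samples
print(f"estimated leak probability over {num_samples} samples: {leak_rate:.2f}")
```

Even when the greedy completion looks safe, the sampled estimate can reveal that the model still produces the removed or harmful content with non-negligible probability, which is exactly the gap the probabilistic evaluation is meant to expose.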

This matters because accurate security evaluation is essential for responsible LLM deployment: it helps identify vulnerabilities that deterministic methods miss and provides a more robust foundation for trust in AI systems.

A Probabilistic Perspective on Unlearning and Alignment for Large Language Models
