
Consistency Matters: LLMs in Sequential Interactions
New metrics and methods to ensure reliable AI responses over multiple turns
This research introduces a comprehensive framework for evaluating and improving LLM response consistency across multiple interaction rounds.
- Proposes a novel Position-Weighted Consistency (PWC) score that prioritizes early-stage stability
- Evaluates LLMs such as ChatGPT, Claude, and Llama across 20+ tasks in security, medical, and support domains
- Provides practical techniques to enhance consistency without sacrificing response quality
- Identifies critical factors affecting consistency: prompt design, model size, and reasoning approaches
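To make the PWC idea concrete, here is a minimal sketch of how a position-weighted consistency score could be computed. The exact formula and weighting scheme are not given in this summary; the function name, the exponential `decay` parameter, and the per-round consistency inputs are all illustrative assumptions, not the authors' definition.

```python
def pwc_score(consistency_per_round, decay=0.8):
    """Hypothetical Position-Weighted Consistency sketch: a weighted
    mean of per-round consistency values in [0, 1], where round t gets
    weight decay**t so early-stage stability counts more."""
    weights = [decay ** t for t in range(len(consistency_per_round))]
    total = sum(w * c for w, c in zip(weights, consistency_per_round))
    return total / sum(weights)

# A model that flips its answer early is penalized more heavily
# than one that flips late:
early_flip = [0.0, 1.0, 1.0, 1.0]
late_flip = [1.0, 1.0, 1.0, 0.0]
print(pwc_score(early_flip) < pwc_score(late_flip))  # True
```

The decaying weights capture the paper's stated emphasis on early-stage stability: disagreement in the first rounds lowers the score more than the same disagreement later on.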
Security Implications: In high-stakes security applications, consistent and predictable AI responses are essential to maintaining user trust and system reliability. The framework offers practical tools for evaluating LLMs before deployment in sensitive contexts.
Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions