
Consistency Matters: LLMs in Sequential Interactions
New metrics and methods to ensure reliable AI responses over multiple turns
This research introduces a comprehensive framework for evaluating and improving LLM response consistency across multiple interaction rounds.
- Proposes a novel Position-Weighted Consistency (PWC) score that prioritizes early-stage stability
- Evaluates LLMs such as ChatGPT, Claude, and Llama across 20+ tasks in security, medical, and support domains
- Provides practical techniques to enhance consistency without sacrificing response quality
- Identifies critical factors affecting consistency: prompt design, model size, and reasoning approaches
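To make the PWC idea concrete, here is a minimal sketch of how a position-weighted consistency score could be computed. The exact formula and weighting scheme are not given in this summary; the function name, the exponential `decay` parameter, and the per-round consistency inputs are all illustrative assumptions, not the authors' definition.

```python
def pwc_score(consistency_per_round, decay=0.8):
    """Hypothetical Position-Weighted Consistency sketch: a weighted
    mean of per-round consistency values in [0, 1], where round t gets
    weight decay**t so early-stage stability counts more."""
    weights = [decay ** t for t in range(len(consistency_per_round))]
    total = sum(w * c for w, c in zip(weights, consistency_per_round))
    return total / sum(weights)

# A model that flips its answer early is penalized more heavily
# than one that flips late:
early_flip = [0.0, 1.0, 1.0, 1.0]
late_flip = [1.0, 1.0, 1.0, 0.0]
print(pwc_score(early_flip) < pwc_score(late_flip))  # True
```

The decaying weights capture the paper's stated emphasis on early-stage stability: disagreement in the first rounds lowers the score more than the same disagreement later on.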
Security Implications: In high-stakes security applications, consistent and predictable AI responses are essential to maintaining user trust and system reliability. The framework offers practical tools for evaluating LLMs before deployment in sensitive contexts.
Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions