
Improving LLM Reliability in Social Sciences
Applying survey methodology to enhance AI text annotation
This work adapts established survey methodology principles to systematically assess and improve the reliability of Large Language Model (LLM) annotations in social science research.
- Implements three key interventions: option randomization, position randomization, and reverse validation (see the sketch after this list)
- Reveals how traditional accuracy metrics can mask model instabilities, especially in edge cases
- Demonstrates framework effectiveness using the F1000 biomedical dataset
- Provides a structured approach to evaluate LLM annotation reliability beyond simple accuracy metrics
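Below is a minimal Python sketch of what the three interventions could look like in practice. The function names and prompt wording are illustrative assumptions, not the authors' actual implementation, and `query_llm` is a placeholder for whatever LLM API call is used.

```python
import random

def query_llm(prompt: str) -> str:
    """Placeholder for an actual LLM API call (assumption, not a real client)."""
    raise NotImplementedError

def option_randomization(text: str, options: list[str]) -> str:
    """Shuffle the answer options so the model cannot exploit a fixed option order."""
    shuffled = random.sample(options, k=len(options))
    prompt = f"Annotate the text below.\nOptions: {', '.join(shuffled)}\nText: {text}"
    return query_llm(prompt)

def position_randomization(item_a: str, item_b: str, question: str) -> str:
    """Randomize which item appears first in a pairwise prompt to surface position bias."""
    first, second = random.sample([item_a, item_b], k=2)
    prompt = f"{question}\nItem 1: {first}\nItem 2: {second}"
    return query_llm(prompt)

def reverse_validation(text: str, label: str) -> bool:
    """Ask the same question in forward and reversed form; agreement across
    both phrasings signals a more stable (reliable) annotation."""
    forward = query_llm(f"Does the label '{label}' apply to this text? {text}")
    reverse = query_llm(f"Is the label '{label}' inapplicable to this text? {text}")
    return (forward.strip().lower().startswith("yes")
            and reverse.strip().lower().startswith("no"))
```

Repeating each annotation under these perturbations and measuring how often the model's answer changes gives a stability estimate that plain accuracy on a single prompt ordering would not reveal.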
For medical research, this framework enables more reliable LLM-based analysis of biomedical literature and clinical notes by identifying and mitigating biases that traditional evaluation methods might miss.