Optimizing LLM Survey Simulations

This research introduces a statistical framework for choosing how many LLM-generated survey responses to use when constructing confidence intervals for human population parameters, so that the intervals remain reliable.

Too many synthetic responses create misleadingly narrow confidence intervals
Too few responses result in excessively wide intervals
The optimal approach balances statistical efficiency with accurate uncertainty quantification
The framework provides mathematically rigorous methods that account for distribution shift between the synthetic and real populations
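
The trade-off in the points above can be illustrated with a small calculation. The sketch below is not the framework's own procedure; it is a textbook normal-approximation interval for a proportion, applied under an assumed distribution shift. The values P_TRUE and P_SYN and the helper names half_width and max_k_covering are hypothetical and chosen only for illustration. It shows that when the synthetic proportion differs from the human one, an interval built from k synthetic responses stops containing the human value once k passes a threshold.

```python
import math


def half_width(p_hat: float, k: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation CI for a proportion from k responses."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / k)


def max_k_covering(p_syn: float, p_true: float, z: float = 1.96) -> int:
    """Largest k for which an interval centered at p_syn still contains p_true.

    Simplification: the interval is centered at the expected synthetic
    proportion, so sampling noise in the center is ignored.
    """
    gap = abs(p_syn - p_true)
    if gap == 0:
        raise ValueError("no shift: the interval contains p_true for every k")
    return math.floor(z**2 * p_syn * (1.0 - p_syn) / gap**2)


# Hypothetical values: the human population answers "yes" 50% of the time,
# while the LLM-simulated population answers "yes" 55% of the time.
P_TRUE, P_SYN = 0.50, 0.55

for k in (50, 200, 380, 1000, 5000):
    hw = half_width(P_SYN, k)
    covers = abs(P_SYN - P_TRUE) <= hw
    print(f"k={k:>5}  half-width={hw:.4f}  contains human value: {covers}")

print("largest k that still covers:", max_k_covering(P_SYN, P_TRUE))
```

With few responses the interval is wide but contains the human value; with many it is narrow and misses it. In this toy setting the shift is known, so the threshold can be computed directly. In practice the shift is unknown, which is why a principled method for selecting the number of synthetic responses is needed.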

For medical researchers, this framework enables more reliable use of LLM-simulated responses in clinical surveys and trials, potentially reducing costs while maintaining statistical validity when human data is scarce or expensive to collect.

Uncertainty Quantification for LLM-Based Survey Simulations