
Measuring the True Difficulty of Code Tasks
A smarter approach to evaluating LLMs on programming challenges
TaskEval introduces a framework for assessing how well large language models perform on code generation tasks by measuring each task's true difficulty, independent of how its prompt happens to be worded.
- Uses diverse prompts instead of single instructions to evaluate model performance
- Applies Item Response Theory to estimate true task difficulty independent of prompt wording (see the sketch after this list)
- Provides more accurate benchmark comparisons between different LLMs
- Delivers insights on model strengths and weaknesses across varying task complexities
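The sketch below illustrates the general idea behind the IRT step with a Rasch (one-parameter logistic) model: given a binary matrix of pass/fail outcomes, where rows are prompt variants and columns are tasks, jointly estimate a latent ability per prompt and a latent difficulty per task. This is a minimal, hedged example under assumed inputs, not the authors' implementation; the function name `fit_rasch`, the L2 regularization, and the synthetic data are illustrative choices.

```python
# Minimal Rasch-model sketch for estimating task difficulty (not TaskEval's exact code).
# outcomes[i, j] = 1 if the code generated for task j under prompt variant i passed its tests.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid


def fit_rasch(outcomes: np.ndarray, reg: float = 1e-3):
    """Estimate prompt abilities and task difficulties from a 0/1 outcome matrix."""
    n_prompts, n_tasks = outcomes.shape

    def neg_log_likelihood(params):
        abilities = params[:n_prompts]       # theta_i: one latent ability per prompt variant
        difficulties = params[n_prompts:]    # b_j: one latent difficulty per task
        p = expit(abilities[:, None] - difficulties[None, :])
        eps = 1e-9
        nll = -(outcomes * np.log(p + eps) + (1 - outcomes) * np.log(1 - p + eps)).sum()
        # Small L2 penalty pins down the location/scale of the latent traits.
        return nll + reg * (abilities @ abilities + difficulties @ difficulties)

    x0 = np.zeros(n_prompts + n_tasks)
    result = minimize(neg_log_likelihood, x0, method="L-BFGS-B")
    return result.x[:n_prompts], result.x[n_prompts:]


if __name__ == "__main__":
    # Synthetic illustration: 20 prompt variants, 3 tasks of increasing difficulty.
    rng = np.random.default_rng(0)
    true_difficulty = np.array([-1.5, 0.0, 2.0])
    true_ability = rng.normal(size=20)
    pass_prob = expit(true_ability[:, None] - true_difficulty[None, :])
    outcomes = rng.binomial(1, pass_prob)
    _, est_difficulty = fit_rasch(outcomes)
    print("Estimated task difficulties:", np.round(est_difficulty, 2))
```

Because difficulty and ability are estimated jointly, the recovered task difficulties are not tied to any single prompt's phrasing, which is what allows benchmark comparisons to reflect the tasks themselves rather than prompt luck.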
Educational Impact: By quantifying the true difficulty of coding problems, TaskEval supports more meaningful programming assessments in CS education, enabling better learning tools and more accurate student evaluation.
TaskEval: Assessing Difficulty of Code Generation Tasks for Large Language Models