Rethinking LLMs for Type Inference in Code

Evaluating true capabilities beyond memorization

This research reveals significant challenges in how we evaluate LLMs for inferring types in Java code snippets, particularly when using public datasets.

Key findings:

  • LLMs perform significantly worse on a new, uncontaminated dataset (27-38% lower accuracy)
  • Data contamination artificially inflates performance metrics in previous studies
  • Model size correlates with memorization ability rather than true inference capability
  • Type-specific prompting improves performance by 12.9% (a sketch of this idea follows the list)
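
To make the prompting finding concrete, here is a minimal sketch of how a "type-specific" prompt might differ from a generic one. Everything below is an assumption for illustration: the prompt wording, the placeholder model name, and the OpenAI-style client are not the paper's actual prompts or setup.

```python
# Illustrative sketch only: prompt wording, model name, and client choice are
# assumptions, not the prompts or API used in the paper.
from openai import OpenAI  # assumes an OpenAI-compatible client is configured

client = OpenAI()

SNIPPET = """
List<String> names = new ArrayList<>();
Path p = Paths.get("data.csv");
"""

def generic_prompt(snippet: str) -> str:
    # Baseline: ask for types with no task-specific guidance.
    return f"Infer the types used in this Java snippet:\n{snippet}"

def type_specific_prompt(snippet: str) -> str:
    # "Type-specific" framing: constrain the output format and ask for
    # fully qualified names, one per simple type name in the snippet.
    return (
        "You are resolving types in a Java code snippet that may lack imports.\n"
        "For every simple type name used (e.g. List, Path), output its fully "
        "qualified name, one per line, as `Simple -> fully.qualified.Name`.\n"
        f"Snippet:\n{snippet}"
    )

def infer_types(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

print(infer_types(type_specific_prompt(SNIPPET)))
```

The design idea is simply that telling the model exactly what a "type" answer should look like (fully qualified names, fixed output format) gives it less room to answer vaguely than an open-ended request.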

For engineering teams, this underscores the importance of contamination-free benchmark design when evaluating AI code assistants. It also shows how hard it remains to build tools that accurately infer types in unfamiliar code snippets; a minimal example of the underlying task follows.
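
For readers unfamiliar with the task, the sketch below shows what type inference for a code snippet involves and how an exact-match accuracy could be scored. The snippet, the expected mapping, and the scoring function are toy assumptions, not the paper's dataset or metric definition.

```python
# Toy example of snippet-level type inference and an exact-match check.
SNIPPET = """
List<String> lines = Files.readAllLines(Paths.get("in.txt"));
"""

# Ground truth: each simple type name maps to one fully qualified name.
EXPECTED = {
    "List": "java.util.List",
    "Files": "java.nio.file.Files",
    "Paths": "java.nio.file.Paths",
    "String": "java.lang.String",
}

def accuracy(predicted: dict[str, str], expected: dict[str, str]) -> float:
    """Fraction of simple names whose predicted FQN matches the ground truth."""
    correct = sum(1 for name, fqn in expected.items() if predicted.get(name) == fqn)
    return correct / len(expected)

# A model that memorized popular snippets may resolve common JDK types but
# stumble on the rest when the code is genuinely unseen.
predicted = {
    "List": "java.util.List",
    "String": "java.lang.String",
    "Files": "java.io.File",  # wrong guess
}
print(f"exact-match accuracy: {accuracy(predicted, EXPECTED):.2f}")  # 0.50
```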

Beyond Memorization: Evaluating the True Type Inference Capabilities of LLMs for Java Code Snippets
