Accelerating LLMs with Hierarchical Drafting

Faster inference through smart token prediction based on temporal patterns

This research introduces a novel approach that speeds up Large Language Model (LLM) inference without sacrificing output quality or requiring any fine-tuning of the base model.

  • Leverages temporal locality in token generation (recently generated tokens tend to recur) to build a hierarchical drafting system; a minimal sketch follows this list
  • Achieves a 1.7-2.0× speedup over standard autoregressive decoding
  • Remains lossless: verified outputs are identical to the base model's, and no fine-tuning is required
  • Delivers consistent speedups across different tasks and model sizes
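
To make the hierarchy concrete, here is a minimal, self-contained Python sketch of the idea: draft tokens are looked up in sources ordered from highest to lowest temporal locality (current context, past generations, static corpus statistics), then verified against the target model so the output stays lossless. This is an illustrative toy, not the paper's implementation; the names `HierarchicalDrafter`, `speculative_step`, and `target_argmax`, and the use of simple bigram tables as draft sources, are all assumptions made for this example.

```python
from collections import defaultdict

class HierarchicalDrafter:
    """Toy drafter that consults draft sources ordered by temporal
    locality: current context first, then generation history, then
    static corpus statistics."""

    def __init__(self, draft_len=4):
        self.draft_len = draft_len
        # Level 0: bigrams from the current context (highest locality).
        # Level 1: bigrams from past model generations.
        # Level 2: bigrams from a static corpus (lowest locality).
        self.levels = [defaultdict(list) for _ in range(3)]

    def observe(self, tokens, level):
        """Record (token -> next token) pairs into one locality level."""
        table = self.levels[level]
        for a, b in zip(tokens, tokens[1:]):
            table[a].append(b)

    def draft(self, context):
        """Propose up to draft_len tokens, falling back level by level."""
        proposal, last = [], context[-1]
        for _ in range(self.draft_len):
            nxt = next((t[last][-1] for t in self.levels if t[last]), None)
            if nxt is None:
                break
            proposal.append(nxt)
            last = nxt
        return proposal

def speculative_step(target_argmax, drafter, context):
    """One lossless decoding step: accept the longest draft prefix that
    matches the target model's greedy choices, then take one token from
    the target itself, so output equals plain greedy decoding."""
    accepted = []
    for tok in drafter.draft(context):
        if target_argmax(context + accepted) != tok:
            break
        accepted.append(tok)
    accepted.append(target_argmax(context + accepted))  # correction token
    return accepted

# Toy usage: the "target model" deterministically repeats a pattern.
pattern = [1, 2, 3, 1, 2, 3]
target = lambda ctx: pattern[len(ctx) % len(pattern)]
drafter = HierarchicalDrafter()
drafter.observe(pattern, level=2)       # seed the lowest-locality level
ctx = [pattern[0]]
while len(ctx) < 12:
    ctx += speculative_step(target, drafter, ctx)
    drafter.observe(ctx, level=0)       # refresh context-level bigrams
print(ctx)  # identical to what plain greedy decoding would produce
```

Because acceptance checks each drafted token against the target model's own greedy choice, the accelerated output matches standard autoregressive decoding token for token; that verification step is what makes speculative decoding lossless rather than approximate.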

For engineering teams, the method offers immediate practical value: it reduces inference latency in production LLM systems without the computational cost of retraining and without any accuracy tradeoff.

Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding
