
Accelerating LLM Decoding with PAPI
A Dynamic PIM-Enabled Architecture for Efficient Token Generation
PAPI is a Processing-In-Memory (PIM)-enabled computing system that dynamically schedules compute-bound and memory-bound kernels to accelerate LLM decoding.
- Leverages dynamic parallelism to intelligently distribute workloads across traditional processors and PIM units
- Achieves up to 2.0× speedup over state-of-the-art architectures by matching each decoding kernel to the hardware best suited to its compute or memory demands
- Proposes a runtime scheduler that adapts to kernel characteristics during execution (a simplified sketch of this dispatch idea follows the list)
- Provides a practical solution for reducing inference latency in production LLM deployments
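To make the scheduling idea concrete, below is a minimal Python sketch that routes each decoding kernel to PIM units or to the host processors based on its arithmetic intensity. Everything here is an assumption for illustration: the Kernel class, the dispatch function, and the INTENSITY_THRESHOLD value are hypothetical and do not reflect PAPI's actual interfaces or decision logic, which the paper realizes in hardware at runtime.

```python
from dataclasses import dataclass

# Hypothetical kernel descriptor: names and fields are illustrative, not PAPI's API.
@dataclass
class Kernel:
    name: str
    flops: float        # floating-point operations in this kernel invocation
    bytes_moved: float  # bytes read from / written to memory

    @property
    def arithmetic_intensity(self) -> float:
        return self.flops / self.bytes_moved

# Assumed threshold (FLOPs per byte) separating memory-bound from compute-bound work;
# a real system would derive this from the processors' roofline characteristics.
INTENSITY_THRESHOLD = 10.0

def dispatch(kernel: Kernel) -> str:
    """Route memory-bound kernels to PIM and compute-bound kernels to the host processors."""
    if kernel.arithmetic_intensity < INTENSITY_THRESHOLD:
        return "PIM"   # low intensity: dominated by memory traffic
    return "HOST"      # high intensity: dominated by computation

if __name__ == "__main__":
    # Toy decoding-step kernels with illustrative numbers only.
    kernels = [
        Kernel("attention (KV-cache read)", flops=2e9, bytes_moved=1e9),   # ~2 FLOPs/byte
        Kernel("FFN GEMM (batched)", flops=4e11, bytes_moved=2e9),         # ~200 FLOPs/byte
    ]
    for k in kernels:
        print(f"{k.name}: intensity={k.arithmetic_intensity:.1f} -> {dispatch(k)}")
```

A static threshold like this is only a stand-in: a dynamic scheduler of the kind the paper describes would also react to factors such as batch size and hardware utilization as they change during execution.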
This research addresses a critical engineering bottleneck in LLM inference, potentially enabling faster and more efficient AI applications across industries.