Accelerating LLM Decoding with PAPI

A Dynamic PIM-Enabled Architecture for Efficient Token Generation

PAPI introduces a Processing-In-Memory (PIM)-enabled computing system that dynamically schedules compute-bound and memory-bound kernels onto the hardware best suited to each, improving LLM decoding performance.

  • Leverages dynamic parallelism to intelligently distribute workloads across traditional processors and PIM units
  • Achieves up to 2.0× speedup over state-of-the-art architectures by efficiently handling the varying computational needs of LLM decoding
  • Proposes a runtime scheduler that adapts to kernel characteristics during execution (a minimal sketch of this idea follows this list)
  • Provides a practical solution for reducing inference latency in production LLM deployments
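The scheduling idea can be illustrated with a roofline-style dispatch rule: kernels with low arithmetic intensity (FLOPs per byte moved) are memory-bound and benefit from PIM's high internal bandwidth, while high-intensity kernels stay on the compute-centric processor. The sketch below is a simplified illustration under that assumption, not PAPI's actual scheduler; the `Kernel` class, `schedule_kernel` function, and the intensity threshold are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical arithmetic-intensity threshold (FLOPs per byte) below which a
# PIM unit is assumed to be faster than the compute-centric processor.
# PAPI makes this decision from runtime information; this constant is only
# for illustration.
PIM_INTENSITY_THRESHOLD = 4.0


@dataclass
class Kernel:
    """A decoding kernel with runtime-observed cost estimates (illustrative)."""
    name: str
    flops: float        # floating-point operations for this invocation
    bytes_moved: float  # bytes read/written from memory

    @property
    def arithmetic_intensity(self) -> float:
        return self.flops / max(self.bytes_moved, 1.0)


def schedule_kernel(kernel: Kernel) -> str:
    """Route memory-bound kernels to PIM and compute-bound kernels to the
    compute-centric processor, based on arithmetic intensity at runtime."""
    if kernel.arithmetic_intensity < PIM_INTENSITY_THRESHOLD:
        return "PIM"   # e.g. attention over the growing KV cache
    return "xPU"       # e.g. batched fully-connected (GEMM) layers


if __name__ == "__main__":
    kernels = [
        Kernel("attention", flops=2e9, bytes_moved=2e9),  # intensity ~1
        Kernel("fc_layer",  flops=8e9, bytes_moved=5e8),  # intensity ~16
    ]
    for k in kernels:
        print(f"{k.name}: intensity={k.arithmetic_intensity:.1f} "
              f"-> {schedule_kernel(k)}")
```

The intensities in the example reflect why decoding workloads vary at runtime: attention streams the entire, growing KV cache for every generated token, so it stays memory-bound, while fully-connected layers reuse their weights across a batch and become compute-bound as batch size grows.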

This research addresses a critical engineering bottleneck in LLM inference, potentially enabling faster and more efficient AI applications across industries.

PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System
