Accelerating LLM Inference with ProMoE

Optimizing MoE model serving through proactive expert caching

ProMoE introduces a novel approach for efficient Mixture-of-Experts (MoE) model serving that reduces GPU memory requirements without sacrificing performance.

  • Achieves 34-47% lower latency than reactive caching approaches by predicting which experts will be needed in advance (sketched in the code after this list)
  • Implements a two-level expert cache design with lightweight reinforcement learning for prediction
  • Enables parallel processing of input tokens for faster inference while maintaining accuracy
  • Supports efficient deployment on edge devices with limited GPU memory
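The proactive caching idea can be pictured as a GPU-side expert cache fed by a predictor: a minimal Python sketch is shown below, assuming expert weights are kept in pinned host memory and copied to the GPU on a side CUDA stream so transfers overlap with computation. The class and method names (`ProactiveExpertCache`, `prefetch`, `get`) and the abstract `predicted_keys` input are illustrative assumptions, not ProMoE's actual API.

```python
import torch


class ProactiveExpertCache:
    """Minimal sketch of proactive expert caching (hypothetical API, not
    ProMoE's implementation). Experts live in pinned CPU memory; a predictor
    guesses which experts upcoming layers will route to, and those weights
    are copied to the GPU on a side stream ahead of time. Synchronization
    and eviction details are simplified for clarity."""

    def __init__(self, cpu_experts, capacity, device="cuda"):
        self.cpu_experts = cpu_experts      # {(layer, expert_id): pinned CPU tensor}
        self.capacity = capacity            # max experts resident on the GPU
        self.gpu_cache = {}                 # (layer, expert_id) -> GPU tensor
        self.lru = []                       # least-recently-used eviction order
        self.stream = torch.cuda.Stream()   # side stream for prefetch copies
        self.device = device

    def prefetch(self, predicted_keys):
        """Asynchronously copy predicted experts to the GPU before they are used."""
        with torch.cuda.stream(self.stream):
            for key in predicted_keys:
                if key in self.gpu_cache:
                    self._touch(key)
                    continue
                self._evict_if_full()
                self.gpu_cache[key] = self.cpu_experts[key].to(
                    self.device, non_blocking=True)
                self.lru.append(key)

    def get(self, key):
        """Return an expert's GPU weights; fall back to a blocking copy on a miss."""
        torch.cuda.current_stream().wait_stream(self.stream)
        if key not in self.gpu_cache:       # miss: reactive (slow) path
            self._evict_if_full()
            self.gpu_cache[key] = self.cpu_experts[key].to(self.device)
            self.lru.append(key)
        self._touch(key)
        return self.gpu_cache[key]

    def _touch(self, key):
        self.lru.remove(key)
        self.lru.append(key)

    def _evict_if_full(self):
        while len(self.gpu_cache) >= self.capacity:
            victim = self.lru.pop(0)
            del self.gpu_cache[victim]
```

In a real serving loop, `prefetch` would be driven by a lightweight learned predictor running ahead of the gating network, so expert transfers overlap with the current layer's computation instead of stalling it; on a hit, `get` returns immediately, and only mispredictions pay the blocking copy.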

This engineering advancement significantly improves the practical deployment of large MoE models in resource-constrained environments, making powerful AI more accessible for real-time applications.

ProMoE: Fast MoE-based LLM Serving using Proactive Caching
