Accelerating MoE Models with Hybrid Computing

Smart CPU-GPU orchestration for faster LLM inference

Fiddler is a system that orchestrates CPU and GPU resources for efficient inference of Mixture-of-Experts (MoE) language models in memory-constrained environments.

  • Reduces inference latency by offloading some experts to CPU memory while keeping others in GPU memory
  • Uses selective preloading that anticipates which experts will be needed next
  • Overlaps execution so that the CPU and GPU process different parts of the model simultaneously
  • Achieves up to 3.7× speedup compared to existing CPU-offloading approaches
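The placement decision behind the bullets above can be sketched as a simple cost comparison. This is a hedged illustration, not Fiddler's actual API: all names and constants (per-expert weight size, PCIe bandwidth, CPU throughput) are assumptions chosen to show why, for single-token decoding, computing a non-resident expert on the CPU can beat transferring its weights to the GPU.

```python
# Hypothetical sketch of Fiddler-style per-expert placement. For each expert
# the router selects, either run it on the GPU (if its weights are resident)
# or compare the PCIe transfer cost against CPU compute cost. Constants are
# illustrative assumptions (fp16 Mixtral-scale expert, 16 GB/s PCIe link,
# 1 TFLOP/s of usable CPU throughput), not measurements from the paper.

def place_experts(selected, resident_on_gpu,
                  weight_bytes=350_000_000,   # ~350 MB per expert (assumed)
                  flops_per_token=7e8,        # ~2 * params FLOPs (assumed)
                  pcie_gbps=16.0, cpu_tflops=1.0):
    """Split the router's chosen experts into a GPU set and a CPU set
    so the two halves can execute concurrently."""
    plan = {"gpu": [], "cpu": []}
    for expert in selected:
        if expert in resident_on_gpu:
            plan["gpu"].append(expert)  # weights already on the GPU
            continue
        # Cost of shipping the weights over PCIe vs. computing in place.
        transfer_ms = weight_bytes / (pcie_gbps * 1e9) * 1e3
        cpu_ms = flops_per_token / (cpu_tflops * 1e12) * 1e3
        plan["cpu" if cpu_ms < transfer_ms else "gpu"].append(expert)
    return plan
```

With these assumed numbers, transferring one expert costs roughly 22 ms while computing it on the CPU costs well under 1 ms, so non-resident experts are routed to the CPU while resident ones stay on the GPU, and both sets can run in parallel.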

This research matters for engineering teams deploying large MoE models on commodity hardware, enabling efficient LLM inference without expensive GPU upgrades.

Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models

13 | 521