
Accelerating MoE Models with Hybrid Computing
Smart CPU-GPU orchestration for faster LLM inference
Fiddler is a system that orchestrates CPU and GPU resources for efficient inference of Mixture-of-Experts (MoE) language models in memory-constrained environments.
- Reduces inference latency by offloading experts to CPU memory while keeping frequently activated experts on the GPU; offloaded experts run directly on the CPU rather than being copied back, avoiding costly weight transfers
- Uses selective preloading that anticipates which experts will be needed next
- Implements an overlapped execution strategy in which the CPU and GPU process different parts of the model simultaneously
- Achieves up to 3.7× speedup compared to existing CPU-offloading approaches
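The orchestration idea in the bullets above can be sketched in a few lines. The following is a toy illustration, not Fiddler's actual implementation: NumPy matrix multiplies stand in for expert MLPs, a random gate stands in for the learned router, and a worker thread stands in for the CPU stream that overlaps with GPU work. All names (`experts`, `on_gpu`, `moe_layer`, the dimensions) are assumptions for the example.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Toy dimensions and expert placement (illustrative only).
D, N_EXPERTS, TOP_K = 8, 4, 2
rng = np.random.default_rng(0)

# Each expert is a single weight matrix here; real MoE experts are small MLPs.
experts = [rng.standard_normal((D, D)) for _ in range(N_EXPERTS)]
on_gpu = {0, 1}  # experts resident in "GPU" memory; the rest live in CPU RAM

def route(x):
    """Pick top-k experts for a token (random gate stands in for a learned one)."""
    logits = rng.standard_normal(N_EXPERTS)
    top = list(np.argsort(logits)[-TOP_K:])
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()
    return top, weights

def run_expert(e, x):
    # Compute where the weights live, instead of moving the weights.
    return experts[e] @ x

def moe_layer(x):
    """Overlap: CPU-resident experts run in a worker thread while the
    'GPU' experts run on the main thread; weighted outputs are merged."""
    top, w = route(x)
    cpu_side = [e for e in top if e not in on_gpu]
    gpu_side = [e for e in top if e in on_gpu]
    with ThreadPoolExecutor(max_workers=1) as pool:
        cpu_fut = pool.submit(lambda: [run_expert(e, x) for e in cpu_side])
        gpu_out = [run_expert(e, x) for e in gpu_side]  # overlapped with CPU work
        cpu_out = cpu_fut.result()
    outs = dict(zip(gpu_side + cpu_side, gpu_out + cpu_out))
    return sum(wi * outs[e] for e, wi in zip(top, w))

y = moe_layer(rng.standard_normal(D))
```

The key design point the sketch mirrors is computing each expert where its weights already reside: shuttling a small activation vector between devices is far cheaper than transferring an expert's full weight matrix over PCIe.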
This research matters for engineering teams deploying large MoE models on commodity hardware, enabling efficient LLM inference without expensive GPU upgrades.
Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models