
Accelerating MoE Models with Hybrid Computing
Smart CPU-GPU orchestration for faster LLM inference
Fiddler is a system that orchestrates CPU and GPU resources for efficient inference of Mixture-of-Experts (MoE) language models in memory-constrained environments.
- Reduces inference latency by offloading experts to CPU memory while keeping frequently activated experts on the GPU; offloaded experts run directly on the CPU rather than being copied back, avoiding costly weight transfers
- Uses selective preloading that anticipates which experts will be needed next
- Implements an overlapped execution strategy in which the CPU and GPU process different parts of the model simultaneously
- Achieves up to 3.7× speedup compared to existing CPU-offloading approaches
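The orchestration idea in the bullets above can be sketched in a few lines. The following is a toy illustration, not Fiddler's actual implementation: NumPy matrix multiplies stand in for expert MLPs, a random gate stands in for the learned router, and a worker thread stands in for the CPU stream that overlaps with GPU work. All names (`experts`, `on_gpu`, `moe_layer`, the dimensions) are assumptions for the example.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Toy dimensions and expert placement (illustrative only).
D, N_EXPERTS, TOP_K = 8, 4, 2
rng = np.random.default_rng(0)

# Each expert is a single weight matrix here; real MoE experts are small MLPs.
experts = [rng.standard_normal((D, D)) for _ in range(N_EXPERTS)]
on_gpu = {0, 1}  # experts resident in "GPU" memory; the rest live in CPU RAM

def route(x):
    """Pick top-k experts for a token (random gate stands in for a learned one)."""
    logits = rng.standard_normal(N_EXPERTS)
    top = list(np.argsort(logits)[-TOP_K:])
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()
    return top, weights

def run_expert(e, x):
    # Compute where the weights live, instead of moving the weights.
    return experts[e] @ x

def moe_layer(x):
    """Overlap: CPU-resident experts run in a worker thread while the
    'GPU' experts run on the main thread; weighted outputs are merged."""
    top, w = route(x)
    cpu_side = [e for e in top if e not in on_gpu]
    gpu_side = [e for e in top if e in on_gpu]
    with ThreadPoolExecutor(max_workers=1) as pool:
        cpu_fut = pool.submit(lambda: [run_expert(e, x) for e in cpu_side])
        gpu_out = [run_expert(e, x) for e in gpu_side]  # overlapped with CPU work
        cpu_out = cpu_fut.result()
    outs = dict(zip(gpu_side + cpu_side, gpu_out + cpu_out))
    return sum(wi * outs[e] for e, wi in zip(top, w))

y = moe_layer(rng.standard_normal(D))
```

The key design point the sketch mirrors is computing each expert where its weights already reside: shuttling a small activation vector between devices is far cheaper than transferring an expert's full weight matrix over PCIe.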
This research matters for engineering teams deploying large MoE models on commodity hardware, enabling efficient LLM inference without expensive GPU upgrades.
Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models