
Optimizing MoE LLM Deployment
Maximizing throughput under hardware constraints
MoE-Lens is a serving system for Mixture-of-Experts (MoE) LLMs on GPUs with limited memory; it coordinates CPU memory and GPU compute to push throughput toward the hardware limit.
- Addresses the challenge of serving MoE models whose parameters exceed available GPU memory
- Introduces a performance model to identify throughput bottlenecks (a toy version is sketched after this list)
- Provides a 67% throughput improvement over baseline approaches
- Enables efficient scaling by managing expert placement across the CPU-GPU boundary (a placement sketch also follows below)
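
To make the performance-model bullet concrete, here is a minimal, hypothetical roofline-style estimate. It is not the paper's actual model; the function name, parameters, and numbers are illustrative assumptions. The idea: when expert weights must be streamed from CPU memory, decode throughput is capped by the slower of GPU compute and CPU-GPU transfer.

```python
# Hypothetical roofline-style estimate (not MoE-Lens's actual performance model).
# Throughput is bounded by whichever is slower: the GPU's compute rate, or the
# rate at which CPU-resident expert weights can cross the CPU-GPU link.

def decode_throughput_tokens_per_s(offloaded_bytes_per_token: float,
                                   link_bw_bytes_per_s: float,
                                   gpu_bound_tokens_per_s: float) -> float:
    """Return the bottlenecked decode rate under a simple min(transfer, compute) bound."""
    transfer_bound = link_bw_bytes_per_s / offloaded_bytes_per_token
    return min(transfer_bound, gpu_bound_tokens_per_s)

# Toy numbers: 1 GiB of offloaded expert weights touched per token,
# 25 GB/s effective PCIe bandwidth, 400 tokens/s if all weights were on-GPU.
rate = decode_throughput_tokens_per_s(1 * (1 << 30), 25e9, 400.0)
print(f"estimated decode rate: {rate:.1f} tokens/s")  # transfer-bound in this toy case
```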
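And a minimal sketch of CPU-GPU expert placement, assuming per-expert routing frequencies have already been profiled. This greedy budget-filling heuristic is illustrative only and is not MoE-Lens's actual algorithm; all names and numbers are made up for the example.

```python
# Hypothetical placement sketch: keep the most frequently routed experts in GPU
# memory up to a budget, and leave the rest in CPU RAM to be streamed on demand.

from dataclasses import dataclass

@dataclass
class Expert:
    layer: int
    index: int
    bytes: int        # parameter footprint of this expert
    hit_rate: float   # profiled routing frequency (assumed available)

def plan_placement(experts: list[Expert], gpu_budget_bytes: int):
    """Greedy placement: hottest experts go to GPU until the budget is spent."""
    gpu, cpu, used = [], [], 0
    for e in sorted(experts, key=lambda x: x.hit_rate, reverse=True):
        if used + e.bytes <= gpu_budget_bytes:
            gpu.append(e)
            used += e.bytes
        else:
            cpu.append(e)  # served from CPU memory, transferred over PCIe when routed to
    return gpu, cpu

if __name__ == "__main__":
    # Toy run: 8 experts of 1 GiB each, 4 GiB of spare GPU memory, so only the
    # four hottest experts stay GPU-resident.
    GiB = 1 << 30
    experts = [Expert(layer=0, index=i, bytes=GiB, hit_rate=1.0 / (i + 1))
               for i in range(8)]
    gpu, cpu = plan_placement(experts, gpu_budget_bytes=4 * GiB)
    print("GPU-resident experts:", [e.index for e in gpu])
    print("CPU-resident experts:", [e.index for e in cpu])
```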
For engineering teams deploying large MoE models in production on limited hardware, this work offers practical guidance for closing the gap between model scale and hardware constraints.
MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints