
Optimizing MoE LLM Deployment
Maximizing throughput under hardware constraints
MoE-Lens is a serving system for Mixture-of-Experts (MoE) LLMs on GPUs with limited memory; it coordinates CPU memory and GPU compute to push throughput toward the hardware limit.
- Addresses the challenge of serving MoE models whose parameters exceed available GPU memory
- Introduces a performance model to identify throughput bottlenecks (a toy version is sketched after this list)
- Provides a 67% throughput improvement over baseline approaches
- Enables efficient scaling by managing expert placement across the CPU-GPU boundary (a placement sketch also follows below)
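
To make the performance-model bullet concrete, here is a minimal, hypothetical roofline-style estimate. It is not the paper's actual model; the function name, parameters, and numbers are illustrative assumptions. The idea: when expert weights must be streamed from CPU memory, decode throughput is capped by the slower of GPU compute and CPU-GPU transfer.

```python
# Hypothetical roofline-style estimate (not MoE-Lens's actual performance model).
# Throughput is bounded by whichever is slower: the GPU's compute rate, or the
# rate at which CPU-resident expert weights can cross the CPU-GPU link.

def decode_throughput_tokens_per_s(offloaded_bytes_per_token: float,
                                   link_bw_bytes_per_s: float,
                                   gpu_bound_tokens_per_s: float) -> float:
    """Return the bottlenecked decode rate under a simple min(transfer, compute) bound."""
    transfer_bound = link_bw_bytes_per_s / offloaded_bytes_per_token
    return min(transfer_bound, gpu_bound_tokens_per_s)

# Toy numbers: 1 GiB of offloaded expert weights touched per token,
# 25 GB/s effective PCIe bandwidth, 400 tokens/s if all weights were on-GPU.
rate = decode_throughput_tokens_per_s(1 * (1 << 30), 25e9, 400.0)
print(f"estimated decode rate: {rate:.1f} tokens/s")  # transfer-bound in this toy case
```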
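And a minimal sketch of CPU-GPU expert placement, assuming per-expert routing frequencies have already been profiled. This greedy budget-filling heuristic is illustrative only and is not MoE-Lens's actual algorithm; all names and numbers are made up for the example.

```python
# Hypothetical placement sketch: keep the most frequently routed experts in GPU
# memory up to a budget, and leave the rest in CPU RAM to be streamed on demand.

from dataclasses import dataclass

@dataclass
class Expert:
    layer: int
    index: int
    bytes: int        # parameter footprint of this expert
    hit_rate: float   # profiled routing frequency (assumed available)

def plan_placement(experts: list[Expert], gpu_budget_bytes: int):
    """Greedy placement: hottest experts go to GPU until the budget is spent."""
    gpu, cpu, used = [], [], 0
    for e in sorted(experts, key=lambda x: x.hit_rate, reverse=True):
        if used + e.bytes <= gpu_budget_bytes:
            gpu.append(e)
            used += e.bytes
        else:
            cpu.append(e)  # served from CPU memory, transferred over PCIe when routed to
    return gpu, cpu

if __name__ == "__main__":
    # Toy run: 8 experts of 1 GiB each, 4 GiB of spare GPU memory, so only the
    # four hottest experts stay GPU-resident.
    GiB = 1 << 30
    experts = [Expert(layer=0, index=i, bytes=GiB, hit_rate=1.0 / (i + 1))
               for i in range(8)]
    gpu, cpu = plan_placement(experts, gpu_budget_bytes=4 * GiB)
    print("GPU-resident experts:", [e.index for e in gpu])
    print("CPU-resident experts:", [e.index for e in cpu])
```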
For engineering teams deploying large MoE models in production on limited hardware, this work offers practical guidance for closing the gap between model scale and hardware constraints.
MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints