Smarter LLM Scheduling for Mixed Workloads

Preemptive prioritization for MoE models improves service quality

QLLM introduces a priority-aware scheduling system for Mixture of Experts (MoE) models that efficiently handles mixed workloads in data centers.

  • Addresses head-of-line blocking by letting latency-sensitive jobs preempt best-effort jobs
  • Uses fine-grained preemption designed specifically for the MoE architecture
  • Sustains high throughput while meeting service-level objectives (SLOs)
  • Demonstrates practical improvements in real-world inference scenarios
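The preemption idea in the bullets above can be illustrated with a minimal sketch: a priority queue of requests where a newly arrived latency-sensitive job displaces a running best-effort job, which is re-queued and resumed later. This is a toy model, not QLLM's actual implementation; all class and tier names here are illustrative, and real MoE-aware preemption operates at a much finer granularity (e.g. between expert computations) than whole-request swapping.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Illustrative priority tiers; lower value = higher priority.
LATENCY_SENSITIVE = 0
BEST_EFFORT = 1

@dataclass(order=True)
class Request:
    priority: int
    seq: int                       # FIFO tie-breaker within a tier
    name: str = field(compare=False)

class PreemptiveScheduler:
    """Toy preemptive scheduler: a latency-sensitive arrival
    preempts a running best-effort request, which goes back on
    the queue and resumes once the GPU is free again."""

    def __init__(self):
        self.queue = []            # min-heap ordered by (priority, seq)
        self.running = None
        self._seq = count()

    def submit(self, name, priority):
        req = Request(priority, next(self._seq), name)
        if self.running is not None and priority < self.running.priority:
            # Preempt: re-queue the lower-priority running job.
            heapq.heappush(self.queue, self.running)
            self.running = req
        else:
            heapq.heappush(self.queue, req)
        return req

    def step(self):
        """Finish the current job (if any) and dispatch the next."""
        finished = self.running
        self.running = heapq.heappop(self.queue) if self.queue else None
        return finished
```

A quick usage trace: submit a best-effort job, dispatch it, then submit a latency-sensitive one; the best-effort job is pushed back on the queue and resumes only after the urgent job completes. This directly avoids head-of-line blocking, since urgent work never waits behind long-running batch requests.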

This research matters because it lets data centers serve multiple types of LLM workloads on shared hardware, maximizing resource utilization while ensuring that latency-critical applications are served first.

Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference
