Smarter LLM Scheduling for Mixed Workloads

Preemptive prioritization for MoE models improves service quality

QLLM introduces a priority-aware scheduling system for Mixture of Experts (MoE) models that efficiently handles mixed workloads in data centers.

  • Addresses head-of-line blocking by letting latency-sensitive jobs preempt best-effort jobs
  • Uses fine-grained preemption designed specifically for the MoE architecture
  • Sustains high throughput while meeting service-level objectives (SLOs)
  • Demonstrates practical improvements in real-world inference scenarios
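The preemption idea in the bullets above can be illustrated with a minimal sketch: a priority queue of requests where a newly arrived latency-sensitive job displaces a running best-effort job, which is re-queued and resumed later. This is a toy model, not QLLM's actual implementation; all class and tier names here are illustrative, and real MoE-aware preemption operates at a much finer granularity (e.g. between expert computations) than whole-request swapping.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Illustrative priority tiers; lower value = higher priority.
LATENCY_SENSITIVE = 0
BEST_EFFORT = 1

@dataclass(order=True)
class Request:
    priority: int
    seq: int                       # FIFO tie-breaker within a tier
    name: str = field(compare=False)

class PreemptiveScheduler:
    """Toy preemptive scheduler: a latency-sensitive arrival
    preempts a running best-effort request, which goes back on
    the queue and resumes once the GPU is free again."""

    def __init__(self):
        self.queue = []            # min-heap ordered by (priority, seq)
        self.running = None
        self._seq = count()

    def submit(self, name, priority):
        req = Request(priority, next(self._seq), name)
        if self.running is not None and priority < self.running.priority:
            # Preempt: re-queue the lower-priority running job.
            heapq.heappush(self.queue, self.running)
            self.running = req
        else:
            heapq.heappush(self.queue, req)
        return req

    def step(self):
        """Finish the current job (if any) and dispatch the next."""
        finished = self.running
        self.running = heapq.heappop(self.queue) if self.queue else None
        return finished
```

A quick usage trace: submit a best-effort job, dispatch it, then submit a latency-sensitive one; the best-effort job is pushed back on the queue and resumes only after the urgent job completes. This directly avoids head-of-line blocking, since urgent work never waits behind long-running batch requests.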

This research matters because it lets data centers serve multiple types of LLM workloads on shared hardware, maximizing resource utilization while ensuring that latency-critical applications are served first.

Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference
