Optimizing LLM Serving with Slice-Level Scheduling

Achieving Higher Throughput and Better Load Balancing for AI Applications

This research introduces a slice-level scheduling approach that improves LLM serving efficiency over traditional sequence-level scheduling methods.

  • Tackles the challenge of unpredictable request generation lengths and memory requirements in LLM serving
  • Achieves higher throughput and better load balancing across servers
  • Adjusts batch composition dynamically based on real-time resource availability instead of relying on static batches
  • Enables more efficient resource utilization by processing requests in smaller, manageable slices
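To make the slice-based batching above concrete, here is a minimal toy scheduler sketch. It is not the paper's implementation: the request model, slice size (`SLICE_TOKENS`), and memory budget (`MEMORY_BUDGET`) are illustrative assumptions. The key idea it demonstrates is reserving memory one slice at a time, so a request with an unpredictable generation length never holds a worst-case, whole-sequence reservation.

```python
from collections import deque
from dataclasses import dataclass

SLICE_TOKENS = 4    # generation tokens reserved per slice (assumed value)
MEMORY_BUDGET = 16  # memory units available per step (assumed value)

@dataclass
class Request:
    rid: int
    remaining: int  # true generation length, unknown to the scheduler
    done: int = 0

def schedule_step(pending: deque, running: list) -> list:
    """Admit requests slice-by-slice while per-slice reservations fit.

    Unlike sequence-level scheduling, memory is reserved for one slice
    (SLICE_TOKENS tokens) at a time, so short requests free capacity as
    soon as they finish and long requests cannot pin a worst-case
    reservation for their whole lifetime.
    """
    # Admit new requests as long as one more per-slice reservation fits.
    while pending and (len(running) + 1) * SLICE_TOKENS <= MEMORY_BUDGET:
        running.append(pending.popleft())
    batch = []
    for req in running:
        # Generate at most one slice for every running request.
        step = min(SLICE_TOKENS, req.remaining - req.done)
        req.done += step
        batch.append((req.rid, step))
    # Drop finished requests so the next step can admit new ones.
    running[:] = [r for r in running if r.done < r.remaining]
    return batch

# Four requests with varying (initially unknown) generation lengths.
pending = deque(Request(i, n) for i, n in enumerate([3, 10, 5, 8]))
running: list[Request] = []
steps = 0
while pending or running:
    schedule_step(pending, running)
    steps += 1
print(steps)
```

In this toy run, all four requests fit concurrently because each reserves only one slice of memory; under sequence-level scheduling, reserving for each request's maximum possible length would force some of them to wait.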

For engineering teams, this approach enables more cost-effective deployment of LLM-powered applications while maintaining or improving response times and quality of service.

Original Paper: Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving
