Optimizing LLM Serving for Hybrid Workloads

A novel system for balancing real-time and batch processing requests

BROS is a new serving system that efficiently handles both latency-sensitive interactive requests and throughput-oriented batch processing workloads for large language models.

  • Achieves 2.76× higher throughput while still meeting the latency requirements of real-time requests
  • Uses a dynamic token-level scheduling algorithm to interleave different request types
  • Implements heterogeneous batching for improved GPU utilization and memory management
  • Provides a practical solution for production LLM deployments with mixed workload requirements
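The core idea behind the scheduling bullets above can be illustrated with a minimal sketch. This is a hypothetical simplification (the class and field names are invented, not from the paper): at every decode iteration, real-time requests claim batch slots first, and any leftover capacity in the heterogeneous batch is backfilled with best-effort work, so batch jobs make progress without delaying interactive ones.

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class Request:
    rid: str
    realtime: bool     # True = latency-sensitive, False = best-effort
    tokens_left: int   # decode tokens still to generate

class TokenLevelScheduler:
    """Hypothetical sketch of dynamic token-level scheduling:
    real-time requests are admitted to each decode step first;
    remaining slots go to best-effort requests."""

    def __init__(self, batch_slots: int):
        self.batch_slots = batch_slots
        self.rt = deque()  # real-time queue
        self.be = deque()  # best-effort queue

    def submit(self, req: Request):
        (self.rt if req.realtime else self.be).append(req)

    def step(self):
        """Run one decode iteration; return request ids scheduled."""
        batch = []
        # Real-time requests claim slots first, up to batch capacity.
        while self.rt and len(batch) < self.batch_slots:
            batch.append(self.rt.popleft())
        # Leftover slots are backfilled with best-effort requests.
        while self.be and len(batch) < self.batch_slots:
            batch.append(self.be.popleft())
        scheduled = []
        for req in batch:
            req.tokens_left -= 1  # generate one token for this request
            scheduled.append(req.rid)
            if req.tokens_left > 0:  # requeue unfinished requests
                (self.rt if req.realtime else self.be).append(req)
        return scheduled

# Usage: with 2 slots, the real-time request appears in every step,
# while the two best-effort requests alternate in the spare slot.
sched = TokenLevelScheduler(batch_slots=2)
sched.submit(Request("rt1", True, 3))
sched.submit(Request("be1", False, 2))
sched.submit(Request("be2", False, 2))
first = sched.step()   # rt1 plus one best-effort request
second = sched.step()  # rt1 again; the other best-effort request
```

The real system additionally manages KV-cache memory for the mixed batch; this sketch only shows the slot-priority idea behind interleaving the two request classes.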

This research addresses critical engineering challenges in deploying LLMs at scale, enabling organizations to serve multiple use cases on the same infrastructure without sacrificing performance or requiring duplicate resources.

Original Paper: Efficient LLM Serving on Hybrid Real-time and Best-effort Requests
