Optimizing LLM Serving for Hybrid Workloads

A novel system for balancing real-time and batch processing requests

BROS is a new serving system that efficiently handles both latency-sensitive interactive requests and throughput-oriented batch processing workloads for large language models.

  • Achieves 2.76× higher throughput while still meeting the latency requirements of real-time requests
  • Uses a dynamic token-level scheduling algorithm to interleave different request types
  • Implements heterogeneous batching for improved GPU utilization and memory management
  • Provides a practical solution for production LLM deployments with mixed workload requirements
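The core idea behind the scheduling bullets above can be illustrated with a minimal sketch. This is a hypothetical simplification (the class and field names are invented, not from the paper): at every decode iteration, real-time requests claim batch slots first, and any leftover capacity in the heterogeneous batch is backfilled with best-effort work, so batch jobs make progress without delaying interactive ones.

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class Request:
    rid: str
    realtime: bool     # True = latency-sensitive, False = best-effort
    tokens_left: int   # decode tokens still to generate

class TokenLevelScheduler:
    """Hypothetical sketch of dynamic token-level scheduling:
    real-time requests are admitted to each decode step first;
    remaining slots go to best-effort requests."""

    def __init__(self, batch_slots: int):
        self.batch_slots = batch_slots
        self.rt = deque()  # real-time queue
        self.be = deque()  # best-effort queue

    def submit(self, req: Request):
        (self.rt if req.realtime else self.be).append(req)

    def step(self):
        """Run one decode iteration; return request ids scheduled."""
        batch = []
        # Real-time requests claim slots first, up to batch capacity.
        while self.rt and len(batch) < self.batch_slots:
            batch.append(self.rt.popleft())
        # Leftover slots are backfilled with best-effort requests.
        while self.be and len(batch) < self.batch_slots:
            batch.append(self.be.popleft())
        scheduled = []
        for req in batch:
            req.tokens_left -= 1  # generate one token for this request
            scheduled.append(req.rid)
            if req.tokens_left > 0:  # requeue unfinished requests
                (self.rt if req.realtime else self.be).append(req)
        return scheduled

# Usage: with 2 slots, the real-time request appears in every step,
# while the two best-effort requests alternate in the spare slot.
sched = TokenLevelScheduler(batch_slots=2)
sched.submit(Request("rt1", True, 3))
sched.submit(Request("be1", False, 2))
sched.submit(Request("be2", False, 2))
first = sched.step()   # rt1 plus one best-effort request
second = sched.step()  # rt1 again; the other best-effort request
```

The real system additionally manages KV-cache memory for the mixed batch; this sketch only shows the slot-priority idea behind interleaving the two request classes.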

This research addresses critical engineering challenges in deploying LLMs at scale, enabling organizations to serve multiple use cases on the same infrastructure without sacrificing performance or requiring duplicate resources.

Original Paper: Efficient LLM Serving on Hybrid Real-time and Best-effort Requests
