Optimizing LLM Serving Resources

Balancing GPU Compute and Key-Value Cache for Efficient LLM Deployment

EconoServe introduces a novel scheduler that maximizes utilization of both GPU compute and the Key-Value (KV) cache while maintaining Service Level Objectives (SLOs) in LLM serving systems.

  • Addresses the critical challenge of optimizing multiple resources simultaneously in LLM serving
  • Ensures the Key-Value Cache is allocated in time for the request batches that need it
  • Delivers higher throughput than existing schedulers that optimize only a single resource
  • Maintains strict SLO guarantees while reducing operational costs
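The summary above does not spell out EconoServe's actual algorithm, but the core idea of jointly budgeting two resources can be sketched as a simple admission loop. The sketch below is a hypothetical illustration, not the paper's method: `Request`, its `compute_tokens` and `kv_blocks` fields, and `DualResourceScheduler` are all assumed names, and a request is admitted into a batch only when both the compute budget and the KV-cache capacity can accommodate it.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical per-request demands: tokens to process this step,
    # and KV-cache blocks that must be resident if the request is admitted.
    compute_tokens: int
    kv_blocks: int


class DualResourceScheduler:
    """Illustrative batch former: admit a request only if BOTH the GPU
    compute budget and the KV-cache capacity still have room for it."""

    def __init__(self, compute_budget: int, kv_capacity: int):
        self.compute_budget = compute_budget
        self.kv_capacity = kv_capacity
        self.queue: deque = deque()

    def submit(self, req: Request) -> None:
        self.queue.append(req)

    def form_batch(self) -> list:
        batch, used_compute, used_kv = [], 0, 0
        deferred = deque()
        while self.queue:
            req = self.queue.popleft()
            fits_compute = used_compute + req.compute_tokens <= self.compute_budget
            fits_kv = used_kv + req.kv_blocks <= self.kv_capacity
            if fits_compute and fits_kv:
                batch.append(req)
                used_compute += req.compute_tokens
                used_kv += req.kv_blocks
            else:
                # A single-resource scheduler would only check one of the
                # two conditions; here a shortage of either defers the request.
                deferred.append(req)
        self.queue = deferred  # deferred requests keep their arrival order
        return batch


# Example: compute budget 100 tokens, KV capacity 10 blocks.
sched = DualResourceScheduler(compute_budget=100, kv_capacity=10)
for r in [Request(60, 4), Request(30, 4), Request(20, 8)]:
    sched.submit(r)

first = sched.form_batch()   # admits the first two; third would exceed compute
second = sched.form_batch()  # deferred request fits once resources free up
```

A real serving scheduler would additionally track per-request deadlines (the SLOs) and release KV blocks as sequences finish; this sketch only shows the dual-resource admission check.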

This research enables engineering teams to deploy large language models more cost-effectively at scale, addressing growing concerns about GPU resource constraints and operational efficiency in production environments.

EconoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving

112 | 521