Optimizing LLM Inference at Scale

A hybrid offline-online scheduling approach for maximizing throughput

This research presents a novel hybrid offline-online scheduling method that optimizes large language model (LLM) inference systems for maximum hardware utilization and throughput.

  • Formulates inference scheduling as a mixed-integer programming (MIP) problem (see the sketch after this list)
  • Offline component solves the large-scale scheduling problem ahead of time
  • Online component adapts the precomputed plan to real-time operational dynamics (see the dispatcher sketch below)
  • Significantly improves system efficiency and resource utilization

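To make the MIP framing concrete, here is a minimal sketch of one plausible formulation: assigning batched inference requests to GPU replicas so as to maximize tokens served, subject to a per-replica memory budget. The variables, parameters (`tokens`, `mem`, `MEM_BUDGET`), and the PuLP/CBC solver choice are illustrative assumptions, not the paper's actual model.

```python
# Minimal MIP sketch (assumed formulation, not the paper's exact model):
# place each request batch on at most one GPU replica, maximizing total
# served tokens while respecting each replica's memory budget.
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum, PULP_CBC_CMD

requests = ["r0", "r1", "r2", "r3"]                      # pending batches
gpus = ["g0", "g1"]                                      # available replicas
tokens = {"r0": 900, "r1": 400, "r2": 700, "r3": 300}    # tokens served if placed
mem = {"r0": 30, "r1": 14, "r2": 22, "r3": 10}           # KV-cache GB per batch
MEM_BUDGET = 40                                          # GB per replica

prob = LpProblem("offline_schedule", LpMaximize)

# x[(r, g)] == 1 iff request r is placed on GPU g
x = LpVariable.dicts("x", [(r, g) for r in requests for g in gpus], cat=LpBinary)

# Objective: maximize total tokens served across all placements.
prob += lpSum(tokens[r] * x[(r, g)] for r in requests for g in gpus)

# Each request is placed on at most one replica.
for r in requests:
    prob += lpSum(x[(r, g)] for g in gpus) <= 1

# Each replica's memory budget is respected.
for g in gpus:
    prob += lpSum(mem[r] * x[(r, g)] for r in requests) <= MEM_BUDGET

prob.solve(PULP_CBC_CMD(msg=False))
for (r, g), var in x.items():
    if var.value() == 1:
        print(f"{r} -> {g}")
```

A production-scale program would be far larger (time horizons, batching decisions, heterogeneous hardware), which is why an exact offline solve is paired with a cheaper online correction.
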
For engineering teams, this approach offers a practical solution to the growing challenge of efficiently deploying LLMs in production environments, helping to reduce serving costs while maintaining performance at scale.
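
As a hypothetical illustration of the online half, the sketch below greedily routes arriving requests: it follows the offline plan when the planned replica has headroom, and otherwise falls back to the least-loaded replica. The `offline_plan` structure and the load-tracking heuristic are assumptions for illustration, not the paper's actual online policy.

```python
# Hypothetical online dispatcher sketch: follow the offline plan when possible,
# otherwise adapt by picking the least-loaded replica with free capacity.
from collections import defaultdict

class OnlineDispatcher:
    def __init__(self, offline_plan, capacity):
        self.plan = offline_plan         # request_id -> planned replica (assumed)
        self.capacity = capacity         # replica -> max concurrent batches
        self.load = defaultdict(int)     # replica -> in-flight batches

    def dispatch(self, request_id):
        # Prefer the offline assignment if its replica still has headroom.
        planned = self.plan.get(request_id)
        if planned is not None and self.load[planned] < self.capacity[planned]:
            self.load[planned] += 1
            return planned
        # Otherwise adapt online: least-loaded replica with spare capacity.
        candidates = [g for g in self.capacity if self.load[g] < self.capacity[g]]
        if not candidates:
            return None                  # all replicas saturated; caller queues
        target = min(candidates, key=lambda g: self.load[g])
        self.load[target] += 1
        return target

    def complete(self, replica):
        self.load[replica] -= 1          # free a slot when a batch finishes

dispatcher = OnlineDispatcher({"r0": "g0"}, {"g0": 2, "g1": 2})
print(dispatcher.dispatch("r0"))   # -> g0 (follows the offline plan)
print(dispatcher.dispatch("r9"))   # -> g1 (online fallback, least loaded)
```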

Paper: Hybrid Offline-online Scheduling Method for Large Language Model Inference Optimization