Optimizing LLM Inference at Scale

A hybrid offline-online scheduling approach for maximizing throughput

This research presents a novel hybrid offline-online scheduling method that optimizes large language model (LLM) inference systems for maximum hardware utilization and throughput.

  • Formulates inference scheduling as a mixed-integer programming (MIP) problem (see the sketch after this list)
  • Offline component solves the large-scale scheduling problem ahead of time
  • Online component adapts the precomputed plan to real-time operational dynamics (see the dispatcher sketch below)
  • Significantly improves system efficiency and resource utilization

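To make the MIP framing concrete, here is a minimal sketch of one plausible formulation: assigning batched inference requests to GPU replicas so as to maximize tokens served, subject to a per-replica memory budget. The variables, parameters (`tokens`, `mem`, `MEM_BUDGET`), and the PuLP/CBC solver choice are illustrative assumptions, not the paper's actual model.

```python
# Minimal MIP sketch (assumed formulation, not the paper's exact model):
# place each request batch on at most one GPU replica, maximizing total
# served tokens while respecting each replica's memory budget.
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum, PULP_CBC_CMD

requests = ["r0", "r1", "r2", "r3"]                      # pending batches
gpus = ["g0", "g1"]                                      # available replicas
tokens = {"r0": 900, "r1": 400, "r2": 700, "r3": 300}    # tokens served if placed
mem = {"r0": 30, "r1": 14, "r2": 22, "r3": 10}           # KV-cache GB per batch
MEM_BUDGET = 40                                          # GB per replica

prob = LpProblem("offline_schedule", LpMaximize)

# x[(r, g)] == 1 iff request r is placed on GPU g
x = LpVariable.dicts("x", [(r, g) for r in requests for g in gpus], cat=LpBinary)

# Objective: maximize total tokens served across all placements.
prob += lpSum(tokens[r] * x[(r, g)] for r in requests for g in gpus)

# Each request is placed on at most one replica.
for r in requests:
    prob += lpSum(x[(r, g)] for g in gpus) <= 1

# Each replica's memory budget is respected.
for g in gpus:
    prob += lpSum(mem[r] * x[(r, g)] for r in requests) <= MEM_BUDGET

prob.solve(PULP_CBC_CMD(msg=False))
for (r, g), var in x.items():
    if var.value() == 1:
        print(f"{r} -> {g}")
```

A production-scale program would be far larger (time horizons, batching decisions, heterogeneous hardware), which is why an exact offline solve is paired with a cheaper online correction.
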
For engineering teams, this approach offers a practical solution to the growing challenge of efficiently deploying LLMs in production environments, helping to reduce serving costs while maintaining performance at scale.
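
As a hypothetical illustration of the online half, the sketch below greedily routes arriving requests: it follows the offline plan when the planned replica has headroom, and otherwise falls back to the least-loaded replica. The `offline_plan` structure and the load-tracking heuristic are assumptions for illustration, not the paper's actual online policy.

```python
# Hypothetical online dispatcher sketch: follow the offline plan when possible,
# otherwise adapt by picking the least-loaded replica with free capacity.
from collections import defaultdict

class OnlineDispatcher:
    def __init__(self, offline_plan, capacity):
        self.plan = offline_plan         # request_id -> planned replica (assumed)
        self.capacity = capacity         # replica -> max concurrent batches
        self.load = defaultdict(int)     # replica -> in-flight batches

    def dispatch(self, request_id):
        # Prefer the offline assignment if its replica still has headroom.
        planned = self.plan.get(request_id)
        if planned is not None and self.load[planned] < self.capacity[planned]:
            self.load[planned] += 1
            return planned
        # Otherwise adapt online: least-loaded replica with spare capacity.
        candidates = [g for g in self.capacity if self.load[g] < self.capacity[g]]
        if not candidates:
            return None                  # all replicas saturated; caller queues
        target = min(candidates, key=lambda g: self.load[g])
        self.load[target] += 1
        return target

    def complete(self, replica):
        self.load[replica] -= 1          # free a slot when a batch finishes

dispatcher = OnlineDispatcher({"r0": "g0"}, {"g0": 2, "g1": 2})
print(dispatcher.dispatch("r0"))   # -> g0 (follows the offline plan)
print(dispatcher.dispatch("r9"))   # -> g1 (online fallback, least loaded)
```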

Paper: Hybrid Offline-online Scheduling Method for Large Language Model Inference Optimization