
Next-Gen LLM Inference Architecture
Simulating and optimizing multi-stage AI pipelines for heterogeneous hardware
HERMES is a simulator for heterogeneous multi-stage LLM inference, addressing the growing complexity of modern AI inference pipelines.
- Extends beyond traditional prefill-decode modeling to capture complex workflows such as retrieval-augmented generation (RAG), KV cache retrieval, and multi-step reasoning
- Enables accurate performance prediction and bottleneck identification across diverse hardware (GPUs, ASICs, CPUs); see the sketch after this list
- Provides critical insights for system designers to optimize resource allocation and hardware-software co-design
- Supports real-world deployments where different inference stages have vastly different computational profiles
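To make the kind of analysis described above concrete, here is a minimal sketch of a roofline-style stage-latency model for a heterogeneous pipeline. All names (`Device`, `Stage`, `stage_latency`, `simulate`) and all numbers are hypothetical illustrations under simplified assumptions, not HERMES's actual API or measured results.

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    peak_tflops: float   # peak compute throughput (TFLOP/s), illustrative
    mem_bw_gbps: float   # memory bandwidth (GB/s), illustrative

@dataclass
class Stage:
    name: str
    flops: float         # total floating-point operations for the stage
    bytes_moved: float   # total memory traffic for the stage (bytes)

def stage_latency(stage: Stage, dev: Device) -> float:
    """Roofline-style estimate: a stage is bound by whichever of
    compute time or memory time is larger on the given device."""
    compute_s = stage.flops / (dev.peak_tflops * 1e12)
    memory_s = stage.bytes_moved / (dev.mem_bw_gbps * 1e9)
    return max(compute_s, memory_s)

def simulate(pipeline: list[tuple[Stage, Device]]) -> None:
    """Sum per-stage latencies and report the bottleneck stage."""
    total, slowest = 0.0, ("", 0.0)
    for stage, dev in pipeline:
        t = stage_latency(stage, dev)
        total += t
        if t > slowest[1]:
            slowest = (f"{stage.name} on {dev.name}", t)
        print(f"{stage.name:>10} on {dev.name:<4}: {t * 1e3:8.2f} ms")
    print(f"{'total':>10}: {total * 1e3:8.2f} ms  (bottleneck: {slowest[0]})")

# Hypothetical hardware and workload numbers, for illustration only.
gpu = Device("GPU", peak_tflops=300.0, mem_bw_gbps=2000.0)
cpu = Device("CPU", peak_tflops=2.0, mem_bw_gbps=100.0)

pipeline = [
    (Stage("retrieval", flops=5e9, bytes_moved=4e9), cpu),   # RAG lookup
    (Stage("prefill", flops=2e13, bytes_moved=1e11), gpu),   # compute-bound
    (Stage("decode", flops=2e11, bytes_moved=5e11), gpu),    # memory-bound
]
simulate(pipeline)
```

Even a toy model like this illustrates why stage-level simulation matters: prefill tends to be compute-bound while decode is memory-bandwidth-bound, so assigning stages to different hardware can shift where the bottleneck sits.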
This research helps engineering teams build more efficient, cost-effective AI infrastructure by giving them a clearer picture of the performance characteristics of complex, distributed inference systems.
Understanding and Optimizing Multi-Stage AI Inference Pipelines