
Next-Gen LLM Inference Architecture
Simulating and optimizing multi-stage AI pipelines for heterogeneous hardware
HERMES is a simulator for heterogeneous multi-stage LLM inference, addressing the growing complexity of modern AI inference pipelines.
- Extends beyond traditional prefill-decode modeling to capture complex workflows such as retrieval-augmented generation (RAG), KV cache retrieval, and multi-step reasoning
- Enables accurate performance prediction and bottleneck identification across diverse hardware (GPUs, ASICs, CPUs); see the sketch after this list
- Provides critical insights for system designers to optimize resource allocation and hardware-software co-design
- Supports real-world deployments where different inference stages have vastly different computational profiles
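To make the kind of analysis described above concrete, here is a minimal sketch of a roofline-style stage-latency model for a heterogeneous pipeline. All names (`Device`, `Stage`, `stage_latency`, `simulate`) and all numbers are hypothetical illustrations under simplified assumptions, not HERMES's actual API or measured results.

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    peak_tflops: float   # peak compute throughput (TFLOP/s), illustrative
    mem_bw_gbps: float   # memory bandwidth (GB/s), illustrative

@dataclass
class Stage:
    name: str
    flops: float         # total floating-point operations for the stage
    bytes_moved: float   # total memory traffic for the stage (bytes)

def stage_latency(stage: Stage, dev: Device) -> float:
    """Roofline-style estimate: a stage is bound by whichever of
    compute time or memory time is larger on the given device."""
    compute_s = stage.flops / (dev.peak_tflops * 1e12)
    memory_s = stage.bytes_moved / (dev.mem_bw_gbps * 1e9)
    return max(compute_s, memory_s)

def simulate(pipeline: list[tuple[Stage, Device]]) -> None:
    """Sum per-stage latencies and report the bottleneck stage."""
    total, slowest = 0.0, ("", 0.0)
    for stage, dev in pipeline:
        t = stage_latency(stage, dev)
        total += t
        if t > slowest[1]:
            slowest = (f"{stage.name} on {dev.name}", t)
        print(f"{stage.name:>10} on {dev.name:<4}: {t * 1e3:8.2f} ms")
    print(f"{'total':>10}: {total * 1e3:8.2f} ms  (bottleneck: {slowest[0]})")

# Hypothetical hardware and workload numbers, for illustration only.
gpu = Device("GPU", peak_tflops=300.0, mem_bw_gbps=2000.0)
cpu = Device("CPU", peak_tflops=2.0, mem_bw_gbps=100.0)

pipeline = [
    (Stage("retrieval", flops=5e9, bytes_moved=4e9), cpu),   # RAG lookup
    (Stage("prefill", flops=2e13, bytes_moved=1e11), gpu),   # compute-bound
    (Stage("decode", flops=2e11, bytes_moved=5e11), gpu),    # memory-bound
]
simulate(pipeline)
```

Even a toy model like this illustrates why stage-level simulation matters: prefill tends to be compute-bound while decode is memory-bandwidth-bound, so assigning stages to different hardware can shift where the bottleneck sits.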
This research helps engineering teams build more efficient, cost-effective AI infrastructure by giving them a clearer picture of the performance characteristics of complex, distributed inference systems.
Understanding and Optimizing Multi-Stage AI Inference Pipelines