Next-Gen LLM Inference Architecture

Simulating and optimizing multi-stage AI pipelines for heterogeneous hardware

HERMES is a simulator for heterogeneous, multi-stage LLM execution, built to address the growing complexity of modern AI inference pipelines.

  • Extends beyond traditional prefill-decode to model complex workflows like RAG, KV cache retrieval, and multi-step reasoning
  • Enables accurate performance prediction and bottleneck identification across diverse hardware (GPUs, ASICs, CPUs); a simplified illustration of this kind of per-stage estimate appears after this list
  • Provides critical insights for system designers to optimize resource allocation and hardware-software co-design
  • Supports real-world deployments where different inference stages have vastly different computational profiles
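
One way to picture the kind of estimate such a simulator produces is a roofline-style latency model: each stage is bound by whichever resource it saturates first on its assigned device. The sketch below is a minimal illustration of that idea, not HERMES's actual model; all device specs, stage placements, and workload numbers are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class Device:
        name: str
        peak_tflops: float   # peak compute throughput (TFLOP/s)
        mem_bw_gbs: float    # memory bandwidth (GB/s)

    @dataclass
    class Stage:
        name: str
        flops: float         # total floating-point work in the stage
        bytes_moved: float   # total bytes of memory traffic in the stage
        device: Device       # hardware the stage is placed on

    def stage_latency(stage: Stage) -> float:
        # Roofline-style bound: the stage takes as long as the slower of
        # its compute time and its memory-transfer time on its device.
        compute_s = stage.flops / (stage.device.peak_tflops * 1e12)
        memory_s = stage.bytes_moved / (stage.device.mem_bw_gbs * 1e9)
        return max(compute_s, memory_s)

    # Hypothetical heterogeneous placement: retrieval on a CPU host,
    # prefill and decode on a GPU. All numbers are illustrative.
    cpu = Device("host-CPU", peak_tflops=2.0, mem_bw_gbs=100.0)
    gpu = Device("accel-GPU", peak_tflops=300.0, mem_bw_gbs=2000.0)

    pipeline = [
        Stage("retrieval (RAG)",  flops=5e10, bytes_moved=4e10, device=cpu),
        Stage("prefill",          flops=2e13, bytes_moved=1e11, device=gpu),
        Stage("decode (1 token)", flops=2e10, bytes_moved=2e10, device=gpu),
    ]

    for s in pipeline:
        print(f"{s.name:17s} on {s.device.name:9s}: {stage_latency(s) * 1e3:7.2f} ms")

    bottleneck = max(pipeline, key=stage_latency)
    print(f"bottleneck stage: {bottleneck.name}")

Even this toy model reproduces the qualitative picture that motivates the work: decode is memory-bandwidth-bound while prefill is compute-bound, so the stages have very different computational profiles and may be best served by different hardware.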

This research helps engineering teams build more efficient, cost-effective AI infrastructure by giving them a clearer picture of the performance characteristics of complex, distributed inference systems.
