
Optimizing LLM Inference: CPU-GPU Architecture Analysis
Performance insights across PCIe A100/H100 and GH200 systems
This research analyzes LLM inference characteristics across different CPU-GPU architectures to optimize performance and reduce data center costs.
- Provides comprehensive performance analysis of loosely-coupled (PCIe A100/H100) and closely-coupled (GH200) systems
- Identifies bottlenecks in current inference workflows through kernel-level tracing
- Demonstrates how the degree of CPU-GPU coupling affects inference efficiency
- Proposes optimization techniques such as kernel fusion for improved throughput (see the sketch after this list)
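The sketch below is illustrative only and not taken from the paper: it shows the kind of kernel fusion the work proposes, using a hypothetical bias-add + GELU pair from a transformer MLP block. In the unfused path the intermediate tensor round-trips through GPU global memory and the CPU issues two kernel launches; the fused kernel does both steps in one pass, which a kernel-level trace (e.g. with Nsight Systems) would show as a single launch with roughly half the memory traffic. All sizes and kernel names here are assumptions for illustration.

```cuda
// Illustrative kernel-fusion sketch (hypothetical shapes and names, not from the paper).
#include <cuda_runtime.h>
#include <math.h>

__device__ float gelu(float x) {
    // tanh approximation of GELU, as commonly used in transformer MLP blocks
    return 0.5f * x * (1.0f + tanhf(0.7978845608f * (x + 0.044715f * x * x * x)));
}

// Unfused step 1: add a per-column bias to a row-major [n/d, d] activation tensor.
__global__ void bias_add(float* y, const float* x, const float* b, int n, int d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] + b[i % d];
}

// Unfused step 2: apply GELU to the intermediate result (re-read from global memory).
__global__ void gelu_kernel(float* y, const float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = gelu(x[i]);
}

// Fused version: one launch, one global-memory read and one write per element.
__global__ void bias_add_gelu_fused(float* y, const float* x, const float* b, int n, int d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = gelu(x[i] + b[i % d]);
}

int main() {
    const int n = 4096 * 4096, d = 4096;  // hypothetical activation size
    float *x, *b, *y;
    cudaMalloc((void**)&x, n * sizeof(float));
    cudaMalloc((void**)&b, d * sizeof(float));
    cudaMalloc((void**)&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(b, 0, d * sizeof(float));

    const int threads = 256, blocks = (n + threads - 1) / threads;

    // Unfused path: two kernel launches, intermediate tensor written to and re-read from HBM.
    bias_add<<<blocks, threads>>>(y, x, b, n, d);
    gelu_kernel<<<blocks, threads>>>(y, y, n);

    // Fused path: a kernel-level trace would show a single launch here.
    bias_add_gelu_fused<<<blocks, threads>>>(y, x, b, n, d);

    cudaDeviceSynchronize();
    cudaFree(x); cudaFree(b); cudaFree(y);
    return 0;
}
```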
Engineering implications: As LLM workloads increasingly dominate data center resources, understanding how each architecture behaves under inference load enables more cost-effective deployment strategies and infrastructure planning.
Paper: Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures