Optimizing LLM Inference: CPU-GPU Architecture Analysis

Performance insights across PCIe A100/H100 and GH200 systems

This research characterizes LLM inference workloads across different CPU-GPU architectures to guide performance optimization and reduce data center costs.

  • Provides comprehensive performance analysis of loosely-coupled (PCIe A100/H100) and closely-coupled (GH200) systems
  • Identifies bottlenecks in current inference workflows through kernel-level tracing
  • Demonstrates how architectural differences impact inference efficiency
  • Proposes optimization techniques such as kernel fusion for improved throughput (a minimal sketch follows this list)

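As a rough illustration of the kind of fusion referred to in the last bullet (not taken from the paper), the CUDA sketch below merges a bias-add and a ReLU, two elementwise kernels that would otherwise run as separate launches, into a single kernel. This removes one kernel-launch overhead and one round trip through global memory; all kernel names, sizes, and launch parameters are hypothetical.

```cuda
// Minimal kernel-fusion sketch (illustrative only, not the paper's code).
#include <cuda_runtime.h>

__global__ void bias_add(float* x, const float* bias, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += bias[i];
}

__global__ void relu(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = fmaxf(x[i], 0.0f);
}

// Fused version: one launch, and each element of x is read and written once.
__global__ void bias_add_relu_fused(float* x, const float* bias, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = fmaxf(x[i] + bias[i], 0.0f);
}

int main() {
    const int n = 1 << 20;                 // hypothetical tensor size
    float *x, *bias;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&bias, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(bias, 0, n * sizeof(float));

    dim3 block(256), grid((n + 255) / 256);

    // Unfused path: two launches, x traverses global memory twice.
    bias_add<<<grid, block>>>(x, bias, n);
    relu<<<grid, block>>>(x, n);

    // Fused path: a single launch performs the same computation.
    bias_add_relu_fused<<<grid, block>>>(x, bias, n);

    cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(bias);
    return 0;
}
```

In production inference stacks, fusion of this kind is typically generated by a compiler's operator-fusion passes rather than written by hand, but the underlying launch-overhead and memory-traffic argument is the same.
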
Engineering implications: As LLM workloads increasingly dominate data center resources, understanding architectural performance dynamics enables more cost-effective deployment strategies and infrastructure planning.

Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures