
Optimizing LLM Inference: CPU-GPU Architecture Analysis
Performance insights across PCIe A100/H100 and GH200 systems
This research analyzes LLM inference characteristics across different CPU-GPU architectures to optimize performance and reduce data center costs.
- Provides comprehensive performance analysis of loosely-coupled (PCIe A100/H100) and closely-coupled (GH200) systems
- Identifies bottlenecks in current inference workflows through kernel-level tracing
- Demonstrates how the degree of CPU-GPU coupling affects inference efficiency
- Proposes optimization techniques such as kernel fusion for improved throughput (see the sketch after this list)
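The sketch below is illustrative only and not taken from the paper: it shows the kind of kernel fusion the work proposes, using a hypothetical bias-add + GELU pair from a transformer MLP block. In the unfused path the intermediate tensor round-trips through GPU global memory and the CPU issues two kernel launches; the fused kernel does both steps in one pass, which a kernel-level trace (e.g. with Nsight Systems) would show as a single launch with roughly half the memory traffic. All sizes and kernel names here are assumptions for illustration.

```cuda
// Illustrative kernel-fusion sketch (hypothetical shapes and names, not from the paper).
#include <cuda_runtime.h>
#include <math.h>

__device__ float gelu(float x) {
    // tanh approximation of GELU, as commonly used in transformer MLP blocks
    return 0.5f * x * (1.0f + tanhf(0.7978845608f * (x + 0.044715f * x * x * x)));
}

// Unfused step 1: add a per-column bias to a row-major [n/d, d] activation tensor.
__global__ void bias_add(float* y, const float* x, const float* b, int n, int d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] + b[i % d];
}

// Unfused step 2: apply GELU to the intermediate result (re-read from global memory).
__global__ void gelu_kernel(float* y, const float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = gelu(x[i]);
}

// Fused version: one launch, one global-memory read and one write per element.
__global__ void bias_add_gelu_fused(float* y, const float* x, const float* b, int n, int d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = gelu(x[i] + b[i % d]);
}

int main() {
    const int n = 4096 * 4096, d = 4096;  // hypothetical activation size
    float *x, *b, *y;
    cudaMalloc((void**)&x, n * sizeof(float));
    cudaMalloc((void**)&b, d * sizeof(float));
    cudaMalloc((void**)&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(b, 0, d * sizeof(float));

    const int threads = 256, blocks = (n + threads - 1) / threads;

    // Unfused path: two kernel launches, intermediate tensor written to and re-read from HBM.
    bias_add<<<blocks, threads>>>(y, x, b, n, d);
    gelu_kernel<<<blocks, threads>>>(y, y, n);

    // Fused path: a kernel-level trace would show a single launch here.
    bias_add_gelu_fused<<<blocks, threads>>>(y, x, b, n, d);

    cudaDeviceSynchronize();
    cudaFree(x); cudaFree(b); cudaFree(y);
    return 0;
}
```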
Engineering implications: As LLM workloads increasingly dominate data center resources, understanding how each architecture behaves under inference load enables more cost-effective deployment strategies and infrastructure planning.
Paper: Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures