
Engineering Reliable LLM Accelerators
Statistical fault tolerance without compromising performance
ReaLM takes a statistical approach to fault tolerance in LLM hardware accelerators: it protects only the computations where hardware faults measurably affect model output, which reduces protection overhead while maintaining reliability.
- Analyzes the inherent fault tolerance of LLMs to determine which computational steps are most vulnerable to hardware faults
- Implements selective protection using algorithm-based fault tolerance (ABFT) only on critical operations
- Achieves 99.8% fault detection rate while reducing overhead by 62.3% compared to conventional methods
- Demonstrates 1.47× speedup and energy savings of 30.1% for LLM inference
This research enables more efficient and reliable LLM deployment in resource-constrained environments, addressing a key obstacle to wider use of AI in embedded systems.
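
To make the selective ABFT idea concrete, here is a minimal sketch in Python with NumPy. It is not ReaLM's implementation. It shows the classic column-checksum form of ABFT for a matrix multiply, applied only when a caller marks the operation as critical. The function names, the `critical` flag, the `inject` hook used to simulate a fault, the tolerance `rtol`, and the recompute-on-mismatch recovery are all assumptions made for illustration. In the summarized work, which operations count as critical comes from a statistical analysis of the model; here the flag is simply passed in by hand.

```python
import numpy as np


def checksum_mismatch(a, b, c, rtol=1e-9):
    """Return True if c is inconsistent with a @ b under a column checksum.

    Uses the identity (1^T a) b == 1^T (a b): the column sums of the product
    must equal the product of a's column sums with b.
    """
    expected = a.sum(axis=0) @ b      # checksum computed from the inputs
    observed = c.sum(axis=0)          # checksum computed from the result
    # Scale the tolerance by a magnitude bound so rounding noise is ignored.
    scale = np.abs(a).sum(axis=0) @ np.abs(b) + 1e-30
    return bool(np.any(np.abs(expected - observed) > rtol * scale))


def matmul_selective(a, b, critical, inject=None):
    """Multiply a and b, checking the result only if `critical` is True.

    `inject` is a hypothetical hook that corrupts the result to simulate a
    hardware fault. Returns (result, fault_detected).
    """
    c = a @ b
    if inject is not None:
        c = inject(c)
    if not critical:
        # Unprotected path: no checksum cost, faults pass through silently.
        return c, False
    if checksum_mismatch(a, b, c):
        # Assumed recovery policy for this sketch: recompute once.
        return a @ b, True
    return c, False


def flip_one_entry(c):
    """Simulated fault: perturb a single output element."""
    c = c.copy()
    c[2, 1] += 1.0
    return c


rng = np.random.default_rng(0)
a = rng.standard_normal((8, 16))
b = rng.standard_normal((16, 4))

protected, detected = matmul_selective(a, b, critical=True, inject=flip_one_entry)
unprotected, _ = matmul_selective(a, b, critical=False, inject=flip_one_entry)
print("fault detected on protected op:", detected)
print("protected result correct:", np.allclose(protected, a @ b))
print("unprotected result correct:", np.allclose(unprotected, a @ b))
```

The checksum adds one extra vector-matrix product and two column sums per protected multiply, so skipping it on operations that tolerate faults is where the overhead saving in a selective scheme would come from.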