
Breaking Barriers: FP8 Training at Trillion-Token Scale
Solving critical instabilities for unprecedented LLM training efficiency
Researchers achieve a 20x increase in the scale of FP8-precision training, successfully training LLMs on up to 2 trillion tokens for the first time.
- Identified and resolved critical instabilities in FP8 training that only emerge during extended training runs
- Traced the instability to outlier amplification by the SwiGLU activation function (a minimal sketch follows this list)
- Provided both analytical and empirical evidence that the amplification is precision-dependent
- Developed solutions that enable stable, efficient training at unprecedented scale
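As a rough illustration only (not the researchers' implementation), the NumPy sketch below shows why SwiGLU's elementwise product of two linear branches can effectively square an input outlier, which matters far more in FP8's narrow dynamic range than in BF16. The dimensions, weight scales, injected outlier, and the "aligned" weight channel are all hypothetical.

```python
import numpy as np

def swiglu(x, w_gate, w_up):
    """SwiGLU: Swish(x @ w_gate) * (x @ w_up).

    Because the two branches are multiplied elementwise, an input outlier
    that feeds both branches is effectively squared in the output.
    """
    gate = x @ w_gate
    up = x @ w_up
    swish = gate * (1.0 / (1.0 + np.exp(-gate)))  # Swish / SiLU
    return swish * up

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(1, d)).astype(np.float32)
x[0, 0] = 50.0                                    # injected input outlier

w_gate = rng.normal(scale=0.02, size=(d, d)).astype(np.float32)
w_up = rng.normal(scale=0.02, size=(d, d)).astype(np.float32)
# Hypothetical "aligned" channel: the outlier feature feeds the same output
# channel in both branches, as correlated weights can after long training.
w_gate[0, 0] = 1.0
w_up[0, 0] = 1.0

out = swiglu(x, w_gate, w_up)
print("max |input|  :", float(np.abs(x).max()))    # ~50
print("max |output| :", float(np.abs(out).max()))  # ~2500, near the outlier squared
# FP8 formats (e.g. E4M3, max ~448) have far less dynamic range than BF16,
# so squared outliers like this are the kind that can push activations or
# gradients out of representable range during extended runs.
```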
This breakthrough enables significantly more efficient LLM training with reduced memory and computational requirements, potentially democratizing access to state-of-the-art AI development capabilities.