
Breaking Barriers: FP8 Training at Trillion-Token Scale
Solving critical instabilities for unprecedented LLM training efficiency
Researchers achieve a 20x increase in the scale of FP8-precision training, successfully training LLMs on up to 2 trillion tokens for the first time.
- Identified and resolved critical instabilities in FP8 training that only emerge during extended training runs
- Traced the instability to outlier amplification by the SwiGLU activation function (a minimal sketch follows this list)
- Provided both analytical and empirical evidence that the amplification is precision-dependent
- Developed solutions that enable stable, efficient training at unprecedented scale
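As a rough illustration only (not the researchers' implementation), the NumPy sketch below shows why SwiGLU's elementwise product of two linear branches can effectively square an input outlier, which matters far more in FP8's narrow dynamic range than in BF16. The dimensions, weight scales, injected outlier, and the "aligned" weight channel are all hypothetical.

```python
import numpy as np

def swiglu(x, w_gate, w_up):
    """SwiGLU: Swish(x @ w_gate) * (x @ w_up).

    Because the two branches are multiplied elementwise, an input outlier
    that feeds both branches is effectively squared in the output.
    """
    gate = x @ w_gate
    up = x @ w_up
    swish = gate * (1.0 / (1.0 + np.exp(-gate)))  # Swish / SiLU
    return swish * up

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(1, d)).astype(np.float32)
x[0, 0] = 50.0                                    # injected input outlier

w_gate = rng.normal(scale=0.02, size=(d, d)).astype(np.float32)
w_up = rng.normal(scale=0.02, size=(d, d)).astype(np.float32)
# Hypothetical "aligned" channel: the outlier feature feeds the same output
# channel in both branches, as correlated weights can after long training.
w_gate[0, 0] = 1.0
w_up[0, 0] = 1.0

out = swiglu(x, w_gate, w_up)
print("max |input|  :", float(np.abs(x).max()))    # ~50
print("max |output| :", float(np.abs(out).max()))  # ~2500, near the outlier squared
# FP8 formats (e.g. E4M3, max ~448) have far less dynamic range than BF16,
# so squared outliers like this are the kind that can push activations or
# gradients out of representable range during extended runs.
```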
This breakthrough enables significantly more efficient LLM training with reduced memory and computational requirements, potentially democratizing access to state-of-the-art AI development capabilities.