Breaking Barriers: FP8 Training at Trillion-Token Scale

Solving critical instabilities for unprecedented LLM training efficiency

Researchers achieve a 20-fold increase over previous FP8-precision training limits, successfully training LLMs on up to 2 trillion tokens for the first time.

  • Identified and resolved critical instabilities in FP8 training that only emerge during extended training runs
  • Traced the instability to outlier amplification by the SwiGLU activation function (illustrated in the first sketch below)
  • Demonstrated both analytical and empirical evidence for precision-dependent amplification
  • Developed solutions that enable stable, efficient training at unprecedented scale (see the second sketch below)
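
To make the amplification mechanism concrete, here is a minimal PyTorch sketch (illustrative only, not the paper's code). SwiGLU multiplies a SiLU-gated branch with a linear branch; when the two branches become correlated, an outlier passes through both and is effectively squared on the way out, while FP8 E4M3 tops out at 448.

```python
import torch
import torch.nn.functional as F

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def swiglu(x, w_gate, w_up):
    # SwiGLU: elementwise product of a SiLU-gated branch and a linear branch
    return F.silu(x @ w_gate) * (x @ w_up)

torch.manual_seed(0)
d = 64
w = 0.1 * torch.randn(d, d)
x = torch.randn(1, d)
x[0, 0] = 300.0  # a single outlier activation

# Fully correlated branches (identical weights): the outlier hits both at once.
y = swiglu(x, w, w)
z_max = float((x @ w).abs().max())
print(f"branch max |z| = {z_max:.0f}, output max = {float(y.abs().max()):.0f}, "
      f"FP8 E4M3 max = {E4M3_MAX:.0f}")
# Since SiLU(z) ~ z for large z, the output scales like z**2: a value that
# fits comfortably in BF16 can overflow FP8 range after a single SwiGLU.
```

Because the amplification is quadratic in the outlier's magnitude, it only surfaces once training has run long enough for the branch weights to become correlated, which is why shorter runs appear stable.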

This breakthrough enables significantly more efficient LLM training with reduced memory and computational requirements, potentially democratizing access to state-of-the-art AI development capabilities.
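
The solutions center on keeping the gated product inside FP8's representable range. The sketch below illustrates that general rescale-quantize-compensate pattern under assumptions: `fp8_e4m3_quantize` and `smooth_swiglu` are hypothetical names, and the per-tensor max-based scale stands in for whatever scaling scheme the paper actually uses.

```python
import torch
import torch.nn.functional as F

E4M3_MAX = 448.0

def fp8_e4m3_quantize(t, scale):
    # Scale into E4M3 range, clamp, and cast (requires PyTorch >= 2.1)
    return (t * scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)

def smooth_swiglu(x, w_gate, w_up):
    # Hypothetical sketch: compute the gated product in higher precision,
    # choose a scale that maps its max inside E4M3 range, quantize, and
    # hand the inverse scale to the next layer to undo the rescaling.
    h = F.silu(x @ w_gate) * (x @ w_up)
    scale = E4M3_MAX / h.abs().max().clamp(min=1e-12)
    return fp8_e4m3_quantize(h, scale), 1.0 / scale

torch.manual_seed(0)
d = 64
x = torch.randn(1, d)
x[0, 0] = 300.0  # same outlier as before
w = 0.1 * torch.randn(d, d)
h_fp8, inv_scale = smooth_swiglu(x, w, w)
print(h_fp8.dtype, float(inv_scale))  # fits in FP8; next layer applies 1/scale
```

Because the inverse scale is carried alongside the quantized tensor, the downstream projection can multiply it back in after its matmul, preserving dynamic range without the intermediate activation ever overflowing FP8.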

Scaling FP8 training to trillion-token LLMs
