
Optimizing Transformer Training Speed
Leveraging Sharpness Disparity for Faster LLM Pre-training
This research identifies that different components within transformer models exhibit distinctly different sharpness, i.e. local curvature of the training loss, throughout training, and that this disparity can be exploited to design faster optimization strategies.
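To make the notion of sharpness concrete, one common way to quantify it for a parameter block is through the curvature of the training loss restricted to that block. The specific measures sketched below are illustrative assumptions, not necessarily the definition used in the paper.

```latex
% Illustrative (assumed) blockwise sharpness measures for a parameter block \theta_b
% (e.g. embedding, attention, or FFN weights), with training loss L(\theta):
\[
  S_{\mathrm{trace}}(\theta_b) = \frac{1}{|\theta_b|}\operatorname{Tr}\!\bigl(\nabla^2_{\theta_b} L(\theta)\bigr),
  \qquad
  S_{\max}(\theta_b) = \lambda_{\max}\!\bigl(\nabla^2_{\theta_b} L(\theta)\bigr).
\]
```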
- Discovers a persistent Sharpness Disparity across the parameter blocks of a transformer (embedding, attention, and FFN weights)
- Introduces a Blockwise Learning Rate strategy that assigns each parameter block its own learning rate according to its sharpness (a minimal sketch follows this list)
- Demonstrates 30% faster convergence in pre-training while maintaining or improving model quality
- Requires minimal implementation effort with significant performance gains
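As a rough sketch of what a blockwise learning-rate setup could look like in practice, the snippet below groups a model's parameters by block and hands each group its own learning rate via PyTorch's optimizer parameter groups. The name-matching keywords, the base learning rate, and the per-block multipliers are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal sketch: per-block learning rates via optimizer parameter groups.
# Block-name keywords and LR multipliers are illustrative assumptions,
# not the settings reported by the paper.
import torch
from torch import nn


def blockwise_param_groups(model: nn.Module, base_lr: float) -> list[dict]:
    """Split parameters into embedding / attention / FFN / other groups,
    each with its own learning rate."""
    # Hypothetical multipliers: blocks observed to be flatter could get larger LRs.
    multipliers = {"embed": 1.0, "attn": 2.0, "ffn": 2.0, "other": 1.0}
    groups = {key: [] for key in multipliers}

    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if "embed" in name:
            groups["embed"].append(param)
        elif "attn" in name or "attention" in name:
            groups["attn"].append(param)
        elif "mlp" in name or "ffn" in name:
            groups["ffn"].append(param)
        else:
            groups["other"].append(param)

    # One parameter group per block, each with its own scaled learning rate.
    return [
        {"params": params, "lr": base_lr * multipliers[key]}
        for key, params in groups.items()
        if params
    ]


# Usage with a standard AdamW setup (model is any transformer nn.Module):
# optimizer = torch.optim.AdamW(blockwise_param_groups(model, base_lr=3e-4),
#                               betas=(0.9, 0.95), weight_decay=0.1)
```

Parameter groups are the standard PyTorch mechanism for per-group hyperparameters, so a scheme like this slots into an existing AdamW training loop without changing the optimizer itself.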
This engineering advancement matters because it offers a practical way to reduce LLM training costs and energy consumption without architectural changes, making advanced AI more accessible and sustainable.
Paper: The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training