Accelerating LLM Training Across Distributed Data Centers

Layer-wise Scheduling for Efficient Data Parallel Training

DreamDDP introduces a novel approach that significantly reduces communication bottlenecks in geo-distributed LLM training while maintaining model accuracy.

  • Implements layer-wise scheduled partial synchronization that reduces communication overhead by up to 50%
  • Strategically synchronizes different layers at different frequencies based on their convergence properties (see the sketch after this list)
  • Maintains model accuracy comparable to full synchronization while dramatically improving training efficiency
  • Addresses growing concerns around data privacy by enabling efficient training across geographically separated data centers
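
Below is a minimal sketch of the layer-wise scheduled partial synchronization idea, assuming a PyTorch data-parallel setup. The function name, schedule format, and period values are illustrative assumptions, not DreamDDP's actual implementation: each layer is assigned a synchronization period, and only the layers whose period is due at the current step are averaged across replicas.

```python
import torch
import torch.distributed as dist

def layerwise_partial_sync(model: torch.nn.Module, step: int,
                           sync_period: dict[str, int]) -> None:
    """Average across replicas only the layers whose schedule is due this step.

    sync_period maps parameter names to a synchronization interval in steps;
    layers absent from the map are synchronized every step (period 1).
    """
    world_size = dist.get_world_size()
    for name, param in model.named_parameters():
        period = sync_period.get(name, 1)
        if step % period == 0:
            # Layers synchronized less often generate less inter-data-center
            # traffic, which is where the communication savings come from.
            dist.all_reduce(param.data, op=dist.ReduceOp.SUM)
            param.data.div_(world_size)

# Illustrative schedule (hypothetical parameter names): sync the embedding
# every step, and a deeper transformer block only every 4 steps.
# schedule = {"embed.weight": 1, "blocks.7.mlp.fc1.weight": 4}
```

In practice, the per-layer schedule would be derived from each layer's observed convergence behavior, trading a bounded amount of per-layer staleness against cross-site bandwidth.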

This research advances distributed systems engineering by optimizing the fundamental trade-off between communication costs and model convergence, making large-scale LLM training more practical across bandwidth-constrained environments.

DreamDDP: Accelerating Data Parallel Distributed LLM Training with Layer-wise Scheduled Partial Synchronization