
DiSCo: Optimizing LLM Text Streaming
A device-server collaborative approach for better performance and lower costs
DiSCo introduces a hybrid architecture that intelligently distributes LLM processing between user devices and cloud servers to optimize text streaming services.
- Addresses critical QoE metrics: Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT)
- Combines on-device small models for faster initial responses with server models for higher-quality completions
- Cuts latency by up to 56% while also reducing operational costs
- Dynamically adapts to network conditions and query complexity
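The core idea behind the bullets above can be sketched as a dispatch policy: route the first tokens to whichever path (on-device small model or cloud server) is expected to deliver them sooner, biased toward the cheaper device path. This is a minimal illustrative sketch, not DiSCo's actual algorithm; the `PathEstimate` type, `choose_first_token_path` function, and the `slack` parameter are all hypothetical names introduced here.

```python
from dataclasses import dataclass

@dataclass
class PathEstimate:
    """Estimated behavior of one generation path."""
    ttft: float            # expected seconds before the first token arrives
    cost_per_token: float  # relative operational cost of this path

def choose_first_token_path(device: PathEstimate, server: PathEstimate,
                            slack: float = 0.05) -> str:
    """Pick which path should stream the first tokens.

    Prefer the cheaper on-device path unless the server is expected to be
    faster by more than `slack` seconds; the server model can still take
    over later to finish the response at higher quality.
    """
    if server.ttft + slack < device.ttft:
        return "server"
    return "device"

# Good network, lightly loaded server: the server wins on TTFT.
fast_net = choose_first_token_path(PathEstimate(ttft=0.40, cost_per_token=0.0),
                                   PathEstimate(ttft=0.15, cost_per_token=1.0))
# Congested network: keep the first tokens on-device to protect TTFT.
slow_net = choose_first_token_path(PathEstimate(ttft=0.40, cost_per_token=0.0),
                                   PathEstimate(ttft=0.90, cost_per_token=1.0))
```

In a real deployment the TTFT estimates would come from live measurements of network round-trip time and server queueing delay, which is what makes the dispatch adaptive.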
For engineering teams, this research offers a practical framework for balancing performance and cost in LLM deployments, especially for applications that require real-time interaction at scale.
DiSCo: Device-Server Collaborative LLM-Based Text Streaming Services