
DiSCo: Optimizing LLM Text Streaming
A device-server collaborative approach for better performance and lower costs
DiSCo introduces a hybrid architecture that intelligently distributes LLM processing between user devices and cloud servers to optimize text streaming services.
- Addresses critical QoE metrics: Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT)
- Combines on-device small models for faster initial responses with server models for higher-quality completions
- Cuts latency by up to 56% while also reducing operational costs
- Dynamically adapts to network conditions and query complexity
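The core idea behind the bullets above can be sketched as a dispatch policy: route the first tokens to whichever path (on-device small model or cloud server) is expected to deliver them sooner, biased toward the cheaper device path. This is a minimal illustrative sketch, not DiSCo's actual algorithm; the `PathEstimate` type, `choose_first_token_path` function, and the `slack` parameter are all hypothetical names introduced here.

```python
from dataclasses import dataclass

@dataclass
class PathEstimate:
    """Estimated behavior of one generation path."""
    ttft: float            # expected seconds before the first token arrives
    cost_per_token: float  # relative operational cost of this path

def choose_first_token_path(device: PathEstimate, server: PathEstimate,
                            slack: float = 0.05) -> str:
    """Pick which path should stream the first tokens.

    Prefer the cheaper on-device path unless the server is expected to be
    faster by more than `slack` seconds; the server model can still take
    over later to finish the response at higher quality.
    """
    if server.ttft + slack < device.ttft:
        return "server"
    return "device"

# Good network, lightly loaded server: the server wins on TTFT.
fast_net = choose_first_token_path(PathEstimate(ttft=0.40, cost_per_token=0.0),
                                   PathEstimate(ttft=0.15, cost_per_token=1.0))
# Congested network: keep the first tokens on-device to protect TTFT.
slow_net = choose_first_token_path(PathEstimate(ttft=0.40, cost_per_token=0.0),
                                   PathEstimate(ttft=0.90, cost_per_token=1.0))
```

In a real deployment the TTFT estimates would come from live measurements of network round-trip time and server queueing delay, which is what makes the dispatch adaptive.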
For engineering teams, this research offers a practical framework for balancing performance and cost in LLM deployments, especially for applications that require real-time interaction at scale.
DiSCo: Device-Server Collaborative LLM-Based Text Streaming Services