
Breaking the Context Length Barrier for LLMs
Efficient Inference for Multi-Million Token Contexts Without Approximations
Medha is a serving system that handles LLM requests with extremely long contexts (millions of tokens) efficiently, maintaining high performance without resorting to simplifying approximations.
- Introduces Adaptive Chunking to dynamically adjust prefill chunk sizes based on the inference phase (see the sketch after this list)
- Implements Sequence Pipeline Parallelism to distribute computation across GPUs while preserving sequence integrity
- Employs KV Cache Parallelism to handle the memory demands of multi-million-token contexts (see the attention-sharding sketch at the end of this summary)
- Achieves near-constant Time to First Token (TTFT) and Time per Output Token (TPOT) regardless of context length
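As a rough illustration of the adaptive chunking idea, the sketch below chooses a prefill chunk size from a toy cost model so that each iteration stays within a latency budget as the cached prefix grows. The cost model, constants, and function names are assumptions made for illustration, not Medha's actual scheduler.

```python
# Minimal sketch of adaptive chunking (illustrative only): pick the prefill
# chunk size so the estimated iteration time stays within a latency budget as
# the processed prefix grows. The cost model and all constants are assumptions,
# not Medha's actual implementation.

def estimate_iteration_ms(chunk_tokens: int, prefix_tokens: int,
                          linear_ms_per_token: float = 0.002,
                          attn_ms_per_token_pair: float = 1e-7) -> float:
    """Toy cost model: MLP/linear work scales with the chunk size, attention
    scales with chunk size times the cached prefix length."""
    return (chunk_tokens * linear_ms_per_token
            + chunk_tokens * prefix_tokens * attn_ms_per_token_pair)

def adaptive_chunk_size(prefix_tokens: int, budget_ms: float = 50.0,
                        min_chunk: int = 128, max_chunk: int = 8192) -> int:
    """Largest chunk size (within bounds) whose estimated iteration fits the budget."""
    chunk = max_chunk
    while chunk > min_chunk and estimate_iteration_ms(chunk, prefix_tokens) > budget_ms:
        chunk //= 2
    return chunk

# Early in the prefill the scheduler can afford large chunks; as the prefix
# approaches millions of tokens, chunks shrink so each iteration (and any
# co-scheduled decode work) keeps a bounded latency.
for prefix in (0, 100_000, 500_000, 1_000_000, 2_000_000):
    print(f"prefix={prefix:>9,}  chunk={adaptive_chunk_size(prefix)}")
```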
This research is significant for engineering teams developing inference systems that need to handle extremely long documents or conversations without degrading performance or resorting to approximations that might reduce accuracy.
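To make the "without approximations" point concrete, the sketch below shows how attention over a KV cache sharded across workers can still be computed exactly: each worker attends over its shard of keys and values, and the partials are merged with a log-sum-exp reduction that reproduces unsharded attention. The single-query setting, function names, and sharding scheme are illustrative assumptions, not Medha's implementation.

```python
# Minimal sketch of KV-cache-parallel attention for a single decode query:
# cached keys/values are sharded across workers, each worker computes partial
# attention over its shard, and partials are merged exactly (no approximation).
import numpy as np

def partial_attention(q, K_shard, V_shard):
    """One worker: return (max score, softmax denominator, unnormalized output)."""
    scores = K_shard @ q / np.sqrt(q.shape[-1])
    m = scores.max()
    w = np.exp(scores - m)
    return m, w.sum(), w @ V_shard

def merge_partials(partials):
    """Combine per-shard partials into the exact global attention output."""
    m_global = max(m for m, _, _ in partials)
    denom = sum(np.exp(m - m_global) * l for m, l, _ in partials)
    numer = sum(np.exp(m - m_global) * o for m, _, o in partials)
    return numer / denom

rng = np.random.default_rng(0)
d, n_ctx, n_workers = 64, 4096, 4
q = rng.normal(size=d)
K, V = rng.normal(size=(n_ctx, d)), rng.normal(size=(n_ctx, d))

sharded = merge_partials([
    partial_attention(q, Ks, Vs)
    for Ks, Vs in zip(np.array_split(K, n_workers), np.array_split(V, n_workers))
])

scores = K @ q / np.sqrt(d)
reference = np.exp(scores - scores.max()) @ V / np.exp(scores - scores.max()).sum()
assert np.allclose(sharded, reference)  # sharding changes memory placement, not the result
```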