Smarter Memory for LLMs

Optimizing KV cache for efficient text generation

WeightedKV introduces a novel approach to managing memory consumption in Large Language Models without sacrificing generation quality.

  • Merges KV cache entries using attention score weighting instead of discarding tokens
  • Maintains a fixed memory footprint while preserving representation of all tokens
  • Achieves comparable performance to full KV cache methods while using significantly less memory
  • Particularly valuable for long-context generation scenarios
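The core idea above, merging rather than evicting cache entries, can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: it assumes the merge rule is a convex combination of key/value vectors weighted by each entry's accumulated attention score, with the least-attended entry repeatedly fused into a neighbor until the cache fits a fixed budget.

```python
import numpy as np

def merge_kv_cache(keys, values, scores, budget):
    """Shrink a KV cache to `budget` entries by merging, not evicting.

    keys, values: (n, d) arrays of cached key/value vectors.
    scores: (n,) positive accumulated attention mass per entry.

    Hypothetical merge rule (for illustration only): fuse the
    lowest-scoring entry into its neighbor via a score-weighted
    average, so every token retains some representation.
    """
    keys, values, scores = keys.copy(), values.copy(), scores.copy()
    while len(scores) > budget:
        i = int(np.argmin(scores))       # least-attended entry
        j = i - 1 if i > 0 else i + 1    # merge target: a neighbor
        w = scores[[i, j]] / scores[[i, j]].sum()
        keys[j] = w[0] * keys[i] + w[1] * keys[j]
        values[j] = w[0] * values[i] + w[1] * values[j]
        scores[j] = scores[i] + scores[j]  # merged entry keeps total mass
        keys = np.delete(keys, i, axis=0)
        values = np.delete(values, i, axis=0)
        scores = np.delete(scores, i)
    return keys, values, scores
```

Because entries are averaged instead of dropped, the memory footprint stays fixed at `budget` entries while the total attention mass, and some signal from every token, is preserved in the compressed cache.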

This engineering advancement helps overcome memory bottlenecks in LLM deployment, enabling more efficient and cost-effective text generation at scale.

WeightedKV: Attention Scores Weighted Key-Value Cache Merging for Large Language Models
