
Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models
By Alina Shutova, Vladimir Malinovskii, et al.
Abstract:
Efficient real-world deployments of large language models (LLMs) rely on Key-Value (KV) caching for processing and generating long outputs, reducing the need for repetitive computation. For large contexts, Key-Value caches can take up tens of gigabytes of device memory, as they store vector representations for each token and layer.
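
To make the "tens of gigabytes" figure concrete, here is a back-of-the-envelope estimate. The model shapes (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and the 128K-token context are illustrative assumptions for a 70B-class model, not values taken from the paper; the 2-bit line simply shows what an aggressive quantization width would save.

```python
# Back-of-the-envelope KV-cache size for one sequence.
# Shapes below (80 layers, 8 KV heads, head_dim 128) are illustrative
# assumptions for a 70B-class model, not numbers from the paper.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_value: float) -> float:
    """Total bytes needed to cache Keys and Values for one sequence."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * n_tokens

fp16 = kv_cache_bytes(80, 8, 128, n_tokens=128_000, bytes_per_value=2)
int2 = kv_cache_bytes(80, 8, 128, n_tokens=128_000, bytes_per_value=2 / 8)

print(f"fp16 cache:  {fp16 / 2**30:.1f} GiB")  # ~39.1 GiB
print(f"2-bit cache: {int2 / 2**30:.1f} GiB")  # ~4.9 GiB
```
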
Key points:
- Adaptive quantization of Key-Value (KV) caches for large language models
- Engineering focus: shrinking the tens-of-gigabytes memory footprint of long-context inference (a generic quantization sketch follows below)
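
For readers unfamiliar with the mechanism, below is a minimal sketch of plain per-group round-to-nearest quantization, the generic building block that KV-cache compression schemes start from. This is not the paper's adaptive method; the 4-bit width, group size of 64, and toy Key tensor are arbitrary choices for illustration.

```python
# Minimal sketch of per-group round-to-nearest quantization, the generic
# building block of KV-cache compression. This is NOT the paper's
# adaptive scheme; bit width (4) and group size (64) are assumptions.
import numpy as np

def quantize_groups(x: np.ndarray, bits: int = 4, group: int = 64):
    """Quantize a flat float array in fixed-size groups; return integer
    codes plus the per-group scale and offset needed to dequantize."""
    x = x.reshape(-1, group)
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    scale = np.where(scale == 0, 1.0, scale)  # guard constant groups
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize_groups(codes, scale, lo):
    return codes * scale + lo

keys = np.random.randn(8, 64).astype(np.float32)  # toy cached Key vectors
codes, scale, lo = quantize_groups(keys.ravel())
restored = dequantize_groups(codes, scale, lo).reshape(keys.shape)
print("max abs reconstruction error:", float(np.abs(keys - restored).max()))
```

Note that the per-group scale and offset are stored alongside the codes, so the effective bits per value land slightly above the nominal code width, which is why compression papers typically report effective rather than nominal bit rates.
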
Source: Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models