
Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models
By Alina Shutova, Vladimir Malinovskii, et al.
Abstract:
Efficient real-world deployments of large language models (LLMs) rely on Key-Value (KV) caching for processing and generating long outputs, reducing the need for repetitive computation. For large contexts, Key-Value caches can take up tens of gigabytes of device memory, as they store vector representations for each token and layer.
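
To make the "tens of gigabytes" figure concrete, here is a back-of-the-envelope estimate. The model shapes (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and the 128K-token context are illustrative assumptions for a 70B-class model, not values taken from the paper; the 2-bit line simply shows what an aggressive quantization width would save.

```python
# Back-of-the-envelope KV-cache size for one sequence.
# Shapes below (80 layers, 8 KV heads, head_dim 128) are illustrative
# assumptions for a 70B-class model, not numbers from the paper.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_value: float) -> float:
    """Total bytes needed to cache Keys and Values for one sequence."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * n_tokens

fp16 = kv_cache_bytes(80, 8, 128, n_tokens=128_000, bytes_per_value=2)
int2 = kv_cache_bytes(80, 8, 128, n_tokens=128_000, bytes_per_value=2 / 8)

print(f"fp16 cache:  {fp16 / 2**30:.1f} GiB")  # ~39.1 GiB
print(f"2-bit cache: {int2 / 2**30:.1f} GiB")  # ~4.9 GiB
```
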
Key points:
- Adaptive quantization of Key-Value (KV) caches for large language models
- Engineering focus: shrinking the tens-of-gigabytes memory footprint of long-context inference (a generic quantization sketch follows below)
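
For readers unfamiliar with the mechanism, below is a minimal sketch of plain per-group round-to-nearest quantization, the generic building block that KV-cache compression schemes start from. This is not the paper's adaptive method; the 4-bit width, group size of 64, and toy Key tensor are arbitrary choices for illustration.

```python
# Minimal sketch of per-group round-to-nearest quantization, the generic
# building block of KV-cache compression. This is NOT the paper's
# adaptive scheme; bit width (4) and group size (64) are assumptions.
import numpy as np

def quantize_groups(x: np.ndarray, bits: int = 4, group: int = 64):
    """Quantize a flat float array in fixed-size groups; return integer
    codes plus the per-group scale and offset needed to dequantize."""
    x = x.reshape(-1, group)
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    scale = np.where(scale == 0, 1.0, scale)  # guard constant groups
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize_groups(codes, scale, lo):
    return codes * scale + lo

keys = np.random.randn(8, 64).astype(np.float32)  # toy cached Key vectors
codes, scale, lo = quantize_groups(keys.ravel())
restored = dequantize_groups(codes, scale, lo).reshape(keys.shape)
print("max abs reconstruction error:", float(np.abs(keys - restored).max()))
```

Note that the per-group scale and offset are stored alongside the codes, so the effective bits per value land slightly above the nominal code width, which is why compression papers typically report effective rather than nominal bit rates.
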
Source: Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models