Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models

By Alina Shutova, Vladimir Malinovskii...

Abstract:

Efficient real-world deployments of large language models (LLMs) rely on Key-Value (KV) caching for processing and generating long outputs, reducing the need for repetitive computation. For large contexts, Key-Value caches can take up tens of gigabytes of device memory, as they store vector representations...
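The memory figure the abstract cites is easy to sanity-check with back-of-the-envelope arithmetic, and round-to-nearest quantization is the standard baseline such compression work builds on. The NumPy sketch below is illustrative only: the model configuration is a hypothetical assumption (not taken from the paper), and the per-channel int8 quantizer is a generic baseline, not the paper's adaptive method.

```python
import numpy as np

# Hypothetical long-context model configuration (illustrative
# assumptions, not numbers from the paper).
n_layers = 80
n_kv_heads = 8            # grouped-query attention
head_dim = 128
seq_len = 128_000         # long context
bytes_fp16 = 2

# Both Keys and Values are cached: two tensors per layer of shape
# [seq_len, n_kv_heads * head_dim].
kv_bytes = 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_fp16
print(f"fp16 KV cache: {kv_bytes / 2**30:.1f} GiB")  # ~39 GiB here

def quantize(x: np.ndarray, bits: int = 8):
    """Symmetric round-to-nearest quantization with one scale per
    feature channel, computed over the token axis (axis 0)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=0, keepdims=True) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Round-trip a synthetic Key tensor to gauge quantization error.
rng = np.random.default_rng(0)
keys = rng.normal(size=(1024, n_kv_heads * head_dim)).astype(np.float32)
q, s = quantize(keys, bits=8)
err = np.abs(dequantize(q, s) - keys).mean()
print(f"mean abs error after int8 round-trip: {err:.4f}")
```

Under these assumptions the fp16 cache alone is roughly 39 GiB, consistent with the "tens of gigabytes" figure above; the simple quantizer shown is the kind of fixed-rate baseline that adaptive KV-cache compression aims to improve on.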

Key points:

  • Research on compressing Key-Value caches in large language models via quantization
  • Engineering application: memory-efficient long-context LLM deployment

Source: Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models
