Unlocking Long-Context LLMs on Consumer Devices

Efficient memory management with trained retaining heads

Locret introduces a novel approach that enables long-context LLM inference on consumer-grade devices by intelligently managing key-value (KV) cache memory.

  • Employs trained retaining heads to identify and retain only the most valuable information in context
  • Reduces memory footprint by up to 80% compared to traditional approaches
  • Achieves superior performance on long-context tasks while maintaining inference quality
  • Enables streaming input processing without excessive memory requirements
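The core idea behind the bullets above is cache eviction guided by learned importance scores: the retaining heads score each cached token, and only the highest-scoring entries are kept within a fixed memory budget. The sketch below is a minimal illustration of that budgeted top-k eviction step; the function name `evict_kv_cache` and the pre-computed `scores` array are assumptions for illustration (in Locret, such scores would come from the trained retaining heads, which are not reproduced here).

```python
import numpy as np

def evict_kv_cache(keys, values, scores, budget):
    """Keep only the `budget` highest-scoring KV cache entries.

    Hypothetical sketch: `scores` stands in for the per-token
    importance that trained retaining heads would predict.
    Entries are returned in their original sequence order.
    """
    if keys.shape[0] <= budget:
        return keys, values
    # Indices of the top-`budget` scores, restored to original order
    keep = np.sort(np.argpartition(scores, -budget)[-budget:])
    return keys[keep], values[keep]

# Toy usage: 6 cached tokens with 4-dim keys/values, budget of 3
keys = np.arange(24, dtype=float).reshape(6, 4)
values = keys.copy()
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.05, 0.7])
kept_k, kept_v = evict_kv_cache(keys, values, scores, budget=3)
# → keeps the entries at positions 1, 3, and 5 (the three highest scores)
```

Keeping the surviving entries in sequence order (via `np.sort`) matters in practice, since positional information in the cache must stay consistent for subsequent attention steps.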

This research represents a significant engineering advance that democratizes access to long-context LLMs, allowing deployment on standard laptops and mobile devices without specialized hardware.

Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads on Consumer-Grade Devices

80 | 521