
Unlocking Long-Context LLMs on Consumer Devices
Efficient memory management with trained retaining heads
Locret introduces a novel approach that enables long-context LLM inference on consumer-grade devices by intelligently managing the key-value (KV) cache.
- Employs trained retaining heads to score cached key-value entries and retain only the most important ones (see the sketch after this list)
- Reduces memory footprint by up to 80% compared with conventional full-cache inference
- Delivers strong results on long-context tasks without sacrificing output quality
- Supports streaming input processing without excessive memory growth
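To make the core idea concrete, here is a minimal, hypothetical sketch of score-based KV cache eviction: a small learned head assigns an importance score to each cached entry, and only the top-scoring entries are kept within a fixed budget. The names (RetainingHead, evict_kv_cache, budget) and the MLP architecture are illustrative assumptions, not Locret's actual implementation or API.

```python
# Hedged sketch of KV cache eviction with a small "retaining head".
# All names and shapes are illustrative; this only conveys the general idea
# of scoring cache entries and keeping the highest-scoring ones.
import torch
import torch.nn as nn

class RetainingHead(nn.Module):
    """Tiny MLP that predicts an importance score for each cached token."""
    def __init__(self, head_dim: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * head_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, keys: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
        # keys, values: [seq_len, head_dim] -> scores: [seq_len]
        return self.mlp(torch.cat([keys, values], dim=-1)).squeeze(-1)

def evict_kv_cache(keys, values, scores, budget: int):
    """Keep only the `budget` highest-scoring entries of one head's cache."""
    if keys.size(0) <= budget:
        return keys, values
    keep = torch.topk(scores, k=budget).indices.sort().values  # preserve token order
    return keys[keep], values[keep]

# Usage: after processing a chunk of input, score the cache and evict down to budget.
head_dim, budget = 128, 1024
retain = RetainingHead(head_dim)
keys = torch.randn(4096, head_dim)
values = torch.randn(4096, head_dim)
with torch.no_grad():
    scores = retain(keys, values)
keys, values = evict_kv_cache(keys, values, scores, budget)
print(keys.shape)  # torch.Size([1024, 128])
```

Applied chunk by chunk during streaming prefill, this kind of budgeted eviction keeps peak cache memory bounded regardless of input length, which is what makes consumer-device deployment plausible.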
This work is a practical engineering advance that broadens access to long-context LLMs, enabling deployment on standard laptops and mobile devices without specialized hardware.