
Accelerating LLMs with Zero-Overhead Memory Compression
Solving the Key-Value Cache bottleneck for faster inference
ZACK introduces a novel approach to dimensionality compression of the Key-Value Cache for LLM inference, eliminating compression and decompression overhead while reducing memory requirements.
- Achieves zero-overhead compression and decompression of the Key-Value Cache
- Reduces attention computation time while maintaining model quality
- Can be combined with existing methods (eviction, quantization) for enhanced compression
- Employs adaptive compression across attention heads based on their importance (see the illustrative sketch after this list)
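To make the idea concrete, below is a minimal NumPy sketch of per-head dimensionality compression of cached keys and values, with attention computed directly in the compressed space. The function name, the SVD-derived bases, and the example rank are illustrative assumptions for this sketch only; the summary above does not specify how ZACK actually constructs its compression, so this should not be read as the paper's method.

```python
import numpy as np


def compressed_head_attention(Q, K, V, rank):
    """Attention for one head whose keys/values are cached in a low-rank space.

    Q, K, V: (seq_len, head_dim) arrays for a single attention head.
    rank:    compressed dimension r < head_dim chosen for this head.
    """
    head_dim = K.shape[1]

    # Fit illustrative projection bases from the cached keys and values via SVD.
    # (How ZACK actually chooses its projections is not described in this summary.)
    _, _, vt_k = np.linalg.svd(K, full_matrices=False)
    _, _, vt_v = np.linalg.svd(V, full_matrices=False)
    P_k = vt_k[:rank].T                      # (head_dim, rank) key basis
    P_v = vt_v[:rank].T                      # (head_dim, rank) value basis

    # The KV cache would store only the compressed tensors.
    K_c = K @ P_k                            # (seq_len, rank)
    V_c = V @ P_v                            # (seq_len, rank)

    # Fold the key basis into the query so attention scores are computed
    # directly in the compressed space, with no explicit decompression step.
    Q_c = Q @ P_k                            # (seq_len, rank)
    scores = Q_c @ K_c.T / np.sqrt(head_dim)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # Lift the result back to head_dim by folding P_v into the output side.
    return (weights @ V_c) @ P_v.T           # (seq_len, head_dim)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
    # "Adaptive" compression could give more important heads a larger rank.
    out = compressed_head_attention(Q, K, V, rank=32)
    print(out.shape)                         # (128, 64)
```

Because the projections are folded into the query and output sides, the compressed cache is used as-is at attention time, which is the intuition behind the "zero-overhead" claim in the bullets above.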
This engineering advance addresses a critical bottleneck in LLM deployment, enabling faster inference with lower memory requirements, which is essential for practical, cost-effective AI applications.
ZACK: Zero-Overhead LLM Inference Acceleration via Dimensionality Compression of the Key-Value Cache