
Unlocking LLM Speed with Existing Hardware
Accelerating language models using unmodified DRAM
This research repurposes standard, unmodified DRAM as a compute engine for the general matrix-vector (GeMV) multiplications that dominate LLM inference, enabling faster processing without specialized hardware.
- Addresses the GeMV latency bottleneck in LLM inference, even for quantized low-bit models (a conceptual sketch of this operation appears below)
- Leverages Processing-Using-DRAM (PUD) techniques to execute GeMV at high throughput inside the memory itself
- Enables consumer devices to run LLMs more efficiently using existing DRAM resources
- Demonstrates a practical implementation that requires no modifications to the DRAM hardware
This engineering breakthrough matters because it could make AI acceleration more accessible and cost-effective across billions of existing devices, potentially democratizing access to LLM technology.
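The bottleneck referenced above is the token-by-token GeMV between a large, low-bit quantized weight matrix and a single activation vector. The NumPy sketch below illustrates that operation under assumed details (the layer width, the 4-bit scale/zero-point layout, and the name gemv_low_bit are illustrative, not from the paper); it shows what MVDRAM accelerates, not how PUD executes it inside DRAM.

```python
import numpy as np

HIDDEN = 4096  # illustrative layer width (assumption, not from the paper)
rng = np.random.default_rng(0)

# 4-bit weights stored as integers in [0, 15] with a per-row scale and a
# zero point -- one common low-bit quantization layout (assumed here).
w_q = rng.integers(0, 16, size=(HIDDEN, HIDDEN), dtype=np.uint8)
scale = (rng.random(HIDDEN, dtype=np.float32) * 0.01)[:, None]
zero = np.float32(8.0)

# Activation vector produced by the previous layer for the current token.
x = rng.standard_normal(HIDDEN).astype(np.float32)

def gemv_low_bit(w_q, scale, zero, x):
    """Dequantize low-bit weights and compute y = W @ x.

    During token-by-token decoding this memory-bound GeMV dominates latency,
    since the entire weight matrix must be read to produce one output vector;
    MVDRAM's goal is to perform the multiply inside unmodified DRAM instead
    of streaming W out to the CPU or GPU.
    """
    w = (w_q.astype(np.float32) - zero) * scale  # dequantize to float
    return w @ x                                 # matrix-vector product

y = gemv_low_bit(w_q, scale, zero, x)
print(y.shape)  # -> (4096,)
```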
MVDRAM: Enabling GeMV Execution in Unmodified DRAM for Low-Bit LLM Acceleration