
Unlocking LLM Speed with Existing Hardware
Accelerating language models using unmodified DRAM
This research repurposes standard, unmodified DRAM as a compute engine for the general matrix-vector (GeMV) multiplications that dominate LLM inference, enabling faster processing without specialized hardware.
- Addresses the GeMV latency bottleneck in LLM inference, even for quantized low-bit models (a conceptual sketch of this operation appears below)
- Leverages Processing-Using-DRAM (PUD) techniques to execute GeMV at high throughput inside the memory itself
- Enables consumer devices to run LLMs more efficiently using existing DRAM resources
- Demonstrates a practical implementation that requires no modifications to the DRAM hardware
This engineering breakthrough matters because it could make AI acceleration more accessible and cost-effective across billions of existing devices, potentially democratizing access to LLM technology.
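The bottleneck referenced above is the token-by-token GeMV between a large, low-bit quantized weight matrix and a single activation vector. The NumPy sketch below illustrates that operation under assumed details (the layer width, the 4-bit scale/zero-point layout, and the name gemv_low_bit are illustrative, not from the paper); it shows what MVDRAM accelerates, not how PUD executes it inside DRAM.

```python
import numpy as np

HIDDEN = 4096  # illustrative layer width (assumption, not from the paper)
rng = np.random.default_rng(0)

# 4-bit weights stored as integers in [0, 15] with a per-row scale and a
# zero point -- one common low-bit quantization layout (assumed here).
w_q = rng.integers(0, 16, size=(HIDDEN, HIDDEN), dtype=np.uint8)
scale = (rng.random(HIDDEN, dtype=np.float32) * 0.01)[:, None]
zero = np.float32(8.0)

# Activation vector produced by the previous layer for the current token.
x = rng.standard_normal(HIDDEN).astype(np.float32)

def gemv_low_bit(w_q, scale, zero, x):
    """Dequantize low-bit weights and compute y = W @ x.

    During token-by-token decoding this memory-bound GeMV dominates latency,
    since the entire weight matrix must be read to produce one output vector;
    MVDRAM's goal is to perform the multiply inside unmodified DRAM instead
    of streaming W out to the CPU or GPU.
    """
    w = (w_q.astype(np.float32) - zero) * scale  # dequantize to float
    return w @ x                                 # matrix-vector product

y = gemv_low_bit(w_q, scale, zero, x)
print(y.shape)  # -> (4096,)
```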
MVDRAM: Enabling GeMV Execution in Unmodified DRAM for Low-Bit LLM Acceleration