
ROMA: Hardware Acceleration for On-Device LLMs
A ROM-based accelerator enabling efficient edge deployment of large language models
ROMA introduces a hardware architecture designed for on-device deployment of QLoRA-quantized LLMs, enabling enhanced privacy and real-time interaction on edge devices.
- Stores the quantized, frozen base model in Read-Only Memory (ROM), reducing power consumption while maintaining model performance
- Implements a hybrid storage architecture matched to the structure of QLoRA-based inference: the frozen quantized base weights can live in ROM, while only the small low-rank adapter weights need writable memory (see the sketch after this list)
- Delivers privacy advantages by keeping user data on-device rather than sending it to cloud services
- Enables real-time interaction that was previously difficult to achieve on resource-constrained edge devices
This research represents a significant advancement in hardware-specific LLM acceleration, addressing key challenges in deploying powerful AI capabilities directly on user devices without compromising performance or security.
Source paper: "ROMA: a Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM"