
ROMA: Hardware Acceleration for On-Device LLMs
A ROM-based accelerator enabling efficient edge deployment of large language models
ROMA introduces a hardware architecture designed for on-device deployment of QLoRA-quantized LLMs, enabling enhanced privacy and real-time interaction on edge devices.
- Stores the quantized, frozen base model in Read-Only Memory (ROM), reducing power consumption while maintaining model performance
- Implements a hybrid storage architecture matched to the structure of QLoRA-based inference: the frozen quantized base weights can live in ROM, while only the small low-rank adapter weights need writable memory (see the sketch after this list)
- Delivers privacy advantages by keeping user data on-device rather than sending it to cloud services
- Enables real-time interaction that was previously difficult to achieve on resource-constrained edge devices
This research represents a significant advancement in hardware-specific LLM acceleration, addressing key challenges in deploying powerful AI capabilities directly on user devices without compromising performance or security.
Source paper: "ROMA: a Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM"