Supercharging LLMs with Hardware-Efficient Compression

Tensor-Train Decomposition for FPGA Acceleration

This research demonstrates a novel approach that compresses large language models (LLMs) through tensor-train decomposition (TTD) and accelerates them on FPGA hardware, while maintaining model performance.
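
To make the compression idea concrete, the following is a minimal NumPy sketch of a generic TT-SVD factorization of a single weight matrix. The shapes, ranks, and function names are illustrative assumptions rather than the paper's implementation, which applies the decomposition to ChatGLM3-6B and LLaMA2-7B layers and maps the resulting cores onto an FPGA accelerator.

```python
# Minimal TT-SVD sketch (illustrative only; shapes, ranks, and function names
# are assumptions, not the paper's actual pipeline).
import numpy as np

def tt_decompose(tensor, max_rank):
    """Factor a d-way tensor into a list of 3-way TT cores via sequential truncated SVD."""
    dims = tensor.shape
    cores, rank_prev = [], 1
    mat = tensor
    for k in range(len(dims) - 1):
        mat = mat.reshape(rank_prev * dims[k], -1)
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        rank = min(max_rank, s.size)                       # truncate to the TT rank
        cores.append(u[:, :rank].reshape(rank_prev, dims[k], rank))
        mat = s[:rank, None] * vt[:rank, :]                # remainder carried to the next core
        rank_prev = rank
    cores.append(mat.reshape(rank_prev, dims[-1], 1))      # last core
    return cores

def tt_contract(cores):
    """Rebuild the full tensor by contracting the TT cores (what inference must compute)."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.squeeze(axis=(0, -1))

# Example: fold a 1024x1024 weight matrix into a 4-way tensor and compress it.
W = np.random.randn(1024, 1024).astype(np.float32)
cores = tt_decompose(W.reshape(32, 32, 32, 32), max_rank=8)
print("parameters:", W.size, "->", sum(c.size for c in cores))

# Rebuilding the matrix is lossy for a random W; in practice the TT ranks are
# chosen per layer so that model accuracy is preserved.
W_hat = tt_contract(cores).reshape(1024, 1024)
```
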

  • Achieves up to 1.94× compression for ChatGLM3-6B and 1.60× for LLaMA2-7B models
  • Implements a Group Vector Systolic Array architecture for efficient hardware acceleration (a simplified systolic-array sketch follows this list)
  • Reduces both storage and computational requirements for resource-constrained environments
  • Maintains model performance while significantly decreasing hardware demands
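
For intuition on the accelerator side, here is a cycle-level toy simulation of a plain output-stationary systolic array computing a matrix product in Python. It is a simplified stand-in, not the paper's Group Vector Systolic Array, which groups processing elements to operate on vectors and is realized in FPGA logic.

```python
# Toy output-stationary systolic array simulation (illustrative only; the paper's
# Group Vector Systolic Array is an FPGA design with vector-level PE grouping).
import numpy as np

def systolic_matmul(A, B):
    """Simulate an M x N grid of PEs computing C = A @ B, one data hop per cycle."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)        # each PE (i, j) keeps C[i, j] locally
    a_reg = np.zeros((M, N), dtype=A.dtype)    # A operands currently held by the PEs
    b_reg = np.zeros((M, N), dtype=B.dtype)    # B operands currently held by the PEs
    for t in range(M + N + K - 2):             # cycles to drain the skewed wavefronts
        # Operands hop one PE per cycle: A moves right, B moves down.
        a_reg = np.roll(a_reg, 1, axis=1)
        b_reg = np.roll(b_reg, 1, axis=0)
        # Boundary PEs receive skewed streams: row i of A is delayed by i cycles and
        # column j of B by j cycles, so A[i, k] and B[k, j] meet exactly at PE (i, j).
        for i in range(M):
            a_reg[i, 0] = A[i, t - i] if 0 <= t - i < K else 0
        for j in range(N):
            b_reg[0, j] = B[t - j, j] if 0 <= t - j < K else 0
        # Every PE multiplies the pair it currently holds and accumulates in place.
        C += a_reg * b_reg
    return C

A = np.random.randn(5, 7)
B = np.random.randn(7, 3)
assert np.allclose(systolic_matmul(A, B), A @ B)
```
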

This engineering breakthrough enables deployment of powerful LLMs on resource-limited hardware, making advanced AI more accessible for edge computing and embedded systems.

A Tensor-Train Decomposition based Compression of LLMs on Group Vector Systolic Accelerator
