
Supercharging LLMs with Hardware-Efficient Compression
Tensor-Train Decomposition for FPGA Acceleration
This research demonstrates a novel approach to compressing large language models with tensor-train decomposition (TTD), maintaining model performance while enabling efficient acceleration on FPGA hardware.
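To make the idea concrete, the sketch below folds a single dense weight matrix into a higher-order tensor and factorizes it into small TT cores with successive truncated SVDs. This is a minimal illustration only: the layer size (1024×1024), factor dimensions, and rank cap are invented for the example and are not the shapes or ranks used in the paper.

```python
# Minimal TT-SVD sketch: fold a dense weight matrix into a 6-way tensor and
# factorize it into small "TT cores" with successive truncated SVDs.
# The layer size, factor dims, and rank cap below are illustrative only.
import numpy as np

def tt_decompose(W, dims, max_rank):
    """Return TT cores of W reshaped to `dims`, truncating each SVD to max_rank."""
    cores, r_prev = [], 1
    C = W.reshape(dims)
    for k in range(len(dims) - 1):
        C = C.reshape(r_prev * dims[k], -1)          # unfold the remaining tensor
        U, S, Vt = np.linalg.svd(C, full_matrices=False)
        r = min(max_rank, len(S))                    # truncate to the TT rank
        cores.append(U[:, :r].reshape(r_prev, dims[k], r))
        C = np.diag(S[:r]) @ Vt[:r]                  # carry the residual factor forward
        r_prev = r
    cores.append(C.reshape(r_prev, dims[-1], 1))
    return cores

def tt_reconstruct(cores, shape):
    """Contract the TT cores back into a full matrix (to check the sketch)."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.reshape(shape)

W = np.random.randn(1024, 1024).astype(np.float32)   # stand-in for one dense LLM layer
cores = tt_decompose(W, dims=(16, 8, 8, 16, 8, 8), max_rank=16)
stored = sum(c.size for c in cores)
print(f"per-layer compression: {W.size / stored:.1f}x")
err = np.linalg.norm(W - tt_reconstruct(cores, W.shape)) / np.linalg.norm(W)
print(f"relative error: {err:.3f}")  # large for random data; trained weights are the case the paper targets
```

The per-layer ratio printed here is arbitrary; the end-to-end model compression reported in the highlights below depends on the TT ranks and tensor shapes actually chosen in the paper.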
- Achieves up to 1.94× compression for the ChatGLM3-6B model and 1.60× for LLaMA2-7B
- Implements a Group Vector Systolic Array (GVSA) architecture for efficient hardware acceleration (a generic dataflow sketch follows this list)
- Reduces both storage and computational requirements for resource-constrained environments
- Maintains model performance while significantly decreasing hardware demands
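On the hardware side, the paper maps the decomposed layers onto its Group Vector Systolic Array; that specific design is not reproduced here. As a purely illustrative stand-in, the sketch below simulates a small output-stationary systolic array performing tiled matrix multiplication, the general dataflow family such accelerators build on. The PE-grid size, the skewed schedule, and all names are invented for the example.

```python
# Purely illustrative: a tiny output-stationary systolic-array simulation of a
# tiled matrix multiply. This is NOT the paper's GVSA design; it only sketches
# the general dataflow such accelerators build on.
import numpy as np

def systolic_matmul(A, B, n_pe=4):
    """Multiply A (M x K) by B (K x N) on an n_pe x n_pe grid of MAC units."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for mi in range(0, M, n_pe):                      # iterate over output tiles
        for ni in range(0, N, n_pe):
            acc = np.zeros((n_pe, n_pe), dtype=A.dtype)   # one accumulator per PE
            # Skewed schedule: at cycle t, PE (i, j) consumes A[mi+i, k] and
            # B[k, ni+j] with k = t - i - j, mimicking the diagonal wavefront.
            for t in range(K + 2 * (n_pe - 1)):
                for i in range(n_pe):
                    for j in range(n_pe):
                        k = t - i - j
                        if 0 <= k < K and mi + i < M and ni + j < N:
                            acc[i, j] += A[mi + i, k] * B[k, ni + j]
            m_end, n_end = min(mi + n_pe, M), min(ni + n_pe, N)
            C[mi:m_end, ni:n_end] = acc[: m_end - mi, : n_end - ni]
    return C

A = np.random.randn(8, 16).astype(np.float32)
B = np.random.randn(16, 8).astype(np.float32)
assert np.allclose(systolic_matmul(A, B), A @ B, atol=1e-4)
```

A real FPGA implementation pipelines the multiply-accumulate wavefront across physical processing elements every clock cycle; the nested Python loops above only mimic that schedule functionally.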
This engineering breakthrough enables deployment of powerful LLMs on resource-limited hardware, making advanced AI more accessible for edge computing and embedded systems.
A Tensor-Train Decomposition based Compression of LLMs on Group Vector Systolic Accelerator