Supercharging LLMs with Hardware-Efficient Compression

Tensor-Train Decomposition for FPGA Acceleration

This research demonstrates a novel approach that compresses large language models (LLMs) through tensor-train decomposition (TTD) and accelerates them on FPGA hardware, while maintaining model performance.
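
To make the compression idea concrete, the following is a minimal NumPy sketch of a generic TT-SVD factorization of a single weight matrix. The shapes, ranks, and function names are illustrative assumptions rather than the paper's implementation, which applies the decomposition to ChatGLM3-6B and LLaMA2-7B layers and maps the resulting cores onto an FPGA accelerator.

```python
# Minimal TT-SVD sketch (illustrative only; shapes, ranks, and function names
# are assumptions, not the paper's actual pipeline).
import numpy as np

def tt_decompose(tensor, max_rank):
    """Factor a d-way tensor into a list of 3-way TT cores via sequential truncated SVD."""
    dims = tensor.shape
    cores, rank_prev = [], 1
    mat = tensor
    for k in range(len(dims) - 1):
        mat = mat.reshape(rank_prev * dims[k], -1)
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        rank = min(max_rank, s.size)                       # truncate to the TT rank
        cores.append(u[:, :rank].reshape(rank_prev, dims[k], rank))
        mat = s[:rank, None] * vt[:rank, :]                # remainder carried to the next core
        rank_prev = rank
    cores.append(mat.reshape(rank_prev, dims[-1], 1))      # last core
    return cores

def tt_contract(cores):
    """Rebuild the full tensor by contracting the TT cores (what inference must compute)."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.squeeze(axis=(0, -1))

# Example: fold a 1024x1024 weight matrix into a 4-way tensor and compress it.
W = np.random.randn(1024, 1024).astype(np.float32)
cores = tt_decompose(W.reshape(32, 32, 32, 32), max_rank=8)
print("parameters:", W.size, "->", sum(c.size for c in cores))

# Rebuilding the matrix is lossy for a random W; in practice the TT ranks are
# chosen per layer so that model accuracy is preserved.
W_hat = tt_contract(cores).reshape(1024, 1024)
```
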

  • Achieves up to 1.94× compression for ChatGLM3-6B and 1.60× for LLaMA2-7B models
  • Implements a Group Vector Systolic Array architecture for efficient hardware acceleration (a simplified systolic-array sketch follows this list)
  • Reduces both storage and computational requirements for resource-constrained environments
  • Maintains model performance while significantly decreasing hardware demands
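
For intuition on the accelerator side, here is a cycle-level toy simulation of a plain output-stationary systolic array computing a matrix product in Python. It is a simplified stand-in, not the paper's Group Vector Systolic Array, which groups processing elements to operate on vectors and is realized in FPGA logic.

```python
# Toy output-stationary systolic array simulation (illustrative only; the paper's
# Group Vector Systolic Array is an FPGA design with vector-level PE grouping).
import numpy as np

def systolic_matmul(A, B):
    """Simulate an M x N grid of PEs computing C = A @ B, one data hop per cycle."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)        # each PE (i, j) keeps C[i, j] locally
    a_reg = np.zeros((M, N), dtype=A.dtype)    # A operands currently held by the PEs
    b_reg = np.zeros((M, N), dtype=B.dtype)    # B operands currently held by the PEs
    for t in range(M + N + K - 2):             # cycles to drain the skewed wavefronts
        # Operands hop one PE per cycle: A moves right, B moves down.
        a_reg = np.roll(a_reg, 1, axis=1)
        b_reg = np.roll(b_reg, 1, axis=0)
        # Boundary PEs receive skewed streams: row i of A is delayed by i cycles and
        # column j of B by j cycles, so A[i, k] and B[k, j] meet exactly at PE (i, j).
        for i in range(M):
            a_reg[i, 0] = A[i, t - i] if 0 <= t - i < K else 0
        for j in range(N):
            b_reg[0, j] = B[t - j, j] if 0 <= t - j < K else 0
        # Every PE multiplies the pair it currently holds and accumulates in place.
        C += a_reg * b_reg
    return C

A = np.random.randn(5, 7)
B = np.random.randn(7, 3)
assert np.allclose(systolic_matmul(A, B), A @ B)
```
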

This engineering breakthrough enables deployment of powerful LLMs on resource-limited hardware, making advanced AI more accessible for edge computing and embedded systems.

A Tensor-Train Decomposition based Compression of LLMs on Group Vector Systolic Accelerator
