Thanos: Efficient LLM Compression

A block-wise pruning approach that maintains accuracy while reducing model size

Thanos introduces a novel block-wise weight-pruning algorithm that substantially reduces LLM size and computational cost while preserving model accuracy.

  • Implements adaptive masks that dynamically adjust to weight importance
  • Enables flexible sparsity patterns and structured formats optimized for hardware acceleration
  • Achieves effective $n:m$ sparsity for improved computational efficiency
  • Addresses critical deployment challenges for resource-constrained environments
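To make the $n:m$ sparsity pattern concrete, here is a minimal sketch of semi-structured pruning: within every group of $m$ consecutive weights, only the $n$ largest-magnitude entries are kept. This uses simple magnitude as the importance score for illustration; it is not Thanos's block-wise selection criterion, and `prune_n_m` is a hypothetical helper name.

```python
import numpy as np

def prune_n_m(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Keep the n largest-magnitude weights in each group of m along the
    last axis, zeroing the rest (n:m semi-structured sparsity sketch)."""
    rows, cols = weights.shape
    assert cols % m == 0, "column count must be divisible by m"
    groups = weights.reshape(rows, cols // m, m)
    # Indices of the (m - n) smallest-|w| entries in every group.
    drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)

# Example: 2:4 sparsity leaves exactly 2 nonzeros in each group of 4.
W = np.array([[0.1, -2.0, 0.3, 1.5, 0.05, 0.9, -0.2, 0.4]])
W_pruned = prune_n_m(W, n=2, m=4)
```

Hardware such as NVIDIA's sparse tensor cores accelerates exactly this 2:4 layout, which is why structured $n:m$ formats matter for deployment.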

This research advances engineering capabilities for deploying powerful AI models in settings with limited computational resources, making LLM technology more accessible and practical for real-world applications.

Thanos: A Block-wise Pruning Algorithm for Efficient Large Language Model Compression
