
Thanos: Efficient LLM Compression
A block-wise pruning approach that maintains accuracy while reducing model size
Thanos introduces a novel block-wise weight-pruning algorithm that significantly reduces LLM size and computational requirements while preserving model performance.
- Implements adaptive masks that dynamically adjust to weight importance
- Enables flexible sparsity patterns and structured formats optimized for hardware acceleration
- Achieves effective $n:m$ sparsity for improved computational efficiency
- Addresses critical deployment challenges for resource-constrained environments
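To make the $n:m$ sparsity format concrete: in an $n{:}m$ pattern, every contiguous group of $m$ weights keeps at most $n$ nonzero entries (e.g., 2:4 sparsity is what NVIDIA sparse tensor cores accelerate). The sketch below is not the Thanos algorithm itself — it uses simple magnitude-based selection as a stand-in for Thanos's importance scores — but it shows the structured mask the pruned weights must satisfy. The function name `nm_prune` is illustrative, not from the paper.

```python
import numpy as np

def nm_prune(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Zero out all but the n largest-magnitude weights in each
    contiguous group of m along the last axis (n:m sparsity).

    Magnitude is used here as a placeholder importance score;
    Thanos computes importance block-wise with adaptive masks.
    """
    orig_shape = weights.shape
    groups = weights.reshape(-1, m)  # one row per group of m weights
    # Indices of the (m - n) smallest-magnitude entries in each group
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (groups * mask).reshape(orig_shape)

W = np.array([[0.9, -0.1, 0.4, 0.05],
              [-0.7, 0.2, -0.3, 0.6]])
W_sparse = nm_prune(W, n=2, m=4)
# Each group of 4 keeps its 2 largest-magnitude weights:
# [[ 0.9, 0.0, 0.4, 0.0],
#  [-0.7, 0.0, 0.0, 0.6]]
```

Because the surviving-weight positions are constrained to a fixed pattern per group, hardware can skip the zeroed entries deterministically, which is what makes $n:m$ sparsity faster in practice than unstructured sparsity at the same ratio.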
This research advances the engineering of LLM deployment in resource-constrained settings, making LLM technology more accessible and practical for real-world applications.
Thanos: A Block-wise Pruning Algorithm for Efficient Large Language Model Compression