
Boosting LLM Efficiency with Model-Attention Disaggregation
A novel architecture for optimized LLM deployment on heterogeneous hardware
Lamina is a new system that significantly improves LLM serving efficiency by disaggregating model components to better match hardware capabilities.
- Addresses the inefficient use of expensive accelerators during LLM decoding, where memory-bound attention over the growing KV cache leaves compute units underutilized
- Implements model-attention disaggregation to distribute workloads optimally across heterogeneous hardware
- Achieves up to 1.8× throughput improvement compared to traditional monolithic approaches
- Enables more cost-effective deployment of large language models in production environments
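To make the disaggregation idea concrete, the sketch below simulates one decode step split across two device roles: compute-bound projection GEMMs stay on the accelerator, while memory-bound attention over the KV cache runs on a separate memory-capacity-optimized device. This is an illustrative sketch only, not Lamina's implementation; the function names, the specific layer split, and the NumPy simulation of device placement are assumptions for exposition.

```python
import numpy as np

D, H = 64, 4            # hidden size, number of attention heads (illustrative)
HD = D // H             # per-head dimension

rng = np.random.default_rng(0)
Wq = rng.standard_normal((D, D)) / np.sqrt(D)
Wk = rng.standard_normal((D, D)) / np.sqrt(D)
Wv = rng.standard_normal((D, D)) / np.sqrt(D)
Wo = rng.standard_normal((D, D)) / np.sqrt(D)

def accelerator_projections(x):
    """Compute-bound GEMMs: kept on the accelerator in this sketch."""
    return x @ Wq, x @ Wk, x @ Wv

def memory_device_attention(q, k_cache, v_cache):
    """Memory-bound attention over the growing KV cache:
    the part a disaggregated design offloads to a memory-optimized device."""
    q = q.reshape(H, HD)
    k = k_cache.reshape(-1, H, HD)   # (seq_len, heads, head_dim)
    v = v_cache.reshape(-1, H, HD)
    scores = np.einsum('hd,shd->hs', q, k) / np.sqrt(HD)
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    out = np.einsum('hs,shd->hd', probs, v)
    return out.reshape(D)

# One decode step: project on the accelerator, attend on the memory device.
k_cache = rng.standard_normal((8, D))   # cached keys for 8 prior tokens
v_cache = rng.standard_normal((8, D))
x = rng.standard_normal(D)              # current token's hidden state

q, k, v = accelerator_projections(x)
k_cache = np.vstack([k_cache, k])       # append this step's KV entries
v_cache = np.vstack([v_cache, v])
attn = memory_device_attention(q, k_cache, v_cache)
y = attn @ Wo                           # output projection back on the accelerator
print(y.shape)                          # prints (64,)
```

The split reflects the asymmetry the bullets describe: the projection GEMMs have high arithmetic intensity and suit accelerators, while decode-time attention streams the entire KV cache per token and is dominated by memory bandwidth and capacity, so placing it on cheaper memory-rich hardware frees the accelerator for dense compute.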
This research matters for engineering teams because it offers a practical way to work around hardware bottlenecks in LLM serving, potentially reducing infrastructure costs while maintaining performance.
Paper: Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation