
Accelerating Embedding Operations
Optimizing AI Workloads with Decoupled Access-Execute Architecture
Ember introduces a specialized compiler that takes embedding operations expressed in PyTorch and TensorFlow and automatically generates optimized code for efficient embedding lookups on Decoupled Access-Execute (DAE) processors.
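To make the input side concrete, below is a minimal sketch of the kind of framework-level embedding lookup such a compiler would consume; the table size, batch shape, and pooling mode are hypothetical and only illustrate the access pattern, not Ember's actual benchmarks.

```python
# Illustrative PyTorch input for a compiler like Ember; sizes and pooling
# mode are hypothetical, chosen only to show the lookup pattern.
import torch
import torch.nn as nn

# A large embedding table, as found in recommender models.
table = nn.EmbeddingBag(num_embeddings=1_000_000, embedding_dim=64, mode="sum")

# A batch of data-dependent index lists: `indices` gathers rows scattered
# across the table, `offsets` marks where each sample's list begins.
indices = torch.randint(0, 1_000_000, (4096,))
offsets = torch.arange(0, 4096, 32)      # 128 samples of 32 ids each

pooled = table(indices, offsets)         # gather rows, then sum-pool per sample
```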
- Achieves 2.6× higher performance and 6.4× higher performance/watt than GPUs on end-to-end models
- Tackles the critical bottleneck of irregular embedding lookups in recommender systems, sparse LLMs, and graph learning (illustrated in the sketch after this list)
- Automatically optimizes code without requiring manual architecture-specific rewrites
- Demonstrates significant efficiency gains through specialized hardware-software co-design
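As a rough illustration of why a DAE target suits this workload, the sketch below separates an embedding lookup into an address-generating access phase and an arithmetic execute phase. This is a conceptual decomposition for intuition, not Ember's generated code.

```python
# Conceptual split of an embedding lookup into the two streams a Decoupled
# Access-Execute processor runs concurrently. Assumption-laden sketch only.
import torch

table = torch.rand(1_000_000, 64)
indices = torch.randint(0, 1_000_000, (128, 32))   # 128 samples x 32 ids

# Access stream: the irregular, data-dependent gathers. A DAE access unit can
# issue many of these far ahead of use, hiding DRAM latency.
gathered = table[indices]                          # shape (128, 32, 64)

# Execute stream: dense, regular arithmetic on the gathered rows (sum pooling),
# independent of the address-generation logic above.
pooled = gathered.sum(dim=1)                       # shape (128, 64)
```

Because the gather addresses depend on input data while the pooling arithmetic does not, the two halves can proceed at different rates, which is the property DAE hardware is designed to exploit.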
This innovation enables more efficient training and inference for embedding-heavy AI workloads, potentially reducing compute costs while increasing throughput for large-scale deployments.