Accelerating Embedding Operations

Optimizing AI Workloads with Decoupled Access-Execute Architecture

Ember is a specialized compiler that takes embedding operations written in PyTorch or TensorFlow and automatically generates code optimized for Decoupled Access-Execute (DAE) architectures, enabling efficient embedding lookups.

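For context, the kind of operation Ember targets looks like the pooled lookup below in PyTorch; the table size, embedding dimension, and indices here are illustrative placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

# Illustrative embedding-bag lookup of the kind found in recommender models.
# Table size, embedding dim, and index values are made up for this sketch.
table = nn.EmbeddingBag(num_embeddings=1_000_000, embedding_dim=64, mode="sum")

# A ragged batch of sparse feature IDs: each sample gathers a different,
# data-dependent set of rows, which makes the memory access pattern irregular.
indices = torch.tensor([12, 503_210, 77, 912_444, 3, 88_001])
offsets = torch.tensor([0, 2, 5])  # sample boundaries within `indices`

pooled = table(indices, offsets)   # gather rows, then reduce per sample
print(pooled.shape)                # torch.Size([3, 64])
```
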
  • Achieves 2.6× higher performance and 6.4× higher performance/watt than GPUs on end-to-end models
  • Tackles the critical bottleneck of irregular embedding lookups in recommender systems, sparse LLMs, and graph learning
  • Automatically optimizes code without requiring manual architecture-specific rewrites
  • Demonstrates significant efficiency gains through specialized hardware-software co-design

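To make the DAE idea concrete, the sketch below splits the same pooled lookup into an irregular "access" phase and a regular "execute" phase. This is only a conceptual illustration of the access/execute separation that DAE hardware exploits, not the code Ember actually emits.

```python
import torch

def embedding_bag_dae_style(table: torch.Tensor,
                            indices: torch.Tensor,
                            offsets: torch.Tensor) -> torch.Tensor:
    """Conceptual access/execute split of a pooled embedding lookup.

    On a DAE machine the two phases run on separate units, with the access
    side streaming gathered rows to the execute side; here they are simply
    two sequential steps for illustration.
    """
    # Access phase: irregular, data-dependent gathers from the embedding table.
    gathered = table[indices]                       # (num_lookups, dim)

    # Execute phase: regular, dense reduction over each sample's rows.
    ends = torch.cat([offsets[1:], torch.tensor([indices.numel()])])
    return torch.stack([gathered[s:e].sum(dim=0)
                        for s, e in zip(offsets.tolist(), ends.tolist())])

# Matches nn.EmbeddingBag(mode="sum") on the same (illustrative) inputs.
table = torch.randn(1_000_000, 64)
indices = torch.tensor([12, 503_210, 77, 912_444, 3, 88_001])
offsets = torch.tensor([0, 2, 5])
print(embedding_bag_dae_style(table, indices, offsets).shape)  # torch.Size([3, 64])
```
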
This approach enables more efficient training and inference for embedding-heavy AI workloads, potentially reducing compute costs while increasing throughput in large-scale deployments.

Original Paper: Ember: A Compiler for Efficient Embedding Operations on Decoupled Access-Execute Architectures
