Accelerating Embedding Operations

Optimizing AI Workloads with Decoupled Access-Execute Architecture

Ember is a specialized compiler that takes embedding operations written in PyTorch or TensorFlow and automatically generates code optimized for Decoupled Access-Execute (DAE) architectures, enabling efficient embedding lookups.

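For context, the kind of operation Ember targets looks like the pooled lookup below in PyTorch; the table size, embedding dimension, and indices here are illustrative placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

# Illustrative embedding-bag lookup of the kind found in recommender models.
# Table size, embedding dim, and index values are made up for this sketch.
table = nn.EmbeddingBag(num_embeddings=1_000_000, embedding_dim=64, mode="sum")

# A ragged batch of sparse feature IDs: each sample gathers a different,
# data-dependent set of rows, which makes the memory access pattern irregular.
indices = torch.tensor([12, 503_210, 77, 912_444, 3, 88_001])
offsets = torch.tensor([0, 2, 5])  # sample boundaries within `indices`

pooled = table(indices, offsets)   # gather rows, then reduce per sample
print(pooled.shape)                # torch.Size([3, 64])
```
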
  • Achieves 2.6× higher performance and 6.4× higher performance/watt than GPUs on end-to-end models
  • Tackles the critical bottleneck of irregular embedding lookups in recommender systems, sparse LLMs, and graph learning
  • Automatically optimizes code without requiring manual architecture-specific rewrites
  • Demonstrates significant efficiency gains through specialized hardware-software co-design

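To make the DAE idea concrete, the sketch below splits the same pooled lookup into an irregular "access" phase and a regular "execute" phase. This is only a conceptual illustration of the access/execute separation that DAE hardware exploits, not the code Ember actually emits.

```python
import torch

def embedding_bag_dae_style(table: torch.Tensor,
                            indices: torch.Tensor,
                            offsets: torch.Tensor) -> torch.Tensor:
    """Conceptual access/execute split of a pooled embedding lookup.

    On a DAE machine the two phases run on separate units, with the access
    side streaming gathered rows to the execute side; here they are simply
    two sequential steps for illustration.
    """
    # Access phase: irregular, data-dependent gathers from the embedding table.
    gathered = table[indices]                       # (num_lookups, dim)

    # Execute phase: regular, dense reduction over each sample's rows.
    ends = torch.cat([offsets[1:], torch.tensor([indices.numel()])])
    return torch.stack([gathered[s:e].sum(dim=0)
                        for s, e in zip(offsets.tolist(), ends.tolist())])

# Matches nn.EmbeddingBag(mode="sum") on the same (illustrative) inputs.
table = torch.randn(1_000_000, 64)
indices = torch.tensor([12, 503_210, 77, 912_444, 3, 88_001])
offsets = torch.tensor([0, 2, 5])
print(embedding_bag_dae_style(table, indices, offsets).shape)  # torch.Size([3, 64])
```
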
This approach enables more efficient training and inference for embedding-heavy AI workloads, potentially reducing compute costs while increasing throughput in large-scale deployments.

Original Paper: Ember: A Compiler for Efficient Embedding Operations on Decoupled Access-Execute Architectures
