
Learning to Keep a Promise: Scaling Language Model Decoding ...
By Tian Jin, Ellie Y. Cheng...
Abstract:
Decoding with autoregressive large language models (LLMs) traditionally occurs sequentially, generating one token after another. An emerging line of work has explored parallel decoding by identifying and simultaneously generating semantically independent chunks of LLM responses. However, these techniques...
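To illustrate the core idea from the abstract, here is a minimal sketch of decoding semantically independent response chunks concurrently and stitching them back in order. All names here (`decode_chunk`, `parallel_decode`, the chunk specifications) are hypothetical stand-ins, not the paper's actual method or API; a toy function simulates the per-chunk sequential decoder.

```python
from concurrent.futures import ThreadPoolExecutor

def decode_chunk(prompt, chunk_spec):
    # Stand-in for sequential token-by-token generation of one chunk.
    return f"[{chunk_spec}] response to: {prompt}"

def parallel_decode(prompt, chunk_specs):
    # Semantically independent chunks can be generated concurrently;
    # pool.map preserves the original chunk order when stitching.
    with ThreadPoolExecutor(max_workers=max(1, len(chunk_specs))) as pool:
        parts = list(pool.map(lambda s: decode_chunk(prompt, s), chunk_specs))
    return " ".join(parts)

print(parallel_decode("list two facts", ["fact 1", "fact 2"]))
```

The key property the sketch shows is that concurrency changes only latency, not the final assembled response, provided the chunks are truly independent.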
Key points:
- Parallel decoding for autoregressive large language models
- Identifies and simultaneously generates semantically independent response chunks