Comparison of synchronous speculative decoding (left) with AMUSD (right). The synchronous approach alternates between drafting and verifying phases, meaning only one model can work at a time. AMUSD uses asynchronous generation running on multiple GPUs to enable continuous generation. The draft model must roll back invalid tokens that conflict with the verify model.

Source publication
Preprint
Large language models typically generate tokens autoregressively, using each token as input for the next. Recent work on Speculative Decoding has sought to accelerate this process by employing a smaller, faster draft model to more quickly generate candidate tokens. These candidates are then verified in parallel by the larger (original) verify model...
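The draft-then-verify pattern summarized in the abstract can be illustrated with a short sketch of one synchronous speculative-decoding step. This is not code from the paper; `draft_next`, `verify_next`, and the speculation depth `k` are hypothetical stand-ins for the small draft model, the large verify model, and the number of candidate tokens drafted per step.

```python
# Minimal sketch of one synchronous speculative-decoding step (illustrative only).
# `draft_next` and `verify_next` are hypothetical callables mapping a token sequence
# to the next token id for the draft and verify models respectively.
def speculative_step(tokens, draft_next, verify_next, k=4):
    # Drafting phase: the small model extends the sequence k tokens ahead.
    draft = list(tokens)
    candidates = []
    for _ in range(k):
        nxt = draft_next(draft)
        candidates.append(nxt)
        draft.append(nxt)

    # Verifying phase: the large model checks each candidate position (in practice
    # this is a single batched forward pass; a loop is used here for clarity).
    accepted = list(tokens)
    for cand in candidates:
        expected = verify_next(accepted)
        if cand == expected:
            accepted.append(cand)        # candidate agrees with the verify model
        else:
            accepted.append(expected)    # first mismatch: keep the verify model's token
            break
    else:
        # Every candidate was accepted; the verify model adds one more token for free.
        accepted.append(verify_next(accepted))
    return accepted

# Toy usage with trivial stand-in "models" that simply count upward.
out = speculative_step([1, 2, 3], draft_next=lambda t: t[-1] + 1, verify_next=lambda t: t[-1] + 1)
print(out)  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Because the two phases alternate, the draft model sits idle while the verify model runs and vice versa, which is the limitation AMUSD targets.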

Contexts in source publication

Context 1
... Speculative Decoding has shown promise in accelerating language model inference, it still faces limitations due to its synchronous nature. As illustrated in Figure 1 (left), traditional speculative decoding alternates between distinct drafting and verifying phases, allowing only one model to perform computations at any given time. To overcome this constraint, we propose Asynchronous Multi-process Speculative Decoding (AMUSD), a novel system that decouples these phases into continuous, parallel operations. ...
Context 2
... overcome this constraint, we propose Asynchronous Multi-process Speculative Decoding (AMUSD), a novel system that decouples these phases into continuous, parallel operations. As shown in Figure 1 (right), AMUSD leverages multiple devices (e.g., GPUs) to enable simultaneous and independent predictions from both the draft and verify models. This asynchronous approach allows for continuous generation, with the draft model producing candidate tokens while the verify model concurrently validates previously generated sequences. ...
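As a rough illustration of the control flow described in this context, the thread-based sketch below shows a draft loop that keeps extending a shared sequence while a verify loop concurrently accepts tokens or rolls back a conflicting suffix. It is not the authors' implementation (AMUSD runs the two models as separate processes on separate GPUs); `draft_next`, `verify_next`, and `max_new` are hypothetical placeholders.

```python
import threading

def amusd_sketch(prompt, draft_next, verify_next, max_new=32):
    lock = threading.Lock()
    tokens = list(prompt)        # verified prefix followed by unverified draft tokens
    verified = len(prompt)       # number of tokens the verify model has accepted
    done = threading.Event()

    def draft_loop():
        # The draft model keeps extending the sequence without waiting for verification.
        while not done.is_set():
            with lock:
                snapshot = list(tokens)
            nxt = draft_next(snapshot)
            with lock:
                if tokens == snapshot:   # only append if no rollback happened meanwhile
                    tokens.append(nxt)

    def verify_loop():
        nonlocal verified
        while verified < len(prompt) + max_new:
            with lock:
                prefix = tokens[:verified]
                has_draft = len(tokens) > verified
                draft_tok = tokens[verified] if has_draft else None
            expected = verify_next(prefix)
            with lock:
                if has_draft and draft_tok == expected:
                    verified += 1                    # draft token accepted
                else:
                    # Rollback: discard conflicting draft tokens, keep the verify token.
                    del tokens[verified:]
                    tokens.append(expected)
                    verified += 1
        done.set()

    drafter = threading.Thread(target=draft_loop)
    verifier = threading.Thread(target=verify_loop)
    drafter.start(); verifier.start()
    verifier.join(); drafter.join()
    return tokens[:verified]

# Toy usage with trivial stand-in "models" that simply count upward.
out = amusd_sketch([0], draft_next=lambda t: t[-1] + 1, verify_next=lambda t: t[-1] + 1, max_new=5)
print(out)  # [0, 1, 2, 3, 4, 5]
```

The rollback branch corresponds to the behavior described in the figure caption: draft tokens that conflict with the verify model are discarded, and generation continues from the verify model's token.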

Similar publications

Preprint
Enhancing the reasoning capabilities of large language models (LLMs) is crucial for enabling them to tackle complex, multi-step problems. Multi-agent frameworks have shown great potential in enhancing LLMs' reasoning capabilities. However, the lack of effective cooperation between LLM agents hinders their performance, especially for multi-step reas...