MIT and NVIDIA Boost LLM Speed 6x with Diffusion Model
Researchers from MIT and NVIDIA have introduced a breakthrough in large language model (LLM) inference speed, achieving a sixfold increase without compromising output quality. The innovation, detailed in the paper “DFlash” by Zhijian Liu’s group, replaces the traditional autoregressive draft model in speculative decoding with a diffusion model. Because the diffusion draft generates an entire block of candidate tokens in parallel rather than one token at a time, drafting becomes far cheaper, which is where the overall speedup comes from.
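For readers unfamiliar with speculative decoding, the sketch below shows the propose-then-verify loop in its simplest, greedy-verification form. The `draft_block` and `target_next` functions are toy stand-ins invented for illustration (they are not the DFlash or target-LLM APIs); the point is only that a draft proposes a whole block of tokens at once and the target keeps the longest matching prefix.

```python
from typing import Callable, List

def speculative_decode(
    prompt: List[int],
    draft_block: Callable[[List[int], int], List[int]],  # proposes k tokens at once
    target_next: Callable[[List[int]], int],             # target model's greedy next token
    block_size: int = 8,
    max_new_tokens: int = 64,
) -> List[int]:
    """Greedy-verification sketch: accept drafted tokens while they match the
    target's own choice; on the first mismatch, keep the target's token and
    draft a fresh block from the corrected prefix."""
    tokens = list(prompt)
    generated = 0
    while generated < max_new_tokens:
        draft = draft_block(tokens, block_size)   # block proposed in parallel
        for tok in draft:
            expected = target_next(tokens)        # in practice: one batched verify pass
            tokens.append(tok if tok == expected else expected)
            generated += 1
            if tok != expected or generated >= max_new_tokens:
                break
    return tokens

# Tiny demo with toy "models" that both count upward, so most drafts are accepted.
if __name__ == "__main__":
    toy_draft = lambda ctx, k: [ctx[-1] + i + 1 for i in range(k)]
    toy_target = lambda ctx: ctx[-1] + 1
    print(speculative_decode([0], toy_draft, toy_target, block_size=4, max_new_tokens=12))
```

In a production stack the target verifies the whole drafted block in a single batched forward pass, so the end-to-end cost is dominated by drafting plus verification; a draft that emits its block in parallel, as a diffusion model can, shrinks the drafting half of that budget.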
The DFlash draft conditions on hidden states from the target LLM, which keeps acceptance rates high even though the draft is not autoregressive. It delivers a 2.5x speedup over the current state-of-the-art EAGLE-3 model while requiring significantly fewer training samples, and it is drop-in compatible: no changes to existing inference stacks are needed, making it practical for real-time applications and cost-effective at scale. The result highlights the potential of diffusion models in text generation, where their strength in parallel prediction can be put to work.
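The paper's actual architecture is not described here, so the following is only a hypothetical sketch under stated assumptions of what "conditioning on the target's hidden states" can look like: a small cross-attention head (the names `HiddenStateConditionedDraft` and `position_queries` are invented for this example) reads the target LLM's hidden states and emits logits for every position in the draft block in a single forward pass. The diffusion/denoising machinery itself is omitted; only the conditioning interface and the block-parallel output are illustrated.

```python
import torch
import torch.nn as nn

class HiddenStateConditionedDraft(nn.Module):
    """Illustrative draft head (not DFlash): learned per-position queries
    cross-attend to the target LLM's hidden states and predict a whole
    block of candidate-token logits at once."""

    def __init__(self, hidden_dim: int, vocab_size: int, block_size: int):
        super().__init__()
        self.block_size = block_size
        # One learned query vector per draft position in the block.
        self.position_queries = nn.Parameter(torch.randn(block_size, hidden_dim) * 0.02)
        # Cross-attention from draft positions to the target's hidden states.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, target_hidden: torch.Tensor) -> torch.Tensor:
        # target_hidden: (batch, seq_len, hidden_dim) from the verified prefix.
        # Returns (batch, block_size, vocab_size): every draft position is
        # predicted in parallel rather than one token at a time.
        batch = target_hidden.size(0)
        queries = self.position_queries.unsqueeze(0).expand(batch, -1, -1)
        attended, _ = self.cross_attn(queries, target_hidden, target_hidden)
        return self.proj(attended)

# Usage: hidden states for the verified prefix go in, a block of draft logits
# comes out for the target model to verify in its next forward pass.
draft_head = HiddenStateConditionedDraft(hidden_dim=512, vocab_size=32000, block_size=8)
logits = draft_head(torch.randn(2, 40, 512))
print(logits.shape)  # torch.Size([2, 8, 32000])
```

Because the draft only consumes hidden states the target already computes, a head of this kind can sit alongside an existing serving stack without altering the target model, which is consistent with the drop-in compatibility the article describes.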