MIT and NVIDIA Boost LLM Speed 6x with Diffusion Model
Researchers from MIT and NVIDIA have introduced a breakthrough in large language model (LLM) inference speed, achieving a sixfold increase without compromising output quality. The innovation, detailed in the paper “DFlash” by Zhijian Liu’s group, replaces the traditional autoregressive draft model in speculative decoding with a diffusion model. Because the diffusion draft generates an entire block of candidate tokens in parallel rather than one token at a time, drafting becomes far cheaper, which is where the overall speedup comes from.
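For readers unfamiliar with speculative decoding, the sketch below shows the propose-then-verify loop in its simplest, greedy-verification form. The `draft_block` and `target_next` functions are toy stand-ins invented for illustration (they are not the DFlash or target-LLM APIs); the point is only that a draft proposes a whole block of tokens at once and the target keeps the longest matching prefix.

```python
from typing import Callable, List

def speculative_decode(
    prompt: List[int],
    draft_block: Callable[[List[int], int], List[int]],  # proposes k tokens at once
    target_next: Callable[[List[int]], int],             # target model's greedy next token
    block_size: int = 8,
    max_new_tokens: int = 64,
) -> List[int]:
    """Greedy-verification sketch: accept drafted tokens while they match the
    target's own choice; on the first mismatch, keep the target's token and
    draft a fresh block from the corrected prefix."""
    tokens = list(prompt)
    generated = 0
    while generated < max_new_tokens:
        draft = draft_block(tokens, block_size)   # block proposed in parallel
        for tok in draft:
            expected = target_next(tokens)        # in practice: one batched verify pass
            tokens.append(tok if tok == expected else expected)
            generated += 1
            if tok != expected or generated >= max_new_tokens:
                break
    return tokens

# Tiny demo with toy "models" that both count upward, so most drafts are accepted.
if __name__ == "__main__":
    toy_draft = lambda ctx, k: [ctx[-1] + i + 1 for i in range(k)]
    toy_target = lambda ctx: ctx[-1] + 1
    print(speculative_decode([0], toy_draft, toy_target, block_size=4, max_new_tokens=12))
```

In a production stack the target verifies the whole drafted block in a single batched forward pass, so the end-to-end cost is dominated by drafting plus verification; a draft that emits its block in parallel, as a diffusion model can, shrinks the drafting half of that budget.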
The DFlash draft conditions on hidden states from the target LLM, which keeps acceptance rates high even though the draft is not autoregressive. It delivers a 2.5x speedup over the current state-of-the-art EAGLE-3 model while requiring significantly fewer training samples, and it is drop-in compatible: no changes to existing inference stacks are needed, making it practical for real-time applications and cost-effective at scale. The result highlights the potential of diffusion models in text generation, where their strength in parallel prediction can be put to work.
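The paper's actual architecture is not described here, so the following is only a hypothetical sketch under stated assumptions of what "conditioning on the target's hidden states" can look like: a small cross-attention head (the names `HiddenStateConditionedDraft` and `position_queries` are invented for this example) reads the target LLM's hidden states and emits logits for every position in the draft block in a single forward pass. The diffusion/denoising machinery itself is omitted; only the conditioning interface and the block-parallel output are illustrated.

```python
import torch
import torch.nn as nn

class HiddenStateConditionedDraft(nn.Module):
    """Illustrative draft head (not DFlash): learned per-position queries
    cross-attend to the target LLM's hidden states and predict a whole
    block of candidate-token logits at once."""

    def __init__(self, hidden_dim: int, vocab_size: int, block_size: int):
        super().__init__()
        self.block_size = block_size
        # One learned query vector per draft position in the block.
        self.position_queries = nn.Parameter(torch.randn(block_size, hidden_dim) * 0.02)
        # Cross-attention from draft positions to the target's hidden states.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, target_hidden: torch.Tensor) -> torch.Tensor:
        # target_hidden: (batch, seq_len, hidden_dim) from the verified prefix.
        # Returns (batch, block_size, vocab_size): every draft position is
        # predicted in parallel rather than one token at a time.
        batch = target_hidden.size(0)
        queries = self.position_queries.unsqueeze(0).expand(batch, -1, -1)
        attended, _ = self.cross_attn(queries, target_hidden, target_hidden)
        return self.proj(attended)

# Usage: hidden states for the verified prefix go in, a block of draft logits
# comes out for the target model to verify in its next forward pass.
draft_head = HiddenStateConditionedDraft(hidden_dim=512, vocab_size=32000, block_size=8)
logits = draft_head(torch.randn(2, 40, 512))
print(logits.shape)  # torch.Size([2, 8, 32000])
```

Because the draft only consumes hidden states the target already computes, a head of this kind can sit alongside an existing serving stack without altering the target model, which is consistent with the drop-in compatibility the article describes.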