Why I picked this paper
I used to think language models had to be autoregressive. Next token, next token, forever. Then I found a paper that says you can do language with diffusion instead. That idea felt wrong to me at first, so I decided to read it carefully.
The paper is Large Language Diffusion Models (LLaDA). It claims you can train a language model using a diffusion-style objective and still get strong reasoning and instruction following.
The basic idea in my own words
Diffusion in images works like this. You add noise to clean images, then train a model to remove the noise step by step. Text is discrete, so you cannot add Gaussian noise to tokens; instead, the authors use masking as the corruption process.
So the pipeline is:
- Forward process: randomly mask tokens in the input.
- Reverse process: predict the masked tokens.
Figure 1: The diffusion process for text uses masking instead of Gaussian noise. Clean text is progressively masked (forward), then the model learns to unmask (reverse).
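The forward process is simple enough to sketch. Here is a toy version in Python; the `MASK` token name and the masking rate `t` are my own stand-ins, not the paper's exact setup:

```python
import random

MASK = "<mask>"  # placeholder mask token (name is my choice)

def forward_mask(tokens, t):
    """Forward (corruption) process: mask each token independently
    with probability t. t=0 leaves the text clean; t=1 masks everything."""
    return [MASK if random.random() < t else tok for tok in tokens]

clean = ["the", "cat", "sat", "on", "the", "mat"]
corrupted = forward_mask(clean, 0.5)  # on average, half the tokens masked
```

The model's whole job is learning to invert this one function: given `corrupted`, recover `clean`.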
That sounds like BERT. The difference here is that the authors build it as a generative diffusion process with a likelihood objective, and then scale it like an LLM.
A tiny bit of math intuition
The model is trained to maximize a lower bound on the data likelihood, which is standard in diffusion models. I do not need the full derivation, but the important intuition is:
- There is a sequence of corrupted versions of the text, from clean to heavily masked.
- The model learns to go backwards, from more corrupted to less corrupted.
So the reverse model is the real engine. It predicts missing tokens under a learned probability distribution. In short, it is still a probabilistic language model, just not autoregressive.
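As I understand it, the bound works out to a masked cross-entropy: only positions that were masked contribute, reweighted by the masking rate so that heavy and light corruption are comparable. Here is a toy single sample of that loss; `toy_logprob` is a uniform stand-in for the real network, and the vocabulary is invented:

```python
import math

MASK = "<mask>"
VOCAB = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary

def toy_logprob(position, token, corrupted):
    # stand-in for the network: a uniform distribution over the vocab
    return math.log(1.0 / len(VOCAB))

def masked_ce_loss(clean, corrupted, t, logprob=toy_logprob):
    """One Monte Carlo sample of the diffusion bound: cross-entropy
    summed over masked positions only, scaled by 1/t (the masking rate)."""
    total = 0.0
    for i, (c, m) in enumerate(zip(clean, corrupted)):
        if m == MASK:
            total -= logprob(i, c, corrupted)
    return total / t
```

Squint and it is BERT's loss, except the masking rate is random per example and the 1/t weighting is what makes the whole thing a likelihood bound rather than just a denoising heuristic.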
What they built
They train LLaDA from scratch, not as a fine-tuned masked LM. For the scaling experiments they compare it against autoregressive baselines trained under the same conditions, and they also benchmark the 8B model against existing open LLMs.
Key claims in the paper:
- LLaDA scales similarly to autoregressive models.
- LLaDA 8B is competitive with strong 8B LLMs in in-context learning.
- After supervised fine-tuning, it can do instruction following and multi-turn dialogue.
- It performs well on a reversal task (completing text backwards), addressing the "reversal curse," a known weakness of autoregressive LLMs.
Figure 2: Comparison between autoregressive generation (left-to-right token prediction) and diffusion-based generation (iterative denoising of masked tokens).
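Sampling is where the two pictures differ most. Instead of appending one token at a time, a diffusion-style sampler starts from a fully masked sequence and commits a few predictions per step. A minimal sketch of that loop, with `predict` and `dummy_predict` as my own stand-ins for the trained network:

```python
MASK = "<mask>"

def iterative_unmask(length, predict, steps=4):
    """Reverse-process sketch: start fully masked, then over several
    rounds fill in the positions the model is most confident about,
    re-predicting the remaining masks each round."""
    seq = [MASK] * length
    per_step = max(1, length // steps)
    while MASK in seq:
        # predict() stands in for the network: it returns a guess
        # {position: (token, confidence)} for every masked position
        guesses = predict(seq)
        if not guesses:
            break
        best = sorted(guesses, key=lambda p: guesses[p][1], reverse=True)
        for pos in best[:per_step]:
            seq[pos] = guesses[pos][0]
    return seq

def dummy_predict(seq):
    # trivial stand-in model: "predicts" a position-indexed token
    return {i: (f"tok{i}", 1.0) for i, tok in enumerate(seq) if tok == MASK}

sample = iterative_unmask(6, dummy_predict, steps=3)  # commits 2 positions per round
```

Each round is a full forward pass over the whole sequence, which is exactly why sampling cost shows up in my open questions below.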
What confused me at first
If diffusion is viable, why did everyone pick autoregression?
My guess is that autoregression is easier to optimize, easier to sample from, and already has huge infrastructure and tooling. But the paper argues that autoregression is not a requirement for LLM-like behavior. That was the big mental shift for me.
What I think I learned
- There are multiple valid objectives for language modeling. Next-token prediction is not the only game.
- The training objective can change the kinds of errors the model makes, like the reversal curse.
- Scaling still matters. Even with a different objective, bigger models still learn better.
Questions I still have
- Sampling speed. Diffusion needs multiple denoising passes over the full sequence, and it cannot reuse a KV cache the way autoregressive decoding can.
- Long-form coherence. Do denoising steps drift over long outputs?
- Is the reversal-curse result general, or just a special case?
Why this matters to me as a student
This paper reminded me that the foundations of the field are still flexible. It pushed me to stop treating autoregression as a law of nature and start treating it as one good design choice among others.
If diffusion language models keep improving, they might open up new ways to control generation, improve reasoning behaviors, or reduce certain failure modes. That is exciting from a fundamentals point of view.
Paper: Large Language Diffusion Models (arXiv:2502.09992)