I want to share a take on grammar-constrained generation I've been working on for a while.
You have a grammar (usually JSON Schema), and you want the LLM to produce valid output.
At each step, the LLM outputs a vector of logits over the vocabulary, and a token is sampled from the resulting distribution.
The idea behind constrained generation is simple: don't let the model generate invalid tokens.
Compute a mask over the vocabulary that zeros out any token that would violate the grammar. Then just sample from what's left.
No more broken JSON.
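As a minimal sketch of the masking step (plain Python lists standing in for real logit tensors, sampling details omitted):

```python
import math

def apply_mask(logits, allowed):
    # Disallowed tokens get probability zero: set their logit to -inf.
    return [l if ok else float("-inf") for l, ok in zip(logits, allowed)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy vocabulary of 4 tokens; suppose the grammar only allows tokens 0 and 2.
logits = [1.0, 3.0, 2.0, 0.5]
probs = softmax(apply_mask(logits, [True, False, True, False]))
# Tokens 1 and 3 now have probability 0; the remaining mass renormalizes.
```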
Think of it as two functions:
- `get_mask(state)` returns a bitmask over the LLM vocabulary.
- `commit(state, token)` updates the parser state with the chosen token.

At runtime the loop looks like:
```
loop:
  parallel:
    GPU: logits
    CPU: mask
  apply mask
  sample token
  commit to model and constraint
```
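Putting the two functions and the loop together, here is a toy, fully sequential version in Python. The fake model, the three-token vocabulary, and the trivial three-state constraint are all made up for illustration; a real server overlaps the GPU and CPU work:

```python
# Toy end-to-end loop: a fake "model" that prefers "a" everywhere, and a
# trivial constraint that forces the three-token sequence { a }.
VOCAB = ["{", "a", "}"]

def fake_logits(_prefix):
    return [0.1, 5.0, 0.1]  # the unconstrained model would emit "a" forever

def get_mask(state):
    # state is just how many tokens we've committed; a real implementation
    # would consult the parser's stack here.
    return [[True, False, False], [False, True, False], [False, False, True]][state]

def commit(state, _token):
    return state + 1

def sample(logits, mask):
    # Greedy sampling over the allowed tokens only.
    allowed = [(l, i) for i, (l, ok) in enumerate(zip(logits, mask)) if ok]
    return max(allowed)[1]

state, out = 0, []
for _ in range(3):
    logits = fake_logits(out)   # GPU side
    mask = get_mask(state)      # CPU side, overlapped with the GPU in practice
    tok = sample(logits, mask)
    out.append(VOCAB[tok])
    state = commit(state, tok)
# out is ["{", "a", "}"] even though the raw model prefers "a" at every step
```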
LLM inference is batched: multiple user requests run on the GPU at once.
And mask generation sits directly on the critical decoding path.
If you want to generate 1k tokens per second, you have 1 millisecond to produce each mask - one mask every millisecond, a thousand times a second. And if you miss even a single deadline - if a mask takes 2ms instead of 1 - it screws up scheduling.
Masks can't be late; they have to arrive on time every time.
That's why it's worst-case performance that matters. p99.9.
Commit is incremental parsing. Byte chunks arrive one at a time, and the parser updates its state after each.
For incremental parsing, GLR is the natural fit—it's general enough to handle ambiguous grammars and fast enough for real-time use. Tree-sitter and Lezer both use GLR-style techniques for the same reason. Ambiguity is tracked in a graph-structured stack (GSS), which keeps the state compact.
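The GSS idea can be sketched in a few lines. This is a toy structure with placeholder states and no LR table, not tree-sitter's or Lezer's implementation: each node holds a parser state plus a set of predecessors, so stacks that diverge and re-converge share structure.

```python
# A minimal graph-structured stack: each node holds a parser state and a list
# of predecessor nodes. Ambiguous parses fork and re-merge instead of copying
# the whole stack.
class Node:
    def __init__(self, state, preds=()):
        self.state = state
        self.preds = list(preds)

# Two ambiguous interpretations that shift different states but then converge:
root = Node("s0")
a = Node("s1", [root])    # interpretation 1 of the input so far
b = Node("s2", [root])    # interpretation 2
top = Node("s3", [a, b])  # both continue identically: one node, two histories

# Popping k symbols means following every predecessor path of length k.
def pop(node, k):
    frontier = [node]
    for _ in range(k):
        frontier = [p for n in frontier for p in n.preds]
    return frontier
# pop(top, 2) reaches the single shared root via both paths.
```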
You have a GSS encoding the current parse state, and you need a mask over LLM tokens. How would we do it?
A naive implementation might check each token by feeding its terminal sequence (or sequences) into the GSS and seeing whether it can be accepted. But this doesn't scale.
Fun fact: The longest non-space token in o200k (GPT-5's tokenizer) is ----------------------------------------------------------------------------------------------------------------. That's 112 dashes, for those counting. And in JavaScript, this is perfectly valid input. To make matters worse, since both - and -- are valid terminals (the latter being the decrement operator), the number of ways to tokenize it is the 113th Fibonacci number—about 1.8 × 10²³.
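The Fibonacci claim is easy to check with a two-line recurrence; `ways` below is an illustrative helper counting the splits of a dash run into - and -- terminals:

```python
# Count the ways to split a run of n dashes into the terminals "-" and "--".
# The run ends in either "-" or "--", so ways(n) = ways(n-1) + ways(n-2):
# exactly the Fibonacci recurrence.
def ways(n):
    a, b = 1, 1  # ways(0) = ways(1) = 1
    for _ in range(n - 1):
        a, b = b, a + b
    return b if n else 1

print(ways(112))  # the 113th Fibonacci number, ~1.8e23
```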
And there are 200k tokens to check—that's 200k validity checks per mask, at every single step.
To make this sane, most constraint libraries build a trie or trellis representing all possible tokenizations of all LLM tokens in a compact, shared form. To distinguish between LLM tokens, each edge in the trie is conditioned on a set of LLM tokens.
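A toy version of that shared structure, with a plain dict-based trie and token-id sets on the edges (the representation here is made up for illustration, not any particular library's):

```python
# Sketch of the shared-prefix structure: a trie over token strings where each
# edge records which LLM tokens pass through it. Walking the trie visits every
# distinct prefix once, instead of once per token.
def build_trie(vocab):
    trie = {}  # node: dict of char -> (token_id_set, child_node)
    for tid, tok in enumerate(vocab):
        node = trie
        for ch in tok:
            ids, child = node.setdefault(ch, (set(), {}))
            ids.add(tid)
            node = child
    return trie

vocab = ["-", "--", "-=", "->"]
trie = build_trie(vocab)
ids, child = trie["-"]  # all four tokens share the first "-" edge
```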
But a trie alone isn't enough. Each token can invoke many, many reductions before it shifts. So, even if you're only processing 300 tokens in a single mask step, that can sum up to thousands of GSS operations, which aren't cheap. Plus the GSS fragments over time, so you have to simplify it continually, which costs even more.
The naive approach of iterating over the vocabulary and all possible tokenizations is a non-starter.
I've tried to make this fast. It's a dead end.
Token validity depends on what's on the stack. Instead of feeding terminals into the GSS per token, read the stack directly and filter tokens in one pass.
A weighted automaton is like a finite automaton, but each transition carries a weight that accumulates as you traverse. Here the weights are bitsets of tokens. Traversing a transition intersects the token set; merging paths onto one state takes their union.
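A sketch of that traversal, assuming a tiny hand-built automaton with Python ints standing in for token bitsets (in this sketch, a state with no outgoing edges is treated as accepting):

```python
# Bitset-weighted automaton walk: following a transition ANDs its weight into
# the running token set; paths that merge on the same state OR their sets.
def run(edges, start, full_mask):
    final = 0
    frontier = {start: full_mask}
    while frontier:
        nxt = {}
        for state, tokens in frontier.items():
            outs = edges.get(state, [])
            if not outs:
                final |= tokens  # accepting: these tokens survive
            for weight, dst in outs:
                nxt[dst] = nxt.get(dst, 0) | (tokens & weight)  # AND, then OR
        frontier = {s: t for s, t in nxt.items() if t}  # drop empty sets
    return final

# Three-token vocabulary (bits 0..2); two branches, one of which narrows the set.
edges = {
    "q0": [(0b011, "q1"), (0b100, "q2")],
    "q1": [(0b001, "q3")],
}
mask = run(edges, "q0", 0b111)  # == 0b101: tokens 0 and 2 remain valid
```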
Weighted automata are well-studied, with known determinization and minimization strategies.
On the commit side, GLR-style techniques are a strong fit for fast incremental context-free grammar (CFG) parsing. They use table-driven decisions and represent ambiguity with a compact GSS. This is the family of techniques behind incremental parsers like tree-sitter. If you want a parser that updates quickly as new tokens arrive, this design is a natural place to start.
On the mask side, the bitset-weighted automaton performs the minimum kind of work you can reasonably hope for: it reads the stack once, never backtracks, and descends only as deep as needed to decide the mask. The computation is driven purely by what the stack actually contains, and it stops the moment that information is sufficient.
This is the part that took most of the engineering time for me. The idea is simple; getting it to behave on large grammars took a bag of tricks to avoid exponential blowups—especially around determinization and minimization—and to get compile times down.
Maybe I'll write about that next.