I want to share a take on grammar-constrained generation I've been working on for a while.
You have a grammar (usually JSON Schema), and you want the LLM to produce valid output.
At each step, the LLM outputs a vector of logits over the vocabulary. Those logits are turned into a probability distribution and sampled to choose a token.
The idea behind constrained generation is simple: don't let the model generate invalid tokens.
Compute a mask over the vocabulary that zeros out any token that would violate the grammar. Then just sample from what's left.
No more broken JSON.
Think of it as two functions:
get_mask(state) returns a bitmask over the LLM vocabulary.
commit(state, token) updates the parser state with the chosen token.

At runtime the loop looks like:
loop:
  parallel:
    GPU: logits
    CPU: mask
  apply mask
  sample token
  commit to model and constraint
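In Python-flavored code, one way that loop could look is the sketch below. The names apply_mask, sample, and the model/constraint objects are mine, not any particular library's API, and the loop is serial here even though in a real server the logits and the mask are computed in parallel:

```python
import math
import random

def apply_mask(logits, mask):
    # Disallowed tokens get -inf, so softmax assigns them zero probability.
    return [l if ok else -math.inf for l, ok in zip(logits, mask)]

def sample(logits, rng):
    # Numerically stable softmax sampling over the masked logits.
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    return rng.choices(range(len(logits)), weights=weights)[0]

def decode(model, constraint, state, max_tokens, seed=0):
    # Sketch of the decoding loop: mask, sample, then commit the chosen
    # token to both the model context and the constraint state.
    rng = random.Random(seed)
    out = []
    for _ in range(max_tokens):
        logits = model.next_logits(out)    # GPU: forward pass
        mask = constraint.get_mask(state)  # CPU: mask from parser state
        token = sample(apply_mask(logits, mask), rng)
        state = constraint.commit(state, token)
        out.append(token)
    return out
```

Because masked logits are -inf, a masked token can never be sampled, no matter how confident the model was in it.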
LLM inference is batched: multiple user requests run on the GPU at once.
And mask generation sits directly on the critical decoding path.
If you want to generate 1k tokens per second, you have 1 millisecond to produce each mask - one mask every millisecond, 1000 times a second. And if you miss even a single deadline - if a mask takes 2ms instead of 1 - it screws up scheduling for the whole batch.
Masks can't be late; they have to arrive on time every time.
That's why it's worst-case performance that matters. p99.9.
When designing a grammar constraint, some choices seem obvious. There'll be a constraint state, and each time a token is sampled, the state gets notified so it can update. That notification is the commit method.
This is rather like incremental parsing, where the goal is to update a parse state as the input changes. In our case, the edits are always at the end—appending tokens as we go.
So incremental parsers are a natural place to look for inspiration.
Many battle-tested incremental parsing systems, such as tree-sitter and Lezer, employ Generalized LR (GLR) parsing. I won't get into how that works here; suffice it to say that an LR parser has a fixed set of states, and at runtime it maintains a stack of those states (more precisely, state IDs). It decides what to do by reading the top of the stack and the next input symbol, consuming one symbol at a time and never backtracking. In this sense, the LR parser's "state" is a stack of state IDs (hope that terminology isn't confusing).
LR parsers are a bit limited when it comes to representing ambiguity, so they don't work for all context-free grammars. But replacing the vanilla stack with a graph-structured stack (GSS) overcomes this limitation while keeping the parser state compact enough to update cheaply on every token.
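A minimal sketch of the idea (GssNode and its field names are mine, not tree-sitter's or Lezer's): each node holds an LR state ID plus a set of predecessor nodes, so when two ambiguous parses reach the same state, they share a single top node instead of duplicating the whole stack.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GssNode:
    state: int                      # LR automaton state ID
    preds: frozenset = frozenset()  # predecessor nodes: a graph, not a list

# A plain stack is the degenerate case: each node has one predecessor.
bottom = GssNode(0)
a = GssNode(3, frozenset({bottom}))
b = GssNode(5, frozenset({bottom}))
# Ambiguity: two parse histories converge on state 7 and share one node.
top = GssNode(7, frozenset({a, b}))
```

The parser only ever touches the frontier of this graph, which is what makes per-token updates cheap.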
Laurence Tratt has a great article on LR parsing where he describes avoiding it for years before realizing it was the right tool. Marijn Haverbeke, the author of Lezer, tells a similar story. It's an interesting pattern: experienced parser authors independently converging on GLR for incremental parsing.
You have a GSS encoding the current parse state, and you need a mask over LLM tokens. How do we compute it?
A naive implementation might check each token by feeding its terminal sequence (or sequences) into the GSS and seeing whether it can be accepted. But this doesn't scale.
Fun fact: The longest non-space token in o200k (GPT-5's tokenizer) is ----------------------------------------------------------------------------------------------------------------. That's 112 dashes, for those counting. And in JavaScript, this is perfectly valid input. To make matters worse, since both - and -- are valid terminals (the latter being the decrement operator), the number of ways to tokenize it is the 113th Fibonacci number - roughly 1.8 × 10²³.
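The Fibonacci count falls out of a two-piece recurrence: a run of n dashes either starts with - followed by n-1 dashes, or -- followed by n-2 dashes. A quick sketch to check the arithmetic:

```python
def count_tokenizations(n):
    # Ways to split a run of n dashes into "-" and "--" terminals.
    # ways(n) = ways(n-1) + ways(n-2): the Fibonacci recurrence.
    a, b = 1, 1  # ways(0), ways(1)
    for _ in range(n - 1):
        a, b = b, a + b
    return b if n >= 1 else a
```

For "---" there are three tokenizations (-/-/-, -/--, --/-), and the count for 112 dashes already has 24 digits.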
And there are 200k tokens to check - that's 200k validity checks per mask, at every single step.
To make this sane, most constraint libraries build a trie or trellis representing all possible tokenizations of all LLM tokens in a compact way. To distinguish between LLM tokens, each edge in the trie can be conditioned on a set of LLM tokens.
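As a toy sketch (simplified to one terminal sequence per LLM token, with a made-up dict layout rather than any library's actual structure), each trie edge accumulates the set of LLM token IDs still consistent with the path so far, so one walk prunes many tokens at once:

```python
def build_trie(vocab):
    # vocab maps LLM token id -> its sequence of grammar terminals.
    # Each node's "tokens" set records which LLM tokens pass through it.
    root = {"edges": {}, "tokens": set(vocab)}
    for tid, terminals in vocab.items():
        node = root
        for t in terminals:
            node = node["edges"].setdefault(t, {"edges": {}, "tokens": set()})
            node["tokens"].add(tid)
    return root

# Toy vocabulary: "-" as one token, "--" as a single two-dash token,
# and a token whose tokenization is two "-" terminals in a row.
vocab = {0: ("-",), 1: ("-", "-"), 2: ("--",)}
trie = build_trie(vocab)
```

Shared prefixes are checked once: rejecting the "-" edge at the root rules out tokens 0 and 1 together.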
But a trie alone isn't enough. Each token can invoke many, many reductions before it shifts. So even if you're only processing 300 tokens in a single mask step, that can add up to thousands of GSS operations, which aren't cheap. Plus the GSS fragments over time, so you have to simplify it continually, which costs even more.
The naive approach of iterating over the vocabulary and all possible tokenizations is a non-starter.
I've tried to make this fast. It's a dead end.
Token validity depends on what's on the stack. Instead of feeding terminals into the GSS per token, read the stack directly and filter tokens in one pass.
A weighted automaton is like a finite automaton, but each transition carries a weight that accumulates as you traverse. Here the weights are bitsets of tokens. Traversing a transition intersects the token set; merging paths onto one state takes their union.
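Here's a deterministic sketch of the traversal. The encoding of transitions as a (state, stack symbol) → (next state, bitset) table is illustrative, and the union-at-merge happens during determinization, which is elided; the point is that the stack is read top-down, ANDing each edge's token bitset into the mask:

```python
def mask_from_stack(stack, transitions, full_mask):
    # stack: list of stack symbols, top of stack last.
    # transitions: {(automaton_state, symbol): (next_state, token_bitset)}.
    state, mask = 0, full_mask
    for symbol in reversed(stack):  # read from the top down
        edge = transitions.get((state, symbol))
        if edge is None:
            break       # in this sketch, deeper entries can't refine further
        state, weight = edge
        mask &= weight  # intersect this edge's allowed-token bitset
        if mask == 0:
            break       # nothing allowed: no need to read deeper
    return mask
```

Bitsets make the per-edge work a handful of AND instructions over machine words, and the early exits are what keep the traversal bounded by how much the stack actually constrains the mask.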
Weighted automata are well-studied, with known determinization and minimization strategies.
On the commit side, GLR-style techniques are a strong fit for fast incremental context-free grammar (CFG) parsing. They use table-driven decisions and represent ambiguity with a compact GSS. This is the family of techniques behind incremental parsers like tree-sitter. If you want a parser that updates quickly as new tokens arrive, this design is a natural place to start.
On the mask side, the bitset-weighted automaton performs the minimum kind of work you can reasonably hope for: it reads the stack once, never backtracks, and reads only as deep as needed. The computation is driven purely by the information the stack actually contains, and it terminates as soon as that information is sufficient to decide the mask.
This is the part that took most of the engineering time for me. The idea is simple; getting it to behave on large grammars took a bag of tricks to avoid exponential blowups—especially around determinization and minimization—and to get compile times down.
Maybe I'll write about that next.