This is a take on grammar-constrained generation I've been working on for a while.
I think of it as two functions:

- get_mask(state) returns a bitmask over the LLM vocabulary.
- commit(state, token) updates the parser state with the chosen token.

At runtime the loop looks like:
loop:
  parallel:
    logits = llm.step()           # GPU
    mask = constraint.get_mask()  # CPU
  logits = apply_mask(logits, mask)
  token = sample(logits)
  llm.commit(token)
  constraint.commit(token)
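The masking step itself is simple. Here is a minimal runnable sketch of it; the logits, mask, and greedy sampler are toy stand-ins (in a real system the mask computation runs on the CPU while the model's forward pass runs on the GPU):

```python
def apply_mask(logits, mask):
    # Disallowed tokens get -inf so they can never be sampled.
    return [l if m else float("-inf") for l, m in zip(logits, mask)]

def sample_greedy(logits):
    # Greedy sampling keeps the example deterministic.
    return max(range(len(logits)), key=lambda i: logits[i])

# Toy vocabulary of 4 tokens; suppose the grammar currently allows tokens 1 and 3.
logits = [2.0, 1.0, 3.0, 0.5]
mask = [False, True, False, True]
token = sample_greedy(apply_mask(logits, mask))
# token == 1: the highest-logit token among the allowed set,
# even though token 2 has the highest raw logit overall.
```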
LLM tokens don’t align cleanly with grammar terminals, so the implementation relies on a precomputed mapping between them.
Obviously I am simplifying a bit here; tokenizer state matters too, but this is the basic idea.
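To make the misalignment concrete, here is a hypothetical precomputed mapping (the token strings and terminal names are illustrative, not from any particular grammar). A single BPE token can cover several terminals, or only a fragment of one:

```python
# Hypothetical precomputed token -> terminal-sequence mapping.
# One LLM token may span multiple grammar terminals, exactly one,
# or only a prefix of one (which is where lexer/tokenizer state comes in).
token_to_terminals = {
    '"]:':  ["DQUOTE", "RBRACKET", "COLON"],  # one token, three terminals
    "true": ["TRUE"],                         # one token, one terminal
    "tru":  None,                             # prefix of a terminal: needs lexer state
}
```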
Commit is just incremental GLR parsing. The parser receives consecutive chunks of input and updates its internal state as they come. Ambiguity is represented in a graph-structured stack (GSS), which keeps the state compact.
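A minimal sketch of the GSS idea, with illustrative state numbers: each node links to its possible predecessors, so ambiguous parses share stack suffixes instead of duplicating whole stacks.

```python
class GSSNode:
    """One node of a graph-structured stack: a parser state plus
    links to every stack node that could sit directly below it."""
    def __init__(self, state, parents=()):
        self.state = state
        self.parents = set(parents)

# Two ambiguous parses that agree below state 7 share a single tail node:
tail = GSSNode(7)
top_a = GSSNode(12, [tail])   # one interpretation of the input so far
top_b = GSSNode(15, [tail])   # a competing interpretation
frontier = [top_a, top_b]     # the GSS frontier after a commit

# Both tops point at the same tail object, so the shared suffix is
# stored once no matter how many interpretations are live.
assert top_a.parents == top_b.parents == {tail}
```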
You have a GSS encoding the current parse state, and you need a mask over LLM tokens. What would a good solution look like?
The vocabulary is large. Checking each token by simulating a full parse would explode the work.
Token validity is a function of the stack contents, so a good solution should read the stack one item at a time, top-down, filtering tokens as it goes.
Different terminals require different read depths, but the operation is the same. A token maps to one or more terminal sequences, so each token corresponds to a small stack‑reading pattern. Many tokens end up sharing those patterns because they map to the same terminals.
So we track all tokens together while reading the stack. Build a weighted automaton whose transitions carry bitsets of tokens that can take them, and whose accepting states carry tokens that can finish there. Then read the stack once, keeping track of which tokens are still compatible (live) and which are already valid (accepted):
live = ALL_TOKENS
accepted = EMPTY
state = start
for sym in stack_top_to_bottom:
  live &= transition_bits[state, sym]
  state = next_state[state, sym]
  accepted |= accept_bits[state] & live
  live &= ~accepted
  if live is empty:
    break
return accepted
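A runnable toy version of this loop, using Python ints as bitsets. The automaton here is hand-built for illustration: three tokens (bits 0 to 2), where token 0 only needs to see "A" on top of the stack, token 1 needs "A" then "B", and token 2 is never valid for this stack.

```python
ALL_TOKENS = 0b111  # bitset of all three toy tokens

# Hand-built tables for the illustration; a real implementation
# would derive these from the grammar and the token/terminal mapping.
transition_bits = {("s0", "A"): 0b011, ("s1", "B"): 0b010}
next_state      = {("s0", "A"): "s1",  ("s1", "B"): "s2"}
accept_bits     = {"s1": 0b001, "s2": 0b010}

def compute_mask(stack_top_to_bottom):
    live, accepted, state = ALL_TOKENS, 0, "s0"
    for sym in stack_top_to_bottom:
        live &= transition_bits.get((state, sym), 0)  # drop incompatible tokens
        state = next_state.get((state, sym))
        accepted |= accept_bits.get(state, 0) & live  # collect tokens that finish here
        live &= ~accepted                             # accepted tokens leave the race
        if not live:
            break                                     # decision fully determined
    return accepted

assert compute_mask(["A", "B"]) == 0b011  # tokens 0 and 1 valid, token 2 not
assert compute_mask(["A"]) == 0b001       # with a shallower stack, only token 0
```

Note that the second assert also shows the early-exit behavior in reverse: the read goes only as deep as the stack allows, and whatever has been accepted by then is the mask.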
This is fast because the work is just bitset intersections along one pass over the stack, and tokens that accept early drop out of live.
On the commit side, GLR-style techniques are a strong fit for fast incremental CFG parsing. They encode as much work as possible in tables and represent ambiguity with a compact GSS. This is the family of techniques behind incremental parsers like tree-sitter. If you want a parser that updates quickly as new tokens arrive, this design is a natural place to start.
On the mask side, the bitset-weighted automaton does close to the minimum work you could hope for: it reads the stack once, never backtracks, and reads only as deep as needed to decide the mask. The computation is driven purely by the information the stack actually contains, and it terminates as soon as that information is sufficient.
Grammar-constrained decoding is two operations: mask and commit. GLR gives you a fast, principled commit step for CFGs. The mask can be computed by treating token validity as a stack-reading automaton and folding all tokens into a single weighted automaton with bitset weights. The result is a mask computation that is linear in the size of the GLR state, uses only bitset intersections, and does not backtrack.
If you want grammar-constrained decoding to be fast, this is the piece that matters.