This document interleaves three versions of the article for comparison:
- constrained-decoding-weighted-automata.md (original)
- constrained-decoding-weighted-automata 1.md (expanded)
- Stack-Driven Masking...md (polished)

I want to share a take on grammar-constrained generation I've been working on for a while. This post is a preliminary writeup: enough detail to communicate the core read-GLR-stack-with-a-weighted-automaton idea (and timestamp it). I’ll likely follow up with a post on the various tricks you need to actually build the damn things later.
I want to share a take on grammar-constrained generation I’ve been working on for a while. This post is a preliminary writeup: enough detail to communicate the core “read the (G)LR stack with a weighted automaton” idea (and timestamp it). I’ll likely follow up with a post on the various tricks you need to actually build the damn things without the compiler exploding.
This post is intentionally a “foot-in-the-door” writeup: enough detail to communicate the core idea (and timestamp it), not a full implementation guide. I’ll likely follow up with the compile-time tricks later.
You have a grammar (often a JSON Schema) and you want the LLM to produce grammatically valid output.
At each step, the LLM outputs logits over the vocabulary V. You sample a token. Append it to the output. Repeat.
Constrained decoding is: don’t sample tokens that would make the output grammatically invalid.
So we maintain a constraint state and, for each step, compute a mask of allowed tokens, and apply it to the logits.
No more broken JSON.
You have a grammar (often a JSON Schema, or “JSON-but-with-some-fields”, or a programming-language subset) and you want the LLM to produce grammatically valid output.
At each step, the LLM outputs logits over the vocabulary V. You sample a token. Append it to the output. Repeat.
Constrained decoding is: don’t sample tokens that would make the output grammatically invalid.
So we maintain a constraint state and, for each step, compute a mask of allowed tokens, and apply it to the logits.
No more broken JSON.
A precise way to say “allowed” (this matters later):
A token t is valid given the current prefix p if there exists some continuation c such that p · t · c is in the grammar language.
Equivalently, the mask is

mask(p) = { t ∈ V : ∃c. p · t · c ∈ L(G) }

That “∃c” is the whole game: we’re not asking “does this token complete a valid document right now?”, we’re asking “could it still lead to a valid completion?”
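To make the existential quantifier concrete, here is a brute-force sketch over a toy finite language (LANG and VOCAB are invented for illustration; real grammar languages are infinite, which is exactly why the machinery below exists):

```python
# Toy finite "grammar language" and toy vocabulary (illustrative only).
LANG = {"{}", '{"a":1}'}
VOCAB = ["{", "}", '"a"', ":1}", "x"]

def mask(prefix):
    # Token t is allowed iff some continuation exists, i.e. some full
    # string in LANG extends prefix + t.
    return {t for t in VOCAB
            if any(s.startswith(prefix + t) for s in LANG)}

assert mask("") == {"{"}
assert mask("{") == {"}", '"a"'}
assert mask('{"a"') == {":1}"}
```

Note that `mask("{")` allows `"}"` even though `"{"` alone is not a complete document: validity is about reachability of some completion, not completeness right now.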
(Section not present)
Think of it as two functions:
- get_mask(state) returns a bitmask over the LLM vocabulary.
- commit(state, token) updates the parser state with the chosen token.

At runtime the loop looks like:
loop:
parallel:
GPU: logits
CPU: mask
apply mask
sample token
commit token to model and constraint
Think of constrained decoding as two functions:
- get_mask(state) -> bitmask: returns a bitmask over the LLM vocabulary.
- commit(state, token) -> state: updates the constraint state with the chosen token.

At runtime the loop looks like:
loop:
parallel:
GPU: logits
CPU: mask = get_mask(state)
apply mask to logits
sample token
state = commit(state, token)
Nothing fancy, except: get_mask is on the critical path of decoding.
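Here is a minimal runnable sketch of that loop, with random numbers standing in for model logits and a toy (ab)* constraint standing in for a real grammar (everything here is invented for illustration):

```python
import random

VOCAB = ["a", "b", "<eos>"]

# Toy constraint: the output must match (ab)*.
# State 0 = between pairs, state 1 = saw "a", waiting for "b".
def get_mask(state):
    return {0: {"a", "<eos>"}, 1: {"b"}}[state]

def commit(state, token):
    return 1 if token == "a" else 0

random.seed(0)
state, out = 0, []
while True:
    logits = {t: random.random() for t in VOCAB}   # stand-in for the GPU step
    allowed = get_mask(state)                      # CPU-side mask
    token = max(allowed, key=logits.__getitem__)   # masked greedy sampling
    if token == "<eos>":
        break
    out.append(token)
    state = commit(state, token)

text = "".join(out)
```

Whatever the (fake) logits do, `text` always matches (ab)*: the mask makes invalid tokens unsampleable.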
It’s useful to separate constrained decoding into two methods:
get_mask(state) -> bitset[V]
Returns which LLM tokens are valid next from this state.
commit(state, token)
Update the constraint state after that token has been chosen.
At runtime you want a tight loop like:
loop:
parallel:
GPU: compute logits
CPU: compute mask
apply mask to logits
sample token
commit token to:
- model KV-cache
- constraint state
This split matters because:
- commit is “do work proportional to what we actually emitted” (one token).
- get_mask is “do work every step no matter what” (critical path).

LLM inference is batched: multiple user requests run on the GPU at once. And mask generation sits directly on the critical decoding path.
If you want to generate 1k tokens per second, you have 1 millisecond to produce a mask. And if you miss even a single one - if a mask takes 2ms instead of 1 - it screws up scheduling. Masks can't be late; they have to arrive on time every time. It's worst-case performance that matters.
It’s not “can you do masking fast on average?” It’s “can you do it fast every single step?”
LLM inference is batched: multiple user requests run on the GPU at once, and each step wants a mask at the same time.
If you want to generate ~1k tokens/sec, you have ~1ms to produce a mask. And if you miss even a single one—if a mask takes 2ms instead of 1—then in a batched setting it screws up scheduling: the GPU sits waiting on a straggler CPU mask, and suddenly your p99 latency is trash.
Masks can’t be late; they have to arrive on time every step. It’s worst-case performance that matters.
It’s not “can you do masking fast on average?”
It’s “can you do it fast every single step?”
(Section not present)
The commit op is analogous to incremental parsing. You're incrementally parsing a growing prefix. Many battle-tested systems like tree-sitter and Lezer use Generalized LR (GLR) parsing with a graph-structured stack (GSS).
The LR parser maintains a stack of state IDs and decides what to do by reading the top of the stack plus the next input symbol. The "Generalized" part handles ambiguity: instead of a single stack, you keep a graph of possible stacks sharing structure.
Glossary: grammar terminals are symbols like --.

The commit op is basically incremental parsing: you’re incrementally parsing a growing prefix.
Many battle-tested systems like tree-sitter and Lezer use Generalized LR (GLR) parsing with a graph-structured stack (GSS).
So commit is not the “new idea” here. commit is: use a decent incremental GLR parser and keep it compact.
The interesting part is get_mask.
Glossary: grammar terminals are symbols like --, IDENT, STRING.

(Section not present)
Now you have a parse state (often a GSS), and you want a mask over LLM tokens.
A straightforward approach is:
For each LLM token t, check whether appending it keeps the parse valid.
But you don't append an LLM token to the parser. You append terminal symbols. A single LLM token might expand to multiple terminals, and that segmentation can be ambiguous.
Now you have a parse state (often a GSS), and you want a mask over LLM tokens.
A straightforward approach is:
For each LLM token t, check whether appending it keeps the parse valid.
But you don’t append an LLM token to the parser. You append terminal symbols.
And a single LLM token is a byte string which might:

- expand to multiple terminals (e.g. "}\n" could be RBRACE, NEWLINE, …),
- start or stop partway through a terminal (e.g. inside true),

So “try token t” is really “try every way t’s bytes could advance the lexer into some terminal sequence, and then try advancing the parser for that terminal sequence.”
That mismatch is where a lot of “obvious” masking schemes get wrecked.
(Section not present)
Consider: in a lexer where - and -- are both valid terminals, a run of N dashes has Fib(N+1) possible segmentations (Fibonacci growth). The longest non-space token in cl200k (GPT-5's tokenizer) is 112 dashes. That's on the order of 10^23 segmentations to consider—for a single candidate token.
And there are 200k tokens to check.
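You can check the Fibonacci claim directly by counting the ways to tile a run of N dashes with pieces of length 1 (-) and 2 (--):

```python
def segmentations(n):
    # Number of ways to tile a run of n dashes with "-" and "--".
    a, b = 1, 1          # ways for n = 0 and n = 1
    for _ in range(n - 1):
        a, b = b, a + b  # Fibonacci recurrence
    return b if n > 0 else 1

assert segmentations(1) == 1          # -
assert segmentations(2) == 2          # --, -|-
assert segmentations(3) == 3
assert segmentations(112) > 10**22    # ~1.8e23 for the 112-dash token
```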
(Section not present)
(Section not present)
A common direction is:

Precompute a trie over the vocabulary's byte strings. Each step, walk the trie while simulating the lexer and parser, pruning branches as soon as they become invalid. (Our ------------- token, for example, becomes one long path in the trie.)
This can work well for many grammars and workloads. But for general CFG-ish constraints, there are still pathological cases.
You might look at our ------------- example and be clever and say "well, a - terminal can never follow a -- and vice versa, so we can eliminate a bunch of edges". Yes. But we're still stuck with a long sequence of -s that we need to parse every time we want to generate a mask. And any arbitrarily long sequence of -s IS perfectly valid (albeit horrible) in Python, Javascript etc., e.g. ------------------1.
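As a sanity check, Python really does accept arbitrarily long runs of unary minuses; each - is a unary negation, so an even count cancels out:

```python
# Each "-" parses as a unary negation, so they nest without limit.
assert --1 == 1        # -(-1)
assert ---1 == -1      # -(-(-1))
assert ------1 == 1    # six negations cancel
```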
So you're still going to end up processing a lot of terminals every time you want to generate a mask. And you're still doing GSS-heavy parser work for each terminal.
You can transform the grammar to remove unit and null reductions. You can carefully merge and simplify the GSS as you go.
But try as you might, a single terminal shift can trigger many reductions which, in GLR, tend to fan out. Ultimately those operations are pointer-heavy and cache-unfriendly, and hurt worst-case performance in a manner that's hard to engineer away.
I tried hard to make this fast. It's a dead end.
A common direction is: precompute a trie over the vocabulary’s byte strings and, each step, walk it while simulating the lexer and parser, pruning invalid branches.
This can work well for many grammars and workloads. People implement this and get decent average performance, especially for “simple” grammars.
But for general CFG-ish constraints (especially when you care about worst-case), you run into two issues:
Even if your lexical side is deterministic (say you enforce longest-match, or you have a DFA lexer), a single grammar terminal shift can trigger a chain of reductions.
In an LR parser, “shift terminal a” is often really: perform a chain of reductions (each popping stack states and pushing a goto state), then finally shift a.
Those reduction chains are table-driven and fast in the happy path, but in GLR they can fan out. The operations are pointer-heavy and cache-unfriendly: you’re manipulating a GSS, merging nodes, pruning branches, etc.
If you do that inside masking—i.e. while traversing a trie of possible tokens—you’re effectively building and mutating a GSS for a hypothetical next token, and doing it potentially thousands of times per step.
That’s exactly the kind of p99.9 work you don’t want on the critical path.
Even with tries, there are tokens whose byte strings correspond to long sequences of terminals (think: long runs of punctuation, or tokens containing lots of structural characters plus whitespace/newlines). Those paths can survive pruning for a while, so you end up doing a lot of terminal-level processing per mask computation.
And yes, long ugly sequences can be syntactically valid in real languages. Python, JS, etc. happily accept monstrosities like:
------------------1
That means “a lot of unary minus operators.” It’s valid, and it forces the parser to do real work on a long terminal sequence.
You can transform the grammar to remove unit/null reductions. You can aggressively merge equivalent GSS nodes. You can add clever memoization.
I tried hard to make the “trie + incremental parse simulation” approach behave well in worst-case latency terms. In my experience, it’s a dead end if you’re aiming for predictable sub-millisecond masking on arbitrary inputs/grammars.
A common direction is: precompute a trie over the vocabulary’s byte strings and, each step, walk it while simulating the lexer and parser, pruning invalid branches.
This can work well for many grammars and many workloads. My claim is narrower:
For general CFG-ish constraints + large vocabularies + strict worst-case deadlines, this approach has pathological cases that are hard to engineer away.
The big practical pain points are:

- reduction chains that fan out inside the masking loop (pointer-heavy GSS work per hypothetical token), and
- tokens whose bytes expand into long terminal sequences, which survive pruning and force real parser work per mask.
That’s what I mean by “dead end” in the original draft: not “nobody can make it fast,” but “I couldn’t get a per-token / per-trie-path simulation approach to have predictable p99.9 behavior at scale without the complexity ballooning.”
Token validity depends on what's on the stack. Instead of asking "does LLM token t work on this stack?" for each token, ask "given this stack, which LLM tokens work?"
That's the reframing. Now we need a data structure that turns "stack → allowed tokens" into something we can execute fast.
That’s where the weighted automaton comes in.
Token validity depends on what’s on the stack.
Instead of asking:
“does LLM token t work on this stack?” (for each t)
ask:
“given this stack, which LLM tokens work?”
That’s the reframing.
Now we want a data structure that turns:
stack → allowed-token bitset
into something we can execute quickly.
That’s where the weighted automaton comes in.
(Section not present)
Think of a finite automaton, except each transition carries a weight, and traversing paths combines weights.
In our weighted automata:

- weights are bitsets over the LLM vocabulary,
- intersection filters along a path,
- union merges where paths meet.

The automaton reads (a representation of) your current parse configuration: the stack’s state IDs, top down, plus the active lexer state.
Think of a finite automaton, except:

- each transition carries a weight, and
- traversing paths combines weights.

In the automata I’m talking about:

- weights are bitsets over the LLM vocabulary,
- intersection acts along a path (filtering),
- union acts where paths merge (collecting alternatives).
So it’s like “run a DFA on the stack”, but with “which tokens survive” flowing through it.
The automaton reads a representation of your current parse configuration: the stack’s state IDs, top down, plus the active lexer state.
A token is valid if it’s valid on any parse path in the GSS, so on a GSS you just union the results across paths.
(Section not present)
Another way to see it (which helped me): for each LLM token t, consider the set of parser stacks on which t is valid next.

That set of stacks is something you can represent with a finite automaton over stack state IDs (a kind of "language of stacks").

Of course, if you did that for every t, you'd have 200k automata. That's useless at runtime.
A weighted automaton is how you smash those 200k membership tests into one run:
(Section not present)
(Section not present)
It's also easy to modify the weighted automaton to stop early when further stack depth cannot change the result: precompute, for each automaton state, the tokens already guaranteed from that point ("final weights"), emit them immediately, and subtract them from the weights on outgoing transitions.

In our "language of stacks", this is equivalent to removing a stack from the language when a prefix of it already exists.
(Section not present)
(Section not present)
For a single LR stack, the runtime looks like:
def get_mask(stack_state_ids_top_to_bottom):
# frontier: map automaton_state -> bitset(tokens)
frontier = { A.start: ALL_TOKENS }
for sid in stack_state_ids_top_to_bottom:
new = {}
for a_state, tokens in frontier.items():
for (a2, weight) in A.step(a_state, sid):
tokens2 = tokens & weight # intersection (filter)
if tokens2.any():
new[a2] = new.get(a2, EMPTY) | tokens2 # union (merge paths)
frontier = new
if decided(frontier):
break
return combine_accepting(frontier)
For a GSS, you do basically the same computation but over a graph rather than a single list. (Details omitted here, but the point is: it’s the same ∩/∪ propagation pattern.)
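To make the ∩/∪ pattern concrete, here is a runnable toy with a hand-built automaton (the states, transitions, and 4-token vocabulary are invented for illustration; weights are Python ints used as bitsets):

```python
ALL = 0b1111                      # 4-token toy vocabulary
EMPTY = 0

# (automaton_state, stack_state_id) -> [(next_state, weight_bitset)]
TRANS = {
    ("start", 7): [("mid", 0b0011)],   # stacks topped by state 7: tokens 0,1
    ("start", 8): [("mid", 0b1100)],   # stacks topped by state 8: tokens 2,3
    ("mid", 0):   [("accept", ALL)],   # state 0 = bottom-of-stack marker
}

def get_mask(stack_state_ids_top_to_bottom):
    frontier = {"start": ALL}
    for sid in stack_state_ids_top_to_bottom:
        new = {}
        for a_state, tokens in frontier.items():
            for a2, weight in TRANS.get((a_state, sid), []):
                t2 = tokens & weight                    # intersect along path
                if t2:
                    new[a2] = new.get(a2, EMPTY) | t2   # union merging paths
        frontier = new
    return frontier.get("accept", EMPTY)

assert get_mask([7, 0]) == 0b0011
assert get_mask([8, 0]) == 0b1100
```

A real compiled automaton has far more states and 200k-bit weights, but the inner loop is exactly this: integer AND/OR.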
(Section not present)
For a single LR stack, the runtime looks like:
def get_mask(stack_state_ids_top_to_bottom):
# frontier: map automaton_state -> bitset(tokens)
frontier = { A.start: ALL_TOKENS }
for sid in stack_state_ids_top_to_bottom:
new = {}
for a_state, tokens in frontier.items():
for (a2, weight) in A.step(a_state, sid):
tokens2 = tokens & weight # intersection (filter)
if tokens2.any():
new[a2] = new.get(a2, EMPTY) | tokens2 # union (merge paths)
frontier = new
if decided(frontier):
break
return combine_accepting(frontier)
For a GSS, you do the same computation but over a graph rather than a single list:
(Details omitted here, but the point is: it’s the same ∩/∪ propagation pattern.)
Putting it together:
- commit(state, token) -> state: run the incremental GLR parser on the chosen token.
- get_mask(state) -> mask: run the weighted automaton over the stack/GSS.
(Section not present)
(Section not present)
I’m avoiding “this is optimal” as a theorem. GLR doesn’t really come with optimality theorems; it just feels optimal. Maybe because so many people use it, maybe just because it’s fast, maybe for some deeper reason. But it does feel optimal.
On the commit side, GLR-style techniques are already a strong fit for incremental CFG parsing. Table-driven decisions, compact ambiguity representation via GSS. There's a reason people working on these kinds of problems keep converging on GLR.
On the mask side, the weighted automaton performs the minimum work you'd reasonably hope for: read the stack once, never backtrack, stop when determined. Inner loops are bitset intersections and unions—fast, predictable.
This doesn’t magically resolve the lack of strict bounds on GLR parser time complexity: highly ambiguous grammars can still hit pathological cases, and you inherit those on the commit side. The point is that get_mask no longer depends on them.
I’m deliberately avoiding “this is optimal” as a theorem. Parsing isn’t a domain where you get many satisfying worst-case optimality results that also match real-world constraints.
But it feels close to optimal for two reasons: the commit side leans on mature GLR machinery, and the mask side reads the stack once with no backtracking.
(Section not present)
Precompiling the grammar into a weighted automaton involves determinization and minimization in a semiring setting. Large grammars can blow up if you're careless.
Getting compile-time and memory to behave took most of my engineering effort. That's probably the next post.
The cost shifts to compile time.
Precompiling the grammar into this Parser DWA involves determinization and simplification in a semiring-ish setting (bitset weights, union at merges, intersection along paths). If you do it naively, large grammars can blow up in memory/time.
Getting compile-time and memory to behave took most of my engineering effort:
That’s probably the next post.
If you’re building constrained decoding and you care about p99.9 latency, my main takeaway is:
- separate get_mask from commit,
- make commit do the parsing once for the chosen token,
- make get_mask a precomputed “read stack → bitset” computation.

That inversion is the whole trick.
(Section not present)
(Section not present)
Here’s a toy example that captures the shape of the problem.
Suppose your terminals include - and -- (or more generally, terminals that overlap). If you treat tokenization as “split into terminals in any way that works” (scannerless style), then a run of N dashes has Fib(N+1) possible segmentations (Fibonacci growth), because you’re tiling length N using pieces of length 1 and 2.

And tokenizers absolutely contain long weird tokens. In cl200k there are tokens that are on the order of 100+ repeated punctuation characters. For N = 112, that’s on the order of 10^23 segmentations.
That’s for a single candidate token.
And there are ~200k tokens to check.
This is obviously not something you can do per decoding step.
(Section not present)
(Section not present)
Another mental model that helped me: for each token t, consider the set of parser stacks on which t is valid next.
That set of stacks is a regular language over parser state IDs (this is closely related to the classical “viable prefixes are regular” result from LR theory).
So you can imagine an automaton that recognizes “stacks on which token t is valid.”

Of course, if you do that for every t, you’d have 200k automata. That’s useless at runtime.
A weighted automaton is how you smash those 200k membership tests into one run:
Same computation, but vectorized across tokens via bitsets.
(Section not present)
(Section not present)
If you implement the obvious “read stack symbols until you hit bottom” approach, you’ll still sometimes do more work than needed: often, after reading a small suffix of the stack, the set of valid tokens is already determined and deeper stack symbols can’t change it.
You can bake this into the automaton: give each state a “final weight” of tokens already guaranteed from that point, and subtract those from the weights on outgoing transitions.

Operationally, it’s the difference between “walk the whole stack every time” and “walk until nothing undecided remains in flight.”
So you can stop when there’s nothing left in flight.
I’m not going to cite code here, but conceptually it’s just a precomputed fixed-point thing: for each automaton state, compute which tokens are guaranteed from this point (“final weight”), subtract them from the weights you propagate forward, and you get a monotone decreasing set that hits empty quickly.
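Here is a runnable sketch of that fixed-point idea (all states and weights invented; a full version would also combine accepting weights at the bottom of the stack). Each automaton state carries a final weight of tokens guaranteed from that point; we emit those immediately, stop propagating them, and break when nothing undecided remains:

```python
ALL = 0b1111   # 4-token toy vocabulary

# (automaton_state, stack_state_id) -> [(next_state, weight_bitset)]
TRANS = {
    ("start", 7): [("mid", ALL)],
    ("mid", 0):   [("deep", 0b0100)],
}
# Tokens guaranteed valid once we reach a state, regardless of deeper stack.
FINAL = {"mid": 0b0011, "deep": 0b0100}

def get_mask(stack_ids):
    frontier = {"start": ALL}
    decided = FINAL.get("start", 0)
    steps = 0
    for sid in stack_ids:
        steps += 1
        new = {}
        for a_state, tokens in frontier.items():
            for a2, weight in TRANS.get((a_state, sid), []):
                t2 = tokens & weight
                decided |= t2 & FINAL.get(a2, 0)   # now guaranteed
                t2 &= ~FINAL.get(a2, 0)            # don't re-propagate them
                if t2:
                    new[a2] = new.get(a2, 0) | t2
        frontier = new
        if not frontier:
            break              # deeper stack can't change the result
    return decided, steps

# Tokens 0,1 are decided after one stack symbol, token 2 after two;
# nothing else is in flight, so we stop after 2 of the 4 symbols.
assert get_mask([7, 0, 9, 9]) == (0b0111, 2)
```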
(Section not present)
(Section not present)
For a single LR stack, the runtime looks like:
def get_mask(stack_state_ids_top_to_bottom):
# frontier: map automaton_state -> bitset(tokens)
frontier = { A.start: ALL_TOKENS }
decided = EMPTY # tokens already known valid from suffix read so far
for sid in stack_state_ids_top_to_bottom:
new = {}
for a_state, tokens in frontier.items():
for (a2, weight) in A.step(a_state, sid):
tokens2 = tokens & weight # intersection (filter)
if tokens2.any():
new[a2] = new.get(a2, EMPTY) | tokens2 # union (merge paths)
frontier = new
# Optionally accumulate “final/decided” contribution here and
# remove it from frontier tokens, so frontier only contains
# “still-undecided” tokens.
if not frontier: # nothing left that depends on deeper stack
break
return decided | combine_accepting(frontier)
This is the key operational point: the work per mask is bounded by how much of the stack actually matters, not by the stack’s depth.
(Section not present)
(Section not present)
For a GSS, you do basically the same computation, but over a graph rather than a single list.
Conceptually:
A token is valid if it’s valid on any stack path, so whenever GSS paths merge you union their token sets, and whenever automaton paths merge you union there too.
The important part is: it’s still the same “bitset flows through a graph via ∩ and ∪” pattern. No backtracking, no “try token t” loop.
(Section not present)
(Section not present)
Up to now I’ve treated “the weighted automaton” as magic.
It’s not magic, it’s just precomputation.
At a high level, you want an automaton that answers:
given the current lexer+parser state (represented by the stack/GSS), which LLM tokens could lead to a valid continuation?
There are two distinct problems mixed together:

- the lexing problem: which terminal sequences can an LLM token’s bytes produce, from the current lexer state?
- the parsing problem: which terminal sequences can the parser accept, from the current stack?

The thing I compile is basically a composition of two automata:

- a Terminal DWA for the lexing side, and
- a Parser DWA for the parsing side.
I’ll keep this section high-level (the details are where the compile-time pain is), but the decomposition matters because it explains why the runtime is so simple.
(Section not present)
(Section not present)
An LLM token is a byte string.
A grammar tokenizer/lexer consumes a stream of bytes and emits grammar terminals. Crucially:

- the lexer is stateful (it can be partway through a terminal), and
- the same bytes can lex differently depending on that state.
So the mapping “token → terminal sequence” is not a single fixed lookup. It depends on the current lexer state.
The Terminal DWA is the precomputed structure that answers:
from the current lexer state, which LLM tokens can produce which terminal next (and what lexer state do we end up in)?
A practical way to build it is: run the lexer automaton over a trie of the vocabulary’s byte strings, recording which tokens (as bitset weights) produce which terminals from each lexer state. Then determinize.
The key thing the Terminal DWA buys you is: it collapses “iterate over 200k tokens and run the lexer” into “traverse a small automaton state space and get bitsets.”
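A tiny sketch of what this buys you, with an invented toy lexer for terminals MINUS (-), DMINUS (--), NUM (1), and STR (""). It only checks whether a token's bytes are consumable (the real Terminal DWA also tracks which terminal sequence is produced), but it shows the shape: precompute per lexer state, then runtime is a table lookup instead of 200k lexer runs:

```python
# Toy lexer DFA: (state, char) -> state. States 1, 2, 3, 5 accept
# MINUS, DMINUS, NUM, STR respectively; 0 is the start state.
DFA = {(0, "-"): 1, (1, "-"): 2, (0, "1"): 3, (0, '"'): 4, (4, '"'): 5}
ACCEPT = {1, 2, 3, 5}
TOKENS = ["-", "--", "-1", '""']

def consumable(state, token, i=0):
    # Can the lexer, starting in `state`, consume all of the token's bytes,
    # possibly finishing terminals and restarting along the way?
    if i == len(token):
        return True
    nxt = DFA.get((state, token[i]))
    if nxt is not None and consumable(nxt, token, i + 1):
        return True
    # In an accepting state we may emit the terminal and restart.
    return state in ACCEPT and consumable(0, token, i)

# Precompute: lexer state -> bitmask of consumable tokens.
MASK_FROM = {s: sum(1 << i for i, t in enumerate(TOKENS) if consumable(s, t))
             for s in (0, 1, 2, 3, 4, 5)}

assert MASK_FROM[0] == 0b1111   # everything lexes from the start state
assert MASK_FROM[4] == 0b1000   # mid-string: only '""' can continue
```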
(Section not present)
(Section not present)
Now suppose the lexer says “the next terminal is a.” What does the LR parser do?

You can precompute, for each terminal a, an automaton that reads stack state IDs (top down) and represents all the ways the parser could legally process a, including the reduction chain.
Internally I represent the net stack transformation using a push/pop algebra (the polycyclic monoid is the clean formalism), but you don’t need to care about that to get the intuition: each reduction pops a handful of states and pushes one, a shift pushes one, and the whole chain nets out to “pop some suffix, push some replacement.”
This is the piece that takes “pointer-heavy GLR reduction simulation” off the runtime path and turns it into precomputed transitions.
(Section not present)
(Section not present)
Finally, you compose the Terminal DWA with the Parser DWA to get one deterministic weighted automaton that:

- reads the current lexer state and the stack’s state IDs once, and
- outputs the bitset of LLM tokens that are valid next.
This is the automaton you run in get_mask.
One subtle but very useful algebraic fact during composition: when you concatenate stack-effect templates, you get adjacent “push then pop” pairs that cancel. That lets you simplify aggressively while building the composed automaton (and prune impossible paths when pushes/pops mismatch).
Also: at runtime I don’t actually need the automaton to update the stack. commit is implemented by the real lexer+parser. get_mask just needs to know whether a token is valid, so the automaton can discard some “push” bookkeeping once it has done its filtering job.
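Here is a sketch of the push/pop cancellation (simplified; a real implementation works on automaton transitions, not plain lists). A stack effect is (pops, pushes); composing two effects cancels adjacent push-then-pop pairs and prunes mismatches:

```python
def compose(e1, e2):
    """Apply effect e1, then e2. pops are taken top-first; pushes end
    with the new top of stack. Returns None if the path is impossible."""
    pops1, pushes1 = list(e1[0]), list(e1[1])
    pops2, pushes2 = list(e2[0]), list(e2[1])
    while pushes1 and pops2:
        if pushes1[-1] != pops2[0]:
            return None            # push/pop mismatch: prune this path
        pushes1.pop()              # adjacent push-then-pop pair cancels
        pops2.pop(0)
    return (pops1 + pops2, pushes1 + pushes2)

# push A, then pop A: net effect is the identity
assert compose(([], ["A"]), (["A"], [])) == ([], [])
# push A, then pop B: impossible, pruned
assert compose(([], ["A"]), (["B"], [])) is None
# pop X & push A,B, then pop B & push C: net "pop X, push A,C"
assert compose((["X"], ["A", "B"]), (["B"], ["C"])) == (["X"], ["A", "C"])
```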
(Section not present)
(Section not present)
A question I always get is: “do you really scan the entire stack every time?”
No. In practice you only need a bounded suffix, and you can make that precise.
Two related reasons:

- LR stacks encode a lot in the top states: the valid-token decision is usually determined by a short suffix.
- The early-exit construction (final weights subtracted from outgoing transitions) detects exactly when deeper symbols can’t change the result.
In implementation terms: masks stabilize quickly, and you can early-exit.
(Section not present)
(Section not present)
The term “token” is overloaded:

- LLM tokens: byte strings from the model’s vocabulary.
- Grammar terminals: lexer output (IDENT, "{", NUMBER, keywords, etc.).

These tokenizations don’t align.
Even if your lexer uses longest-match, longest-match is inherently forward-looking: you can’t know a match is final until you see what comes next.
Classic example:

"+" followed by "+" should yield INCREMENT ("++"), not two PLUS. In a streaming setting, after you see the first "+", you’re in a state where:

- the terminal could finalize as PLUS, or
- extend into INCREMENT.

So you have to represent that uncertainty somehow.
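A toy longest-match lexer over PLUS (+) and INCREMENT (++) makes the lookahead dependence visible (names invented):

```python
def lex(src):
    # Greedy longest-match over the terminals "+" and "++".
    out, i = [], 0
    while i < len(src):
        if src.startswith("++", i):
            out.append("INCREMENT"); i += 2
        elif src[i] == "+":
            out.append("PLUS"); i += 1
        else:
            raise ValueError(src[i])
    return out

assert lex("+") == ["PLUS"]
assert lex("++") == ["INCREMENT"]
assert lex("+++") == ["INCREMENT", "PLUS"]
```

After streaming just the first "+" of "++", neither PLUS nor INCREMENT is committed yet; the constraint state has to carry both possibilities.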
The approach I use is what I call inhibited terminals (this is just a convenient operational trick, not a deep theory term):
This plays nicely with GLR+GSS, because “fork and prune” is already the parser’s native move.
Practically, the constraint state you carry around is not just “parser stack(s)”; it’s “(parser stack(s), lexer state(s))”. get_mask needs to condition on both, which is why the Parser DWA has multiple initial states (one per active lexer state).
(Section not present)
(Section not present)
At runtime you now have two different kinds of work, with very different performance profiles:
(Section not present)
commit(state, token) -> state

(Section not present)
This is the inherently “CFG-y” part. It can still have pathological cases (that’s life), but you at least get decades of parsing engineering behind it.
(Section not present)
get_mask(state) -> mask

(Section not present)
This is the performance-critical part, and it’s exactly where precomputation buys you predictability.
(Section not present)
(Section not present)
For incremental CFG parsing with ambiguity, GLR with a graph-structured stack is a pretty hard baseline to beat. There’s a reason people working on these problems keep converging on GLR-like techniques: table-driven decisions, no backtracking, and compact ambiguity representation via the GSS.
You still inherit GLR’s lack of comforting worst-case bounds in “fully adversarial ambiguity” settings. In theory, GLR can degrade badly on highly ambiguous grammars/inputs. In practice, for “programming-language-ish” grammars with reasonable disambiguation, it behaves well.
(Section not present)
(Section not present)
The weighted-automaton mask computation does the minimum work you’d reasonably hope for: read the stack once, never backtrack, stop as soon as the mask is determined.
In other words: the runtime work scales with the size of the parse configuration, not with vocabulary size and not with “how gnarly are the tokens.”
(Section not present)
(Section not present)
(Section not present)
Grammar-constrained generation has two very different problems hiding under one name:
- commit(token): update the constraint state after you sampled a token. This is incremental parsing.
- get_mask(state): given the current constraint state, return the set of next LLM tokens that keep you valid. This is the hard part on the critical path.

For commit, incremental parsing techniques (LR/GLR + a graph-structured stack) are a good fit.
For get_mask, the usual “simulate candidate tokens through the parser” approach ties runtime to vocabulary size and tokenization ambiguity in ways that are brutal for p99.9 latency.
The idea I want to put on the record:
Invert the problem. Precompile the grammar+tokenizer into a bitset-weighted automaton whose input is the parser stack (or GSS). Then
get_mask becomes a single pass that reads stack state IDs and does bitset ∩/∪ operations—no per-vocabulary-token simulation.
(Section not present)
(Section not present)
You have a grammar (often “JSON Schema”, though in practice this means the structural subset you can compile into something CFG-like), and you want the LLM to only emit valid outputs.
At each decoding step the model produces logits over the vocabulary V. You sample a token. You append it to the output. Repeat.
Constrained decoding is: don’t sample tokens that would make the output invalid.
So we maintain a constraint state and, for each step, compute a mask of allowed tokens.
No more broken JSON—at least syntactically.
(Section not present)
(Section not present)
Inference servers batch many sequences on the GPU. The CPU-side mask has to arrive “in time” for each decode step, across the whole batch.
If the GPU step latency is ~1ms in your throughput regime, then you need your mask computation to reliably finish within that same window for every active sequence. A single slow mask can stall the whole batch, which cascades into ugly scheduling behavior.
So it’s not “can you do masking fast on average?” It’s “can you do it fast every single step?”
That’s the lens for everything below.
commit: incremental parsing is a good mental model

(Section not present)
(Section not present)
When you call commit(state, token), you are incrementally parsing a growing prefix. This is a well-studied problem.
(Section not present)
(Section not present)
There are three different “alphabets” floating around:

- bytes,
- LLM tokens (byte strings),
- grammar terminals (--, Identifier, etc.).

Also: LR parser state IDs, which are the alphabet the masking automaton reads.
(Section not present)
(Section not present)
If your constraints have nondeterminism (common once you compile real schema-ish structures, alternatives, optionality, “oneOf”-style branching, etc.), you often end up with “multiple parses remain viable so far.”
The GLR family’s core trick is: don’t backtrack, represent the ambiguity compactly. You keep a set of stack “heads” in a shared graph.
This tends to work well for commit because:

- per-token work is table-driven and local to the stack tops, and
- ambiguity is shared in the GSS instead of duplicated across stacks.
So far so good.
get_mask: this is where things get ugly

(Section not present)
(Section not present)
Now you have a parse state (often a GSS), and you want a mask over LLM tokens.
A straightforward approach is:
For each LLM token t, check whether appending it keeps the parse valid.
But you don’t actually append “an LLM token” to the parser. You append terminal symbols: possibly several per token, with a segmentation that depends on the current lexer state.

Doing that per vocabulary token is a non-starter.
(Section not present)
(Section not present)
Here’s the precise point I want to make with the “112 dashes” anecdote:
Even if you don’t like the specific tokenizer example, incremental lexing can be ambiguous, and ambiguity can explode.
In a toy lexer where the only terminals are - and --, a run of N dashes has Fib(N+1) segmentations (Fibonacci growth). For N = 112, that’s on the order of 10^23 ways to segment.
Why does this matter to masking?
Because the naive masking algorithm effectively asks:
“Is there any way to segment this LLM token’s bytes into terminals such that the parser can consume them from the current state?”
If you handle “token → terminal sequence(s)” by exploring a tokenization trellis, then in the worst case you’re exploring an exponential number of segmentations per candidate LLM token.
Most practical systems avoid the full explosion with tries/trellises and memoization. That helps. But it doesn’t change the fundamental shape of the problem: you’re still trying to answer a question about all 200k tokens by simulating “what if we took this token?”
(Section not present)
(Section not present)
The key reframing is:
Token validity depends on the parse stack(s).
So instead of asking “does token t work on this stack?”, ask “given this stack, which tokens work?”

That sounds tautological, but it changes the engineering shape: per-step work now scales with the parse state instead of with the vocabulary.

Now we need a data structure that turns “stack → allowed tokens” into something we can execute fast.
That’s where the weighted automaton comes in.
(Section not present)
(Section not present)
(Empty section)
(Section not present)
(Section not present)
Think of a finite automaton, except each transition carries a weight and traversing paths combines weights.
In my setting:

- weights are bitsets over the LLM vocabulary,
- intersection applies along a path,
- union applies where paths merge.

This is exactly the “semiring” you’d expect on sets:

- “multiplication” is intersection,
- “addition” is union,
- “one” is the full set, “zero” is the empty set.
(Section not present)
(Section not present)
The automaton’s input alphabet is the set of LR parser state IDs.
The automaton reads (a representation of) your current parse configuration: the stack’s state IDs, top down, plus the active lexer state.
So the automaton is not replacing the parser. It’s a precompiled masking machine:
Input: current stack / GSS (state IDs)
Output: bitset mask over LLM tokens valid next
(Section not present)
(Section not present)
Another way to see it (which helped me): for each LLM token t, consider the set of parser stacks on which t is valid next.

That set of stacks is something you can represent with a finite automaton over stack state IDs (a “stack language” of sorts).

If you did that for every t, you’d have 200k automata. That’s useless at runtime.
A weighted automaton is how you smash those 200k membership tests into one run:
(Section not present)
(Section not present)
Mask computation becomes: feed the stack/GSS state IDs through the automaton once, intersecting weights along paths and unioning where paths merge.

Critically: there is no loop over the vocabulary, and no per-token parser simulation.
The runtime is driven by what the parse state actually contains.
And you can early-exit when further stack depth cannot change the result (the decided(frontier) hook above), which is common if the automaton reaches a fixed point.
(Section not present)
(Section not present)
Putting it together:
- commit(token) uses GLR machinery: shift, reduce, and merge in the GSS, once, for the chosen token.
- get_mask(state) uses the weighted automaton: one pass over the stack/GSS with bitset ∩/∪.
So you get a split where:
- the CFG-heavy work happens once per emitted token (commit),
- the per-step critical-path work is a predictable automaton pass (get_mask).

(Section not present)
(Section not present)
I’m avoiding “this is optimal” as a theorem, but the shape matches what I want in production: per-step cost that is predictable and driven by the parse state, not by vocabulary size or tokenization.
You still have worst cases (deep stacks exist), but it’s a worst case that comes from the actual output structure (nesting depth), not from adversarial vocabulary/tokenization effects.
(Section not present)
(Section not present)
“JSON Schema” caveat: full JSON Schema includes semantic constraints (ranges, regexes, uniqueItems, etc.). This post is about the grammar-shaped core. You can layer semantic predicates on top, but that’s a separate axis.
Fast runtime, slow compile: precompiling into a weighted automaton involves determinization/minimization-like steps (in a semiring setting) and can blow up if you’re careless. Most of my engineering time went into making compile-time and memory behave on large grammars. That’s a follow-up.
Memory is real: a 200k-vocab bitset is ~25KB. You cannot naively hang a fresh 25KB bitset off every transition. You need sharing/interning/chunking/sparsity tricks. Again: follow-up.
This doesn’t magically remove ambiguity: GLR ambiguity still exists; the goal is to make mask computation robust in the presence of it.
(Section not present)
(Section not present)
Most constrained decoding systems I’ve seen in the LLM ecosystem do some variant of: walk a trie of the vocabulary while simulating the constraint (lexer/parser or regex engine), with pruning and memoization.
That can be great for many use cases (especially regular/near-regular constraints like JSON).
The approach I’m describing is different in the “mask” half: get_mask is stack-driven rather than vocabulary-/trie-driven.

I’m not claiming other approaches are “wrong.” I’m saying: if you care about strict worst-case deadlines in a batched inference setting, you want a get_mask whose cost is not dominated by vocabulary size or tokenization pathologies.
(Section not present)
(Section not present)
Constrained generation is often sold as “just mask invalid tokens.” That hides the real split:
- The state-update half (commit) is relatively well served by GLR/GSS techniques.
- The mask half (get_mask) is the true bottleneck if you care about p99.9 latency at scale.

My contribution/claim here is the reframing and the data structure:
Precompile the grammar+tokenizer into a bitset-weighted automaton over parser stack state IDs, and compute masks by reading the stack/GSS once with ∩/∪ bitset propagation.
That’s the idea I wanted to put on the record. If there’s interest, the next post would be about the “slow compile” side: how to build/determinize/minimize these automata without exponential blowups, and how to represent the weights without storing a 25KB bitset per edge.