This document interleaves three versions of the article for comparison:
- constrained-decoding-weighted-automata.md (original)
- constrained-decoding-weighted-automata 1.md (expanded)
- Stack-Driven Masking...md (polished)

I want to share a take on grammar-constrained generation I've been working on for a while. This post is a preliminary writeup: enough detail to communicate the core read-GLR-stack-with-a-weighted-automaton idea (and timestamp it). I’ll likely follow up with a post on the various tricks you need to actually build the damn things later.
I want to share a take on grammar-constrained generation I’ve been working on for a while. This post is a preliminary writeup: enough detail to communicate the core “read the (G)LR stack with a weighted automaton” idea (and timestamp it). I’ll likely follow up with a post on the various tricks you need to actually build the damn things without the compiler exploding.
This post is intentionally a “foot-in-the-door” writeup: enough detail to communicate the core idea (and timestamp it), not a full implementation guide. I’ll likely follow up with the compile-time tricks later.
You have a grammar (often a JSON Schema) and you want the LLM to produce grammatically valid output.
At each step, the LLM outputs logits over the vocabulary V. You sample a token. Append it to the output. Repeat.
Constrained decoding is: don’t sample tokens that would make the output grammatically invalid.
So we maintain a constraint state and, for each step, compute a mask of allowed tokens, and apply it to the logits.
No more broken JSON.
You have a grammar (often a JSON Schema, or “JSON-but-with-some-fields”, or a programming-language subset) and you want the LLM to produce grammatically valid output.
At each step, the LLM outputs logits over the vocabulary V. You sample a token. Append it to the output. Repeat.
Constrained decoding is: don’t sample tokens that would make the output grammatically invalid.
So we maintain a constraint state and, for each step, compute a mask of allowed tokens, and apply it to the logits.
No more broken JSON.
A precise way to say “allowed” (this matters later):
A token t is valid given the current prefix p if there exists some continuation c such that p · t · c is in the grammar language.
Equivalently, the mask is

mask(p) = { t ∈ V : ∃c. p · t · c ∈ L(G) }

That “∃c” is the whole game: we’re not asking “does this token complete a valid document right now?”, we’re asking “could it still lead to a valid completion?”
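To make the existential quantifier concrete, here is a brute-force sketch over a toy finite language (LANG and VOCAB are invented for illustration; real grammar languages are infinite, which is exactly why the machinery below exists):

```python
# Toy finite "grammar language" and toy vocabulary (illustrative only).
LANG = {"{}", '{"a":1}'}
VOCAB = ["{", "}", '"a"', ":1}", "x"]

def mask(prefix):
    # Token t is allowed iff some continuation exists, i.e. some full
    # string in LANG extends prefix + t.
    return {t for t in VOCAB
            if any(s.startswith(prefix + t) for s in LANG)}

assert mask("") == {"{"}
assert mask("{") == {"}", '"a"'}
assert mask('{"a"') == {":1}"}
```

Note that `mask("{")` allows `"}"` even though `"{"` alone is not a complete document: validity is about reachability of some completion, not completeness right now.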
(Section not present)
Think of it as two functions:
- get_mask(state) returns a bitmask over the LLM vocabulary.
- commit(state, token) updates the parser state with the chosen token.

At runtime the loop looks like:
loop:
parallel:
GPU: logits
CPU: mask
apply mask
sample token
commit token to model and constraint
Think of constrained decoding as two functions:
- get_mask(state) -> bitmask: returns a bitmask over the LLM vocabulary.
- commit(state, token) -> state: updates the constraint state with the chosen token.

At runtime the loop looks like:
loop:
parallel:
GPU: logits
CPU: mask = get_mask(state)
apply mask to logits
sample token
state = commit(state, token)
Nothing fancy, except: get_mask is on the critical path of decoding.
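Here is a minimal runnable sketch of that loop, with random numbers standing in for model logits and a toy (ab)* constraint standing in for a real grammar (everything here is invented for illustration):

```python
import random

VOCAB = ["a", "b", "<eos>"]

# Toy constraint: the output must match (ab)*.
# State 0 = between pairs, state 1 = saw "a", waiting for "b".
def get_mask(state):
    return {0: {"a", "<eos>"}, 1: {"b"}}[state]

def commit(state, token):
    return 1 if token == "a" else 0

random.seed(0)
state, out = 0, []
while True:
    logits = {t: random.random() for t in VOCAB}   # stand-in for the GPU step
    allowed = get_mask(state)                      # CPU-side mask
    token = max(allowed, key=logits.__getitem__)   # masked greedy sampling
    if token == "<eos>":
        break
    out.append(token)
    state = commit(state, token)

text = "".join(out)
```

Whatever the (fake) logits do, `text` always matches (ab)*: the mask makes invalid tokens unsampleable.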
It’s useful to separate constrained decoding into two methods:
get_mask(state) -> bitset[V]
Returns which LLM tokens are valid next from this state.
commit(state, token)
Update the constraint state after that token has been chosen.
At runtime you want a tight loop like:
loop:
parallel:
GPU: compute logits
CPU: compute mask
apply mask to logits
sample token
commit token to:
- model KV-cache
- constraint state
This split matters because:
- commit is “do work proportional to what we actually emitted” (one token).
- get_mask is “do work every step no matter what” (critical path).

LLM inference is batched: multiple user requests run on the GPU at once. And mask generation sits directly on the critical decoding path.
If you want to generate 1k tokens per second, you have 1 millisecond to produce a mask. And if you miss even a single one - if a mask takes 2ms instead of 1 - it screws up scheduling. Masks can't be late; they have to arrive on time every time. It's worst-case performance that matters.
It’s not “can you do masking fast on average?” It’s “can you do it fast every single step?”
LLM inference is batched: multiple user requests run on the GPU at once, and each step wants a mask at the same time.
If you want to generate ~1k tokens/sec, you have ~1ms to produce a mask. And if you miss even a single one—if a mask takes 2ms instead of 1—then in a batched setting it screws up scheduling: the GPU sits waiting on a straggler CPU mask, and suddenly your p99 latency is trash.
Masks can’t be late; they have to arrive on time every step. It’s worst-case performance that matters.
It’s not “can you do masking fast on average?”
It’s “can you do it fast every single step?”
(Section not present)
The commit op is analogous to incremental parsing. You're incrementally parsing a growing prefix. Many battle-tested systems like tree-sitter and Lezer use Generalized LR (GLR) parsing with a graph-structured stack (GSS).
The LR parser maintains a stack of state IDs and decides what to do by reading the top of the stack plus the next input symbol. The "Generalized" part handles ambiguity: instead of a single stack, you keep a graph of possible stacks sharing structure.
Glossary: grammar terminals are symbols like --.

The commit op is basically incremental parsing: you’re incrementally parsing a growing prefix.
Many battle-tested systems like tree-sitter and Lezer use Generalized LR (GLR) parsing with a graph-structured stack (GSS).
So commit is not the “new idea” here. commit is: use a decent incremental GLR parser and keep it compact.
The interesting part is get_mask.
Glossary: grammar terminals are symbols like --, IDENT, STRING.

(Section not present)
Now you have a parse state (often a GSS), and you want a mask over LLM tokens.
A straightforward approach is:
For each LLM token t, check whether appending it keeps the parse valid.
But you don't append an LLM token to the parser. You append terminal symbols. A single LLM token might expand to multiple terminals, and that segmentation can be ambiguous.
Now you have a parse state (often a GSS), and you want a mask over LLM tokens.
A straightforward approach is:
For each LLM token t, check whether appending it keeps the parse valid.
But you don’t append an LLM token to the parser. You append terminal symbols.
And a single LLM token is a byte string which might:

- expand to multiple terminals (e.g. "}\n" could be RBRACE, NEWLINE, …),
- start or stop partway through a terminal (e.g. inside true),

So “try token t” is really “try every way t’s bytes could advance the lexer into some terminal sequence, and then try advancing the parser for that terminal sequence.”
That mismatch is where a lot of “obvious” masking schemes get wrecked.
(Section not present)
Consider: in a lexer where - and -- are both valid terminals, a run of N dashes has Fib(N+1) possible segmentations (Fibonacci growth). The longest non-space token in cl200k (GPT-5's tokenizer) is 112 dashes. That's on the order of 10^23 segmentations to consider—for a single candidate token.
And there are 200k tokens to check.
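You can check the Fibonacci claim directly by counting the ways to tile a run of N dashes with pieces of length 1 (-) and 2 (--):

```python
def segmentations(n):
    # Number of ways to tile a run of n dashes with "-" and "--".
    a, b = 1, 1          # ways for n = 0 and n = 1
    for _ in range(n - 1):
        a, b = b, a + b  # Fibonacci recurrence
    return b if n > 0 else 1

assert segmentations(1) == 1          # -
assert segmentations(2) == 2          # --, -|-
assert segmentations(3) == 3
assert segmentations(112) > 10**22    # ~1.8e23 for the 112-dash token
```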
(Section not present)
(Section not present)
A common direction is:

Precompute a trie over the vocabulary's byte strings. Each step, walk the trie while simulating the lexer and parser, pruning branches as soon as they become invalid. (Our ------------- token, for example, becomes one long path in the trie.)
This can work well for many grammars and workloads. But for general CFG-ish constraints, there are still pathological cases.
You might look at our ------------- example and be clever and say "well, a - terminal can never follow a -- and vice versa, so we can eliminate a bunch of edges". Yes. But we're still stuck with a long sequence of -s that we need to parse every time we want to generate a mask. And any arbitrarily long sequence of -s IS perfectly valid (albeit horrible) in Python, Javascript etc., e.g. ------------------1.
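As a sanity check, Python really does accept arbitrarily long runs of unary minuses; each - is a unary negation, so an even count cancels out:

```python
# Each "-" parses as a unary negation, so they nest without limit.
assert --1 == 1        # -(-1)
assert ---1 == -1      # -(-(-1))
assert ------1 == 1    # six negations cancel
```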
So you're still going to end up processing a lot of terminals every time you want to generate a mask. And you're still doing GSS-heavy parser work for each terminal.
You can transform the grammar to remove unit and null reductions. You can carefully merge and simplify the GSS as you go.
But try as you might, a single terminal shift can trigger many reductions which, in GLR, tend to fan out. Ultimately those operations are pointer-heavy and cache-unfriendly, and hurt worst-case performance in a manner that's hard to engineer away.
I tried hard to make this fast. It's a dead end.
A common direction is: precompute a trie over the vocabulary’s byte strings and, each step, walk it while simulating the lexer and parser, pruning invalid branches.
This can work well for many grammars and workloads. People implement this and get decent average performance, especially for “simple” grammars.
But for general CFG-ish constraints (especially when you care about worst-case), you run into two issues:
Even if your lexical side is deterministic (say you enforce longest-match, or you have a DFA lexer), a single grammar terminal shift can trigger a chain of reductions.
In an LR parser, “shift terminal a” is often really: perform a chain of reductions (each popping stack states and pushing a goto state), then finally shift a.
Those reduction chains are table-driven and fast in the happy path, but in GLR they can fan out. The operations are pointer-heavy and cache-unfriendly: you’re manipulating a GSS, merging nodes, pruning branches, etc.
If you do that inside masking—i.e. while traversing a trie of possible tokens—you’re effectively building and mutating a GSS for a hypothetical next token, and doing it potentially thousands of times per step.
That’s exactly the kind of p99.9 work you don’t want on the critical path.
Even with tries, there are tokens whose byte strings correspond to long sequences of terminals (think: long runs of punctuation, or tokens containing lots of structural characters plus whitespace/newlines). Those paths can survive pruning for a while, so you end up doing a lot of terminal-level processing per mask computation.
And yes, long ugly sequences can be syntactically valid in real languages. Python, JS, etc. happily accept monstrosities like:
------------------1
That means “a lot of unary minus operators.” It’s valid, and it forces the parser to do real work on a long terminal sequence.
You can transform the grammar to remove unit/null reductions. You can aggressively merge equivalent GSS nodes. You can add clever memoization.
I tried hard to make the “trie + incremental parse simulation” approach behave well in worst-case latency terms. In my experience, it’s a dead end if you’re aiming for predictable sub-millisecond masking on arbitrary inputs/grammars.
A common direction is: precompute a trie over the vocabulary’s byte strings and, each step, walk it while simulating the lexer and parser, pruning invalid branches.
This can work well for many grammars and many workloads. My claim is narrower:
For general CFG-ish constraints + large vocabularies + strict worst-case deadlines, this approach has pathological cases that are hard to engineer away.
The big practical pain points are:

- reduction chains that fan out inside the masking loop (pointer-heavy GSS work per hypothetical token), and
- tokens whose bytes expand into long terminal sequences, which survive pruning and force real parser work per mask.
That’s what I mean by “dead end” in the original draft: not “nobody can make it fast,” but “I couldn’t get a per-token / per-trie-path simulation approach to have predictable p99.9 behavior at scale without the complexity ballooning.”
Token validity depends on what's on the stack. Instead of asking "does LLM token t work on this stack?" for each token, ask "given this stack, which LLM tokens work?"
That's the reframing. Now we need a data structure that turns "stack → allowed tokens" into something we can execute fast.
That’s where the weighted automaton comes in.
Token validity depends on what’s on the stack.
Instead of asking:
“does LLM token t work on this stack?” (for each t)
ask:
“given this stack, which LLM tokens work?”
That’s the reframing.
Now we want a data structure that turns:
stack → allowed-token bitset
into something we can execute quickly.
That’s where the weighted automaton comes in.
(Section not present)
Think of a finite automaton, except each transition carries a weight, and traversing paths combines weights.
In our weighted automata:

- weights are bitsets over the LLM vocabulary,
- intersection filters along a path,
- union merges where paths meet.

The automaton reads (a representation of) your current parse configuration: the stack’s state IDs, top down, plus the active lexer state.
Think of a finite automaton, except:

- each transition carries a weight, and
- traversing paths combines weights.

In the automata I’m talking about:

- weights are bitsets over the LLM vocabulary,
- intersection acts along a path (filtering),
- union acts where paths merge (collecting alternatives).
So it’s like “run a DFA on the stack”, but with “which tokens survive” flowing through it.
The automaton reads a representation of your current parse configuration: the stack’s state IDs, top down, plus the active lexer state.
A token is valid if it’s valid on any parse path in the GSS, so on a GSS you just union the results across paths.
(Section not present)
Another way to see it (which helped me): for each LLM token t, consider the set of parser stacks on which t is valid next.

That set of stacks is something you can represent with a finite automaton over stack state IDs (a kind of "language of stacks").

Of course, if you did that for every t, you'd have 200k automata. That's useless at runtime.
A weighted automaton is how you smash those 200k membership tests into one run:
(Section not present)
(Section not present)
It's also easy to modify the weighted automaton to stop early when further stack depth cannot change the result: precompute, for each automaton state, the tokens already guaranteed from that point ("final weights"), emit them immediately, and subtract them from the weights on outgoing transitions.

In our "language of stacks", this is equivalent to removing a stack from the language when a prefix of it already exists.
(Section not present)
(Section not present)
For a single LR stack, the runtime looks like:
def get_mask(stack_state_ids_top_to_bottom):
# frontier: map automaton_state -> bitset(tokens)
frontier = { A.start: ALL_TOKENS }
for sid in stack_state_ids_top_to_bottom:
new = {}
for a_state, tokens in frontier.items():
for (a2, weight) in A.step(a_state, sid):
tokens2 = tokens & weight # intersection (filter)
if tokens2.any():
new[a2] = new.get(a2, EMPTY) | tokens2 # union (merge paths)
frontier = new
if decided(frontier):
break
return combine_accepting(frontier)
For a GSS, you do basically the same computation but over a graph rather than a single list. (Details omitted here, but the point is: it’s the same ∩/∪ propagation pattern.)
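To make the ∩/∪ pattern concrete, here is a runnable toy with a hand-built automaton (the states, transitions, and 4-token vocabulary are invented for illustration; weights are Python ints used as bitsets):

```python
ALL = 0b1111                      # 4-token toy vocabulary
EMPTY = 0

# (automaton_state, stack_state_id) -> [(next_state, weight_bitset)]
TRANS = {
    ("start", 7): [("mid", 0b0011)],   # stacks topped by state 7: tokens 0,1
    ("start", 8): [("mid", 0b1100)],   # stacks topped by state 8: tokens 2,3
    ("mid", 0):   [("accept", ALL)],   # state 0 = bottom-of-stack marker
}

def get_mask(stack_state_ids_top_to_bottom):
    frontier = {"start": ALL}
    for sid in stack_state_ids_top_to_bottom:
        new = {}
        for a_state, tokens in frontier.items():
            for a2, weight in TRANS.get((a_state, sid), []):
                t2 = tokens & weight                    # intersect along path
                if t2:
                    new[a2] = new.get(a2, EMPTY) | t2   # union merging paths
        frontier = new
    return frontier.get("accept", EMPTY)

assert get_mask([7, 0]) == 0b0011
assert get_mask([8, 0]) == 0b1100
```

A real compiled automaton has far more states and 200k-bit weights, but the inner loop is exactly this: integer AND/OR.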
(Section not present)
For a single LR stack, the runtime looks like:
def get_mask(stack_state_ids_top_to_bottom):
# frontier: map automaton_state -> bitset(tokens)
frontier = { A.start: ALL_TOKENS }
for sid in stack_state_ids_top_to_bottom:
new = {}
for a_state, tokens in frontier.items():
for (a2, weight) in A.step(a_state, sid):
tokens2 = tokens & weight # intersection (filter)
if tokens2.any():
new[a2] = new.get(a2, EMPTY) | tokens2 # union (merge paths)
frontier = new
if decided(frontier):
break
return combine_accepting(frontier)
For a GSS, you do the same computation but over a graph rather than a single list:
(Details omitted here, but the point is: it’s the same ∩/∪ propagation pattern.)
Putting it together:
- commit(state, token) -> state: run the incremental GLR parser on the chosen token.
- get_mask(state) -> mask: run the weighted automaton over the stack/GSS.
(Section not present)
(Section not present)
I’m avoiding “this is optimal” as a theorem. GLR doesn’t really come with optimality theorems; it just feels optimal. Maybe because so many people use it, maybe just because it’s fast, maybe for some deeper reason. But it does feel optimal.
On the commit side, GLR-style techniques are already a strong fit for incremental CFG parsing. Table-driven decisions, compact ambiguity representation via GSS. There's a reason people working on these kinds of problems keep converging on GLR.
On the mask side, the weighted automaton performs the minimum work you'd reasonably hope for: read the stack once, never backtrack, stop when determined. Inner loops are bitset intersections and unions—fast, predictable.
This doesn’t magically resolve the lack of strict bounds on GLR parser time complexity: highly ambiguous grammars can still hit pathological cases, and you inherit those on the commit side. The point is that get_mask no longer depends on them.
I’m deliberately avoiding “this is optimal” as a theorem. Parsing isn’t a domain where you get many satisfying worst-case optimality results that also match real-world constraints.
But it feels close to optimal for two reasons: the commit side leans on mature GLR machinery, and the mask side reads the stack once with no backtracking.
(Section not present)
Precompiling the grammar into a weighted automaton involves determinization and minimization in a semiring setting. Large grammars can blow up if you're careless.
Getting compile-time and memory to behave took most of my engineering effort. That's probably the next post.
The cost shifts to compile time.
Precompiling the grammar into this Parser DWA involves determinization and simplification in a semiring-ish setting (bitset weights, union at merges, intersection along paths). If you do it naively, large grammars can blow up in memory/time.
Getting compile-time and memory to behave took most of my engineering effort:
That’s probably the next post.
If you’re building constrained decoding and you care about p99.9 latency, my main takeaway is:
- separate get_mask from commit,
- make commit do the parsing once for the chosen token,
- make get_mask a precomputed “read stack → bitset” computation.

That inversion is the whole trick.
(Section not present)
(Section not present)
Here’s a toy example that captures the shape of the problem.
Suppose your terminals include - and -- (or more generally, terminals that overlap). If you treat tokenization as “split into terminals in any way that works” (scannerless style), then a run of N dashes has Fib(N+1) possible segmentations (Fibonacci growth), because you’re tiling length N using pieces of length 1 and 2.

And tokenizers absolutely contain long weird tokens. In cl200k there are tokens that are on the order of 100+ repeated punctuation characters. For N = 112, that’s on the order of 10^23 segmentations.
That’s for a single candidate token.
And there are ~200k tokens to check.
This is obviously not something you can do per decoding step.
(Section not present)
(Section not present)
Another mental model that helped me: for each token t, consider the set of parser stacks on which t is valid next.
That set of stacks is a regular language over parser state IDs (this is closely related to the classical “viable prefixes are regular” result from LR theory).
So you can imagine an automaton that recognizes “stacks on which token t is valid.”

Of course, if you do that for every t, you’d have 200k automata. That’s useless at runtime.
A weighted automaton is how you smash those 200k membership tests into one run:
Same computation, but vectorized across tokens via bitsets.
(Section not present)
(Section not present)
If you implement the obvious “read stack symbols until you hit bottom” approach, you’ll still sometimes do more work than needed: often, after reading a small suffix of the stack, the set of valid tokens is already determined and deeper stack symbols can’t change it.
You can bake this into the automaton: give each state a “final weight” of tokens already guaranteed from that point, and subtract those from the weights on outgoing transitions.

Operationally, it’s the difference between “walk the whole stack every time” and “walk until nothing undecided remains in flight.”
So you can stop when there’s nothing left in flight.
I’m not going to cite code here, but conceptually it’s just a precomputed fixed-point thing: for each automaton state, compute which tokens are guaranteed from this point (“final weight”), subtract them from the weights you propagate forward, and you get a monotone decreasing set that hits empty quickly.
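Here is a runnable sketch of that fixed-point idea (all states and weights invented; a full version would also combine accepting weights at the bottom of the stack). Each automaton state carries a final weight of tokens guaranteed from that point; we emit those immediately, stop propagating them, and break when nothing undecided remains:

```python
ALL = 0b1111   # 4-token toy vocabulary

# (automaton_state, stack_state_id) -> [(next_state, weight_bitset)]
TRANS = {
    ("start", 7): [("mid", ALL)],
    ("mid", 0):   [("deep", 0b0100)],
}
# Tokens guaranteed valid once we reach a state, regardless of deeper stack.
FINAL = {"mid": 0b0011, "deep": 0b0100}

def get_mask(stack_ids):
    frontier = {"start": ALL}
    decided = FINAL.get("start", 0)
    steps = 0
    for sid in stack_ids:
        steps += 1
        new = {}
        for a_state, tokens in frontier.items():
            for a2, weight in TRANS.get((a_state, sid), []):
                t2 = tokens & weight
                decided |= t2 & FINAL.get(a2, 0)   # now guaranteed
                t2 &= ~FINAL.get(a2, 0)            # don't re-propagate them
                if t2:
                    new[a2] = new.get(a2, 0) | t2
        frontier = new
        if not frontier:
            break              # deeper stack can't change the result
    return decided, steps

# Tokens 0,1 are decided after one stack symbol, token 2 after two;
# nothing else is in flight, so we stop after 2 of the 4 symbols.
assert get_mask([7, 0, 9, 9]) == (0b0111, 2)
```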
(Section not present)
(Section not present)
For a single LR stack, the runtime looks like:
def get_mask(stack_state_ids_top_to_bottom):
# frontier: map automaton_state -> bitset(tokens)
frontier = { A.start: ALL_TOKENS }
decided = EMPTY # tokens already known valid from suffix read so far
for sid in stack_state_ids_top_to_bottom:
new = {}
for a_state, tokens in frontier.items():
for (a2, weight) in A.step(a_state, sid):
tokens2 = tokens & weight # intersection (filter)
if tokens2.any():
new[a2] = new.get(a2, EMPTY) | tokens2 # union (merge paths)
frontier = new
# Optionally accumulate “final/decided” contribution here and
# remove it from frontier tokens, so frontier only contains
# “still-undecided” tokens.
if not frontier: # nothing left that depends on deeper stack
break
return decided | combine_accepting(frontier)
This is the key operational point: the work per mask is bounded by how much of the stack actually matters, not by the stack’s depth.
(Section not present)
(Section not present)
For a GSS, you do basically the same computation, but over a graph rather than a single list.
Conceptually:
A token is valid if it’s valid on any stack path, so whenever GSS paths merge you union their token sets, and whenever automaton paths merge you union there too.
The important part is: it’s still the same “bitset flows through a graph via ∩ and ∪” pattern. No backtracking, no “try token t” loop.
(Section not present)
(Section not present)
Up to now I’ve treated “the weighted automaton” as magic.
It’s not magic, it’s just precomputation.
At a high level, you want an automaton that answers:
given the current lexer+parser state (represented by the stack/GSS), which LLM tokens could lead to a valid continuation?
There are two distinct problems mixed together:

- the lexing problem: which terminal sequences can an LLM token’s bytes produce, from the current lexer state?
- the parsing problem: which terminal sequences can the parser accept, from the current stack?

The thing I compile is basically a composition of two automata:

- a Terminal DWA for the lexing side, and
- a Parser DWA for the parsing side.
I’ll keep this section high-level (the details are where the compile-time pain is), but the decomposition matters because it explains why the runtime is so simple.
(Section not present)
(Section not present)
An LLM token is a byte string.
A grammar tokenizer/lexer consumes a stream of bytes and emits grammar terminals. Crucially:

- the lexer is stateful (it can be partway through a terminal), and
- the same bytes can lex differently depending on that state.
So the mapping “token → terminal sequence” is not a single fixed lookup. It depends on the current lexer state.
The Terminal DWA is the precomputed structure that answers:
from the current lexer state, which LLM tokens can produce which terminal next (and what lexer state do we end up in)?
A practical way to build it is: run the lexer automaton over a trie of the vocabulary’s byte strings, recording which tokens (as bitset weights) produce which terminals from each lexer state. Then determinize.
The key thing the Terminal DWA buys you is: it collapses “iterate over 200k tokens and run the lexer” into “traverse a small automaton state space and get bitsets.”
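A tiny sketch of what this buys you, with an invented toy lexer for terminals MINUS (-), DMINUS (--), NUM (1), and STR (""). It only checks whether a token's bytes are consumable (the real Terminal DWA also tracks which terminal sequence is produced), but it shows the shape: precompute per lexer state, then runtime is a table lookup instead of 200k lexer runs:

```python
# Toy lexer DFA: (state, char) -> state. States 1, 2, 3, 5 accept
# MINUS, DMINUS, NUM, STR respectively; 0 is the start state.
DFA = {(0, "-"): 1, (1, "-"): 2, (0, "1"): 3, (0, '"'): 4, (4, '"'): 5}
ACCEPT = {1, 2, 3, 5}
TOKENS = ["-", "--", "-1", '""']

def consumable(state, token, i=0):
    # Can the lexer, starting in `state`, consume all of the token's bytes,
    # possibly finishing terminals and restarting along the way?
    if i == len(token):
        return True
    nxt = DFA.get((state, token[i]))
    if nxt is not None and consumable(nxt, token, i + 1):
        return True
    # In an accepting state we may emit the terminal and restart.
    return state in ACCEPT and consumable(0, token, i)

# Precompute: lexer state -> bitmask of consumable tokens.
MASK_FROM = {s: sum(1 << i for i, t in enumerate(TOKENS) if consumable(s, t))
             for s in (0, 1, 2, 3, 4, 5)}

assert MASK_FROM[0] == 0b1111   # everything lexes from the start state
assert MASK_FROM[4] == 0b1000   # mid-string: only '""' can continue
```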
(Section not present)
(Section not present)
Now suppose the lexer says “the next terminal is a.” What does the LR parser do?

You can precompute, for each terminal a, an automaton that reads stack state IDs (top down) and represents all the ways the parser could legally process a, including the reduction chain.
Internally I represent the net stack transformation using a push/pop algebra (the polycyclic monoid is the clean formalism), but you don’t need to care about that to get the intuition: each reduction pops a handful of states and pushes one, a shift pushes one, and the whole chain nets out to “pop some suffix, push some replacement.”
This is the piece that takes “pointer-heavy GLR reduction simulation” off the runtime path and turns it into precomputed transitions.
(Section not present)
(Section not present)
Finally, you compose the Terminal DWA with the Parser DWA to get one deterministic weighted automaton that:

- reads the current lexer state and the stack’s state IDs once, and
- outputs the bitset of LLM tokens that are valid next.
This is the automaton you run in get_mask.
One subtle but very useful algebraic fact during composition: when you concatenate stack-effect templates, you get adjacent “push then pop” pairs that cancel. That lets you simplify aggressively while building the composed automaton (and prune impossible paths when pushes/pops mismatch).
Also: at runtime I don’t actually need the automaton to update the stack. commit is implemented by the real lexer+parser. get_mask just needs to know whether a token is valid, so the automaton can discard some “push” bookkeeping once it has done its filtering job.
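Here is a sketch of the push/pop cancellation (simplified; a real implementation works on automaton transitions, not plain lists). A stack effect is (pops, pushes); composing two effects cancels adjacent push-then-pop pairs and prunes mismatches:

```python
def compose(e1, e2):
    """Apply effect e1, then e2. pops are taken top-first; pushes end
    with the new top of stack. Returns None if the path is impossible."""
    pops1, pushes1 = list(e1[0]), list(e1[1])
    pops2, pushes2 = list(e2[0]), list(e2[1])
    while pushes1 and pops2:
        if pushes1[-1] != pops2[0]:
            return None            # push/pop mismatch: prune this path
        pushes1.pop()              # adjacent push-then-pop pair cancels
        pops2.pop(0)
    return (pops1 + pops2, pushes1 + pushes2)

# push A, then pop A: net effect is the identity
assert compose(([], ["A"]), (["A"], [])) == ([], [])
# push A, then pop B: impossible, pruned
assert compose(([], ["A"]), (["B"], [])) is None
# pop X & push A,B, then pop B & push C: net "pop X, push A,C"
assert compose((["X"], ["A", "B"]), (["B"], ["C"])) == (["X"], ["A", "C"])
```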
(Section not present)
(Section not present)
A question I always get is: “do you really scan the entire stack every time?”
No. In practice you only need a bounded suffix, and you can make that precise.
Two related reasons:

- LR stacks encode a lot in the top states: the valid-token decision is usually determined by a short suffix.
- The early-exit construction (final weights subtracted from outgoing transitions) detects exactly when deeper symbols can’t change the result.
In implementation terms: masks stabilize quickly, and you can early-exit.
(Section not present)
(Section not present)
The term “token” is overloaded:

- LLM tokens: byte strings from the model’s vocabulary.
- Grammar terminals: lexer output (IDENT, "{", NUMBER, keywords, etc.).

These tokenizations don’t align.
Even if your lexer uses longest-match, longest-match is inherently forward-looking: you can’t know a match is final until you see what comes next.
Classic example:

"+" followed by "+" should yield INCREMENT ("++"), not two PLUS. In a streaming setting, after you see the first "+", you’re in a state where:

- the terminal could finalize as PLUS, or
- extend into INCREMENT.

So you have to represent that uncertainty somehow.
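A toy longest-match lexer over PLUS (+) and INCREMENT (++) makes the lookahead dependence visible (names invented):

```python
def lex(src):
    # Greedy longest-match over the terminals "+" and "++".
    out, i = [], 0
    while i < len(src):
        if src.startswith("++", i):
            out.append("INCREMENT"); i += 2
        elif src[i] == "+":
            out.append("PLUS"); i += 1
        else:
            raise ValueError(src[i])
    return out

assert lex("+") == ["PLUS"]
assert lex("++") == ["INCREMENT"]
assert lex("+++") == ["INCREMENT", "PLUS"]
```

After streaming just the first "+" of "++", neither PLUS nor INCREMENT is committed yet; the constraint state has to carry both possibilities.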
The approach I use is what I call inhibited terminals (this is just a convenient operational trick, not a deep theory term):
This plays nicely with GLR+GSS, because “fork and prune” is already the parser’s native move.
Practically, the constraint state you carry around is not just “parser stack(s)”; it’s “(parser stack(s), lexer state(s))”. get_mask needs to condition on both, which is why the Parser DWA has multiple initial states (one per active lexer state).
(Section not present)
(Section not present)
At runtime you now have two different kinds of work, with very different performance profiles:
(Section not present)
commit(state, token) -> state

(Section not present)
This is the inherently “CFG-y” part. It can still have pathological cases (that’s life), but you at least get decades of parsing engineering behind it.
(Section not present)
get_mask(state) -> mask

(Section not present)
This is the performance-critical part, and it’s exactly where precomputation buys you predictability.
(Section not present)
(Section not present)
For incremental CFG parsing with ambiguity, GLR with a graph-structured stack is a pretty hard baseline to beat. There’s a reason people working on these problems keep converging on GLR-like techniques: table-driven decisions, no backtracking, and compact ambiguity representation via the GSS.
You still inherit GLR’s lack of comforting worst-case bounds in “fully adversarial ambiguity” settings. In theory, GLR can degrade badly on highly ambiguous grammars/inputs. In practice, for “programming-language-ish” grammars with reasonable disambiguation, it behaves well.
(Section not present)
(Section not present)
The weighted-automaton mask computation does the minimum work you’d reasonably hope for: read the stack once, never backtrack, stop as soon as the mask is determined.
In other words: the runtime work scales with the size of the parse configuration, not with vocabulary size and not with “how gnarly are the tokens.”
(Section not present)
(Section not present)
(Section not present)
Grammar-constrained generation has two very different problems hiding under one name:
- commit(token): update the constraint state after you sampled a token. This is incremental parsing.
- get_mask(state): given the current constraint state, return the set of next LLM tokens that keep you valid. This is the hard part on the critical path.

For commit, incremental parsing techniques (LR/GLR + a graph-structured stack) are a good fit.
For get_mask, the usual “simulate candidate tokens through the parser” approach ties runtime to vocabulary size and tokenization ambiguity in ways that are brutal for p99.9 latency.
The idea I want to put on the record:
Invert the problem. Precompile the grammar+tokenizer into a bitset-weighted automaton whose input is the parser stack (or GSS). Then
get_mask becomes a single pass that reads stack state IDs and does bitset ∩/∪ operations—no per-vocabulary-token simulation.
(Section not present)
(Section not present)
You have a grammar (often “JSON Schema”, though in practice this means the structural subset you can compile into something CFG-like), and you want the LLM to only emit valid outputs.
At each decoding step the model produces logits over the vocabulary V. You sample a token. You append it to the output. Repeat.
Constrained decoding is: don’t sample tokens that would make the output invalid.
So we maintain a constraint state and, for each step, compute a mask of allowed tokens.
No more broken JSON—at least syntactically.
(Section not present)
(Section not present)
Inference servers batch many sequences on the GPU. The CPU-side mask has to arrive “in time” for each decode step, across the whole batch.
If the GPU step latency is ~1ms in your throughput regime, then you need your mask computation to reliably finish within that same window for every active sequence. A single slow mask can stall the whole batch, which cascades into ugly scheduling behavior.
So it’s not “can you do masking fast on average?” It’s “can you do it fast every single step?”
That’s the lens for everything below.
commit: incremental parsing is a good mental model

(Section not present)
(Section not present)
When you call commit(state, token), you are incrementally parsing a growing prefix. This is a well-studied problem.
(Section not present)
(Section not present)
There are three different “alphabets” floating around:

- bytes,
- LLM tokens (byte strings),
- grammar terminals (--, Identifier, etc.).

Also: LR parser state IDs, which are the alphabet the masking automaton reads.
(Section not present)
(Section not present)
If your constraints have nondeterminism (common once you compile real schema-ish structures, alternatives, optionality, “oneOf”-style branching, etc.), you often end up with “multiple parses remain viable so far.”
The GLR family’s core trick is: don’t backtrack, represent the ambiguity compactly. You keep a set of stack “heads” in a shared graph.
This tends to work well for commit because:

- per-token work is table-driven and local to the stack tops, and
- ambiguity is shared in the GSS instead of duplicated across stacks.
So far so good.
get_mask: this is where things get ugly

(Section not present)
(Section not present)
Now you have a parse state (often a GSS), and you want a mask over LLM tokens.
A straightforward approach is:
For each LLM token t, check whether appending it keeps the parse valid.
But you don’t actually append “an LLM token” to the parser. You append terminal symbols: possibly several per token, with a segmentation that depends on the current lexer state.

Doing that per vocabulary token is a non-starter.
(Section not present)
(Section not present)
Here’s the precise point I want to make with the “112 dashes” anecdote:
Even if you don’t like the specific tokenizer example, incremental lexing can be ambiguous, and ambiguity can explode.
In a toy lexer where the only terminals are - and --, a run of N dashes has Fib(N+1) segmentations (Fibonacci growth). For N = 112, that’s on the order of 10^23 ways to segment.
Why does this matter to masking?
Because the naive masking algorithm effectively asks:
“Is there any way to segment this LLM token’s bytes into terminals such that the parser can consume them from the current state?”
If you handle “token → terminal sequence(s)” by exploring a tokenization trellis, then in the worst case you’re exploring an exponential number of segmentations per candidate LLM token.
Most practical systems avoid the full explosion with tries/trellises and memoization. That helps. But it doesn’t change the fundamental shape of the problem: you’re still trying to answer a question about all 200k tokens by simulating “what if we took this token?”
(Section not present)
(Section not present)
The key reframing is:
Token validity depends on the parse stack(s).
So instead of asking “does token t work on this stack?”, ask “given this stack, which tokens work?”

That sounds tautological, but it changes the engineering shape: per-step work now scales with the parse state instead of with the vocabulary.

Now we need a data structure that turns “stack → allowed tokens” into something we can execute fast.
That’s where the weighted automaton comes in.
(Section not present)
(Section not present)
(Empty section)
(Section not present)
(Section not present)
Think of a finite automaton, except each transition carries a weight and traversing paths combines weights.
In my setting:

- weights are bitsets over the LLM vocabulary,
- intersection applies along a path,
- union applies where paths merge.

This is exactly the “semiring” you’d expect on sets:

- “multiplication” is intersection,
- “addition” is union,
- “one” is the full set, “zero” is the empty set.
(Section not present)
(Section not present)
The automaton’s input alphabet is the set of LR parser state IDs.
The automaton reads (a representation of) your current parse configuration: the stack’s state IDs, top down, plus the active lexer state.
So the automaton is not replacing the parser. It’s a precompiled masking machine:
Input: current stack / GSS (state IDs)
Output: bitset mask over LLM tokens valid next
(Section not present)
(Section not present)
Another way to see it (which helped me): for each LLM token t, consider the set of parser stacks on which t is valid next.

That set of stacks is something you can represent with a finite automaton over stack state IDs (a “stack language” of sorts).

If you did that for every t, you’d have 200k automata. That’s useless at runtime.
A weighted automaton is how you smash those 200k membership tests into one run:
(Section not present)
(Section not present)
Mask computation becomes: feed the stack/GSS state IDs through the automaton once, intersecting weights along paths and unioning where paths merge.

Critically: there is no loop over the vocabulary, and no per-token parser simulation.
The runtime is driven by what the parse state actually contains.
And you can early-exit when further stack depth cannot change the result (the decided(frontier) hook above), which is common if the automaton reaches a fixed point.
(Section not present)
(Section not present)
Putting it together:
- commit(token) uses GLR machinery: shift, reduce, and merge in the GSS, once, for the chosen token.
- get_mask(state) uses the weighted automaton: one pass over the stack/GSS with bitset ∩/∪.
So you get a split where:
- the CFG-heavy work happens once per emitted token (commit),
- the per-step critical-path work is a predictable automaton pass (get_mask).

(Section not present)
(Section not present)
I’m avoiding “this is optimal” as a theorem, but the shape matches what I want in production: per-step cost that is predictable and driven by the parse state, not by vocabulary size or tokenization.
You still have worst cases (deep stacks exist), but it’s a worst case that comes from the actual output structure (nesting depth), not from adversarial vocabulary/tokenization effects.
(Section not present)
(Section not present)
“JSON Schema” caveat: full JSON Schema includes semantic constraints (ranges, regexes, uniqueItems, etc.). This post is about the grammar-shaped core. You can layer semantic predicates on top, but that’s a separate axis.
Fast runtime, slow compile: precompiling into a weighted automaton involves determinization/minimization-like steps (in a semiring setting) and can blow up if you’re careless. Most of my engineering time went into making compile-time and memory behave on large grammars. That’s a follow-up.
Memory is real: a 200k-vocab bitset is ~25KB. You cannot naively hang a fresh 25KB bitset off every transition. You need sharing/interning/chunking/sparsity tricks. Again: follow-up.
This doesn’t magically remove ambiguity: GLR ambiguity still exists; the goal is to make mask computation robust in the presence of it.
(Section not present)
(Section not present)
Most constrained decoding systems I’ve seen in the LLM ecosystem do some variant of: walk a trie of the vocabulary while simulating the constraint (lexer/parser or regex engine), with pruning and memoization.
That can be great for many use cases (especially regular/near-regular constraints like JSON).
The approach I’m describing is different in the “mask” half: get_mask is stack-driven rather than vocabulary-/trie-driven.

I’m not claiming other approaches are “wrong.” I’m saying: if you care about strict worst-case deadlines in a batched inference setting, you want a get_mask whose cost is not dominated by vocabulary size or tokenization pathologies.
(Section not present)
(Section not present)
Constrained generation is often sold as “just mask invalid tokens.” That hides the real split:
- The state-update half (commit) is relatively well served by GLR/GSS techniques.
- The mask half (get_mask) is the true bottleneck if you care about p99.9 latency at scale.

My contribution/claim here is the reframing and the data structure:
Precompile the grammar+tokenizer into a bitset-weighted automaton over parser stack state IDs, and compute masks by reading the stack/GSS once with ∩/∪ bitset propagation.
That’s the idea I wanted to put on the record. If there’s interest, the next post would be about the “slow compile” side: how to build/determinize/minimize these automata without exponential blowups, and how to represent the weights without storing a 25KB bitset per edge.