# Parsing Model Output

The `common` library contains a PEG parser implementation suitable for parsing
model output.

Types with the prefix `common_peg_*` are intended for general use and may have
applications beyond parsing model output, such as parsing user-provided regex
patterns.

Types with the prefix `common_chat_peg_*` are specialized helpers for model
output.

The parser features:

- Partial parsing of streaming input
- Built-in JSON parsers
- AST generation with semantics via "tagged" nodes

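To make "partial parsing" concrete, here is a minimal standalone sketch of a literal matcher that distinguishes a definite failure from input that simply has not finished streaming yet. It is not the library's implementation: `match_status` and `match_literal` are hypothetical names chosen for illustration, and the real `common_peg_*` types may represent these outcomes differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical three-way result; the real common_peg_* types may differ.
enum class match_status { success, fail, need_more_input };

// Match `lit` at position `pos` of `input`. `is_partial` states whether more
// input may still arrive, i.e. the model is still generating.
match_status match_literal(const std::string & input, std::size_t pos,
                           const std::string & lit, bool is_partial) {
    assert(pos <= input.size());
    for (std::size_t i = 0; i < lit.size(); ++i) {
        if (pos + i >= input.size()) {
            // Ran out of input before the literal was complete.
            return is_partial ? match_status::need_more_input : match_status::fail;
        }
        if (input[pos + i] != lit[i]) {
            return match_status::fail;
        }
    }
    return match_status::success;
}
```

With this sketch, matching `<tool_call>` against the streamed prefix `<tool_` yields `need_more_input` while the stream is open, and `fail` once the input is known to be complete.
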
## Example

Below is a contrived example demonstrating how to use the PEG parser to parse
output from a model that emits arguments as JSON.

```cpp
auto parser = build_chat_peg_native_parser([&](common_chat_peg_native_builder & p) {
    // Build a choice of all available tools
    auto tool_choice = p.choice();
    for (const auto & tool : tools) {
        const auto & function = tool.at("function");
        std::string name = function.at("name");
        const auto & schema = function.at("parameters");

        auto tool_name = p.json_member("name", "\"" + p.literal(name) + "\"");
        auto tool_args = p.json_member("arguments", p.schema(p.json(), "tool-" + name + "-schema", schema));

        tool_choice |= p.rule("tool-" + name, "{" << tool_name << "," << tool_args << "}");
    }

    // Define the tool call structure: <tool_call>[{tool}]</tool_call>
    auto tool_call = p.trigger_rule("tool-call",
        p.sequence({
            p.literal("<tool_call>["),
            tool_choice,
            p.literal("]</tool_call>")
        })
    );

    // Parser accepts content, optionally followed by a tool call
    return p.sequence({
        p.content(p.until("<tool_call>")),
        p.optional(tool_call),
        p.end()
    });
});
```

For a more complete example, see `test_example_native()` in
[tests/test-chat-peg-parser.cpp](/tests/test-chat-peg-parser.cpp).

## Parsers/Combinators

### Basic Matchers

- **`eps()`** - Matches nothing and always succeeds (epsilon/empty match)
- **`start()`** - Matches the start of input (anchor `^`)
- **`end()`** - Matches the end of input (anchor `$`)
- **`literal(string)`** - Matches an exact literal string
- **`any()`** - Matches any single character (`.`)

### Combinators

- **`sequence(...)`** - Matches parsers in order; all must succeed
- **`choice(...)`** - Matches the first parser that succeeds from alternatives (ordered choice)
- **`one_or_more(p)`** - Matches one or more repetitions (`+`)
- **`zero_or_more(p)`** - Matches zero or more repetitions (`*`)
- **`optional(p)`** - Matches zero or one occurrence (`?`)
- **`repeat(p, min, max)`** - Matches between min and max repetitions (use `-1` for unbounded)
- **`repeat(p, n)`** - Matches exactly n repetitions

### Lookahead

- **`peek(p)`** - Positive lookahead: succeeds if parser succeeds without consuming input (`&`)
- **`negate(p)`** - Negative lookahead: succeeds if parser fails without consuming input (`!`)

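The semantics of `sequence`, ordered `choice`, `peek`, and `negate` can be illustrated with a small standalone sketch. The `toy_*` names below are hypothetical and deliberately unrelated to the `common_peg_*` API; each toy parser simply returns the end position on success or `std::nullopt` on failure.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Toy parser type (hypothetical, not the common_peg_* API).
using toy_result = std::optional<std::size_t>;
using toy_parser = std::function<toy_result(const std::string &, std::size_t)>;

// Matches an exact string at the current position.
toy_parser toy_literal(const std::string & lit) {
    return [lit](const std::string & in, std::size_t pos) -> toy_result {
        assert(pos <= in.size());
        if (in.compare(pos, lit.size(), lit) == 0) {
            return pos + lit.size();
        }
        return std::nullopt;
    };
}

// sequence: every parser must succeed, each starting where the previous ended.
toy_parser toy_sequence(std::vector<toy_parser> parsers) {
    return [parsers](const std::string & in, std::size_t pos) -> toy_result {
        for (const auto & p : parsers) {
            toy_result next = p(in, pos);
            if (!next) {
                return std::nullopt;
            }
            pos = *next;
        }
        return pos;
    };
}

// choice: ordered; the first alternative that succeeds wins.
toy_parser toy_choice(std::vector<toy_parser> parsers) {
    return [parsers](const std::string & in, std::size_t pos) -> toy_result {
        for (const auto & p : parsers) {
            if (toy_result next = p(in, pos)) {
                return next;
            }
        }
        return std::nullopt;
    };
}

// peek: succeeds if p succeeds, but consumes nothing.
toy_parser toy_peek(toy_parser p) {
    return [p](const std::string & in, std::size_t pos) -> toy_result {
        if (p(in, pos)) {
            return pos;
        }
        return std::nullopt;
    };
}

// negate: succeeds if p fails, and consumes nothing.
toy_parser toy_negate(toy_parser p) {
    return [p](const std::string & in, std::size_t pos) -> toy_result {
        if (p(in, pos)) {
            return std::nullopt;
        }
        return pos;
    };
}
```

Because the choice is ordered, `toy_choice({toy_literal("a"), toy_literal("ab")})` applied to `"ab"` stops after `"a"` and never tries the longer alternative. Placing the longer alternative first is the usual remedy in PEG-style grammars.
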
### Character Classes & Utilities

- **`chars(classes, min, max)`** - Matches repetitions of characters from a character class
- **`space()`** - Matches zero or more whitespace characters (space, tab, newline)
- **`until(delimiter)`** - Matches characters until delimiter is found (delimiter not consumed)
- **`until_one_of(delimiters)`** - Matches characters until any delimiter in the list is found
- **`rest()`** - Matches everything remaining (`.*`)

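As a rough illustration of the `until` family, the sketch below returns the position where matching would stop. The `toy_until` and `toy_until_one_of` helpers are hypothetical, and the sketch assumes that the match runs to the end of the input when no delimiter is present, which is what the content-only case in the earlier example relies on.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the end of the match: the index of the delimiter (which is not
// consumed), or the end of the input if the delimiter never appears.
std::size_t toy_until(const std::string & in, std::size_t pos, const std::string & delim) {
    assert(pos <= in.size());
    std::size_t found = in.find(delim, pos);
    return found == std::string::npos ? in.size() : found;
}

// Same idea with several delimiters: stop at whichever occurs earliest.
std::size_t toy_until_one_of(const std::string & in, std::size_t pos,
                             const std::vector<std::string> & delims) {
    assert(pos <= in.size());
    std::size_t best = in.size();
    for (const auto & d : delims) {
        std::size_t found = in.find(d, pos);
        if (found != std::string::npos && found < best) {
            best = found;
        }
    }
    return best;
}
```
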
### JSON Parsers

- **`json()`** - Complete JSON parser (objects, arrays, strings, numbers, booleans, null)
- **`json_object()`** - JSON object parser
- **`json_array()`** - JSON array parser
- **`json_string()`** - JSON string parser
- **`json_number()`** - JSON number parser
- **`json_bool()`** - JSON boolean parser
- **`json_null()`** - JSON null parser
- **`json_string_content()`** - JSON string content without surrounding quotes
- **`json_member(key, p)`** - JSON object member with specific key and value parser

### Grammar Building

- **`ref(name)`** - Creates a lightweight reference to a named rule (for recursive grammars)
- **`rule(name, p, trigger)`** - Creates a named rule and returns a reference
- **`trigger_rule(name, p)`** - Creates a trigger rule (entry point for lazy grammar generation)
- **`schema(p, name, schema, raw)`** - Wraps parser with JSON schema metadata for grammar generation

### AST Control

- **`atomic(p)`** - Prevents AST node creation for partial parses
- **`tag(tag, p)`** - Creates AST nodes with semantic tags (multiple nodes can share tags)

## GBNF Grammar Generation

The PEG parser also acts as a convenient DSL for generating GBNF grammars, with
some exceptions.

```cpp
data.grammar = build_grammar([&](const common_grammar_builder & builder) {
    foreach_function(params.tools, [&](const json & fn) {
        builder.resolve_refs(fn.at("parameters"));
    });
    parser.build_grammar(builder, data.grammar_lazy);
});
```

The notable exception is the `negate(p)` lookahead parser, which cannot be
expressed in a context-free grammar (CFG) and therefore does not produce a
rule. Its usage should be limited and preferably hidden behind a `schema()`
parser. In many cases, `until(delimiter)` or `until_one_of(delimiters)` is a
better choice.

Another limitation is that the PEG parser requires an unambiguous grammar. In
contrast, the `llama-grammar` implementation can support ambiguous grammars,
though they are difficult to parse.

### Lazy Grammars

During lazy grammar generation, only rules reachable from a
`trigger_rule(name, p)` are emitted in the grammar. All trigger rules are added
as alternations in the root rule. It is still necessary to define trigger
patterns, as the parser has no interaction with the grammar sampling.

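The reachability rule for lazy grammars can be pictured as a graph walk that starts at the trigger rules. The following is a sketch under stated assumptions, not the library's code: `rule_refs` and `reachable_rules` are hypothetical, and the rule table is simplified to a map from each rule name to the names it references.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical rule table: rule name -> names of the rules it references.
using rule_refs = std::map<std::string, std::vector<std::string>>;

// Collect every defined rule reachable from the given trigger rules.
std::set<std::string> reachable_rules(const rule_refs & rules,
                                      const std::vector<std::string> & triggers) {
    std::set<std::string> seen;
    std::vector<std::string> stack(triggers.begin(), triggers.end());
    while (!stack.empty()) {
        std::string name = stack.back();
        stack.pop_back();
        auto it = rules.find(name);
        if (it == rules.end()) {
            continue; // undefined reference; ignored in this sketch
        }
        if (!seen.insert(name).second) {
            continue; // already visited
        }
        for (const auto & ref : it->second) {
            stack.push_back(ref);
        }
    }
    return seen;
}
```

In this model, a `content` rule that is only referenced from the root would be left out of the lazy grammar, while `tool-call` and everything beneath it would be kept.
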
### JSON Schema

The `schema(p, name, schema, raw)` parser will use the `json-schema-to-grammar`
implementation to generate the grammar instead of the underlying parser.

The `raw` option emits a grammar suitable for a raw string instead of a JSON
string. In other words, it won't be wrapped in quotes or require escaping
quotes. It should only be used when `type == "string"`.

The downside of `raw` is that it can lead to ambiguous grammars. For example,
if a user provides the pattern `^.*$`, the following grammar may be generated:

```
root ::= "<arg>" .* "</arg>"
```

This creates an ambiguous grammar that cannot be parsed by the PEG parser. To
help mitigate this, if `.*` is found in the pattern, the grammar from the
underlying parser will be emitted instead.

## Common AST Shapes for Chat Parsing

Most model output can be placed in one of the following categories:

- Content only
- Tool calling with arguments emitted as a single JSON object
- Tool calling with arguments emitted as separate entities, either XML
  (Qwen3-Coder, MiniMax M2) or pseudo-function calls (LFM2)

To provide broad coverage,
[`common/chat-peg-parser.h`](/common/chat-peg-parser.h) contains builders and
mappers that help create parsers and visitors/extractors for these types. They
require parsers to tag nodes to conform to an AST "shape". This normalization
makes it easy to extract information and generalize parsing.

### Simple

The `common_chat_peg_builder` builds a `simple` parser that supports
content-only models with optional reasoning.

- **`reasoning(p)`** - Tag node for extracting `reasoning_content`
- **`content(p)`** - Tag node for extracting `content`

```cpp
build_chat_peg_parser([&](common_chat_peg_builder & p) {
    return p.sequence({
        p.optional("<think>" + p.reasoning(p.until("</think>")) + "</think>"),
        p.content(p.until("<tool_call>")),
        p.end()
    });
});
```

Use `common_chat_peg_mapper` to extract the content. Note that this is already
done for you in `common_chat_peg_parser` when
`chat_format == COMMON_CHAT_FORMAT_PEG_SIMPLE`.

```cpp
auto result = parser.parse(ctx);

common_chat_msg msg;
auto mapper = common_chat_peg_mapper(msg);
mapper.from_ast(ctx.ast, result);
```

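Conceptually, a mapper walks the tagged nodes and copies their text into the matching message field. The sketch below shows that idea in isolation. It is a simplified stand-in rather than `common_chat_peg_mapper` itself: `toy_node`, `toy_msg`, and `toy_map_nodes` are hypothetical, the AST is flattened to a list, and the tag strings `"reasoning"` and `"content"` are assumed for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical flattened AST node: a semantic tag plus the text it matched.
struct toy_node {
    std::string tag;
    std::string text;
};

// Hypothetical message with the two fields used by the simple shape.
struct toy_msg {
    std::string reasoning_content;
    std::string content;
};

// Copy the text of each tagged node into the corresponding field. Several
// nodes may share a tag, so the text is appended rather than overwritten.
toy_msg toy_map_nodes(const std::vector<toy_node> & nodes) {
    toy_msg msg;
    for (const auto & node : nodes) {
        if (node.tag == "reasoning") {
            msg.reasoning_content += node.text;
        } else if (node.tag == "content") {
            msg.content += node.text;
        }
        // Untagged nodes (literals such as "</think>") carry no semantics.
    }
    return msg;
}
```
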
### Native

The `common_chat_peg_native_builder` builds a `native` parser suitable for
models that emit tool arguments as a direct JSON object.

- **`reasoning(p)`** - Tag node for `reasoning_content`
- **`content(p)`** - Tag node for `content`
- **`tool(p)`** - Tag entirety of a single tool call
- **`tool_open(p)`** - Tag start of a tool call
- **`tool_close(p)`** - Tag end of a tool call
- **`tool_id(p)`** - Tag the tool call ID (optional)
- **`tool_name(p)`** - Tag the tool name
- **`tool_args(p)`** - Tag the tool arguments

```cpp
build_chat_peg_native_parser([&](common_chat_peg_native_builder & p) {
    auto get_weather_tool = p.tool(p.sequence({
        p.tool_open(p.literal("{")),
        p.json_member("name", "\"" + p.tool_name(p.literal("get_weather")) + "\""),
        p.literal(","),
        p.json_member("arguments", p.tool_args(p.json())),
        p.tool_close(p.literal("}"))
    }));

    return p.sequence({
        p.content(p.until("<tool_call>")),
        p.literal("<tool_call>"),
        get_weather_tool,
        p.literal("</tool_call>"),
        p.end()
    });
});
```

### Constructed

The `common_chat_peg_constructed_builder` builds a `constructed` parser
suitable for models that emit tool arguments as separate entities, such as XML
tags.

- **`reasoning(p)`** - Tag node for `reasoning_content`
- **`content(p)`** - Tag node for `content`
- **`tool(p)`** - Tag entirety of a single tool call
- **`tool_open(p)`** - Tag start of a tool call
- **`tool_close(p)`** - Tag end of a tool call
- **`tool_name(p)`** - Tag the tool name
- **`tool_arg(p)`** - Tag a complete tool argument (name + value)
- **`tool_arg_open(p)`** - Tag start of a tool argument
- **`tool_arg_close(p)`** - Tag end of a tool argument
- **`tool_arg_name(p)`** - Tag the argument name
- **`tool_arg_string_value(p)`** - Tag string value for the argument
- **`tool_arg_json_value(p)`** - Tag JSON value for the argument

```cpp
build_chat_peg_constructed_parser([&](common_chat_peg_constructed_builder & p) {
    // tool_arg(p) takes a single parser, so the parts are grouped in a sequence
    auto location_arg = p.tool_arg(p.sequence({
        p.tool_arg_open("<parameter name=\"" + p.tool_arg_name(p.literal("location")) + "\">"),
        p.tool_arg_string_value(p.until("</parameter>")),
        p.tool_arg_close(p.literal("</parameter>"))
    }));

    auto get_weather_tool = p.tool(p.sequence({
        p.tool_open("<function name=\"" + p.tool_name(p.literal("get_weather")) + "\">"),
        location_arg,
        p.tool_close(p.literal("</function>"))
    }));

    return p.sequence({
        p.content(p.until("<tool_call>")),
        p.literal("<tool_call>"),
        get_weather_tool,
        p.literal("</tool_call>"),
        p.end()
    });
});
```
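
The distinction between `tool_arg_string_value` and `tool_arg_json_value` matters when the separately emitted arguments are reassembled into a single JSON object. The sketch below shows one plausible way to do that and is not the library's mapper: `tagged_text`, `toy_tool_call`, `escape_json`, and `build_tool_call` are hypothetical, the AST is flattened to (tag, text) pairs, and the tag strings are assumed to mirror the builder method names. String values are quoted and escaped, while JSON values are copied through unchanged.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical flattened AST: (tag, matched text) pairs in document order.
using tagged_text = std::pair<std::string, std::string>;

struct toy_tool_call {
    std::string name;
    std::string arguments; // serialized JSON object
};

// Escapes only quotes and backslashes; real code should use a JSON library.
std::string escape_json(const std::string & s) {
    std::string out;
    for (char c : s) {
        if (c == '"' || c == '\\') {
            out += '\\';
        }
        out += c;
    }
    return out;
}

toy_tool_call build_tool_call(const std::vector<tagged_text> & nodes) {
    toy_tool_call call;
    std::string pending_name; // most recent tool_arg_name
    std::string members;      // comma-separated "key":value pairs
    for (const auto & node : nodes) {
        const std::string & tag  = node.first;
        const std::string & text = node.second;
        if (tag == "tool_name") {
            call.name = text;
        } else if (tag == "tool_arg_name") {
            pending_name = text;
        } else if (tag == "tool_arg_string_value" || tag == "tool_arg_json_value") {
            if (!members.empty()) {
                members += ",";
            }
            members += "\"" + escape_json(pending_name) + "\":";
            if (tag == "tool_arg_string_value") {
                members += "\"" + escape_json(text) + "\""; // raw text becomes a JSON string
            } else {
                members += text; // already valid JSON, copied as-is
            }
        }
    }
    call.arguments = "{" + members + "}";
    return call;
}
```

For the `get_weather` example above, a `location` argument with the raw text `Paris` would be serialized as `{"location":"Paris"}` under these assumptions.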