name: structured-output-extraction
description: Use when extracting structured, typed data from unstructured text with an LLM (parsing documents, classifying, pulling fields) and the downstream code needs a guaranteed shape.
Structured Output Extraction
When to trigger
The task is "read this text and give me these fields as data": parsing an email into a ticket, pulling line items from an invoice, classifying a message into an enum.
Do not parse free text
Asking for JSON in the prompt and then parsing the reply is the fragile path. The model may wrap the JSON in prose, add a trailing comment, or use the wrong key. Instead:
- Use the provider's structured-output or tool-calling mode so the model is constrained to your schema.
- Define the schema once and share it between the model call and your validator (for example a zod schema you also convert to the tool input schema).
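
One way to picture "define the schema once" is the minimal sketch below. It is dependency-free so it runs as-is; in a real project a library such as zod would normally play this role. The names `FieldSpec`, `ticketSpec`, `toToolInputSchema`, and `validate` are hypothetical, invented for this illustration, and the output is only JSON-Schema-shaped, not a full JSON Schema implementation.

```typescript
// Hypothetical field spec: the single source of truth for one extraction task.
type FieldSpec = {
  type: "string" | "number";
  required: boolean;
  description: string;
  enum?: readonly string[];
};

// Illustrative schema for "parse an email into a ticket" (field names are assumptions).
const ticketSpec: Record<string, FieldSpec> = {
  title: { type: "string", required: true, description: "Short summary of the request." },
  priority: {
    type: "string",
    required: true,
    description: "Urgency. Must be exactly one of: low, medium, high.",
    enum: ["low", "medium", "high"],
  },
  orderId: { type: "string", required: false, description: "Order number, if one is mentioned." },
};

// Derive a JSON-Schema-shaped object to hand to the provider as the tool input schema.
function toToolInputSchema(spec: Record<string, FieldSpec>) {
  const properties: Record<string, object> = {};
  const required: string[] = [];
  for (const name of Object.keys(spec)) {
    const field = spec[name];
    properties[name] = {
      type: field.type,
      description: field.description,
      ...(field.enum ? { enum: field.enum.slice() } : {}),
    };
    if (field.required) required.push(name);
  }
  return { type: "object", properties, required, additionalProperties: false };
}

// Validate a model result against the same spec. Returns error messages; empty means valid.
function validate(spec: Record<string, FieldSpec>, value: unknown): string[] {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return ["result is not an object"];
  }
  const record = value as Record<string, unknown>;
  const errors: string[] = [];
  for (const name of Object.keys(spec)) {
    const field = spec[name];
    const v = record[name];
    if (v === undefined) {
      if (field.required) errors.push(`${name}: missing required field`);
      continue;
    }
    if (typeof v !== field.type) {
      errors.push(`${name}: expected ${field.type}, got ${v === null ? "null" : typeof v}`);
    } else if (field.enum && field.enum.indexOf(v as string) === -1) {
      errors.push(`${name}: "${v}" is not one of ${field.enum.join(", ")}`);
    }
  }
  for (const key of Object.keys(record)) {
    if (!Object.prototype.hasOwnProperty.call(spec, key)) errors.push(`${key}: unexpected key`);
  }
  return errors;
}
```

Because both functions read the same `ticketSpec`, adding a field or an enum value changes what the model is asked for and what the validator accepts in one place.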
Steps
- Write the schema as the source of truth: mark required fields as required, use enums for closed sets, and put descriptions on ambiguous fields so the model fills them correctly.
- Call the model in structured mode with that schema.
- Validate the result against the same schema. A validation failure is a signal, not a surprise.
- On failure, run one repair pass: send the model its own invalid output and the validation error and ask it to fix the shape. Cap repairs at one or two, then fall back.
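
The validate-then-repair steps above can be sketched as a small loop. This is an illustration under stated assumptions, not a provider API: `callModel` is a hypothetical function you would implement with your provider's structured-output or tool-calling mode, `validate` is assumed to return a list of error messages (empty when valid), and the prompt wording is a placeholder.

```typescript
// Hypothetical model call: takes a prompt and returns the structured-mode output.
type ModelCall = (prompt: string) => Promise<unknown>;

// Either validated data, or a fallback record for a human or a default handler.
type Outcome<T> =
  | { status: "ok"; data: T; repairs: number }
  | { status: "fallback"; lastOutput: unknown; errors: string[] };

async function extractWithRepair<T>(
  text: string,
  callModel: ModelCall,
  validate: (value: unknown) => string[], // empty array means the value matches the schema
  maxRepairs = 1, // cap at one or two repair passes
): Promise<Outcome<T>> {
  let output = await callModel(`Extract the fields from this text:\n${text}`);
  let errors = validate(output);
  let repairs = 0;

  while (errors.length > 0 && repairs < maxRepairs) {
    repairs++;
    // Send the model its own invalid output plus the validation errors.
    const repairPrompt = [
      "Your previous output did not match the schema.",
      `Previous output: ${JSON.stringify(output)}`,
      `Validation errors: ${errors.join("; ")}`,
      "Return a corrected result with the same content and a valid shape.",
    ].join("\n");
    output = await callModel(repairPrompt);
    errors = validate(output);
  }

  if (errors.length === 0) {
    // The cast is safe only to the extent that validate() fully checks the shape of T.
    return { status: "ok", data: output as T, repairs };
  }
  return { status: "fallback", lastOutput: output, errors };
}
```

The loop makes at most `1 + maxRepairs` model calls, so genuinely unparseable input ends in a `fallback` outcome instead of spinning.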
Pitfalls
- Enums drift: the model returns "In Progress" when the schema wants "in_progress". Put the exact allowed values in the field description and validate.
- Confusing optional and required fields produces nulls the downstream code did not expect. Mark required fields as required in the schema, not just in the prompt.
- An unbounded repair loop can spin on genuinely unparseable input. Cap the retries and route the leftover to a human or a fallback record.
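
The first two pitfalls can be guarded against with a few lines. In this sketch the field description is generated from the same list the validator checks, and a required field rejects both missing and null values. The names `STATUS_VALUES`, `statusField`, `isStatus`, and `requireStatus` are hypothetical, chosen for this example.

```typescript
// Closed set of allowed values, written once (illustrative values).
const STATUS_VALUES = ["open", "in_progress", "closed"] as const;
type Status = (typeof STATUS_VALUES)[number];

// The description is built from the same list, so prompt text and validation cannot drift apart.
const statusField = {
  type: "string",
  enum: STATUS_VALUES,
  description: `Ticket status. Must be exactly one of: ${STATUS_VALUES.join(", ")}.`,
};

// Exact-match check: "In Progress" fails, "in_progress" passes.
function isStatus(value: unknown): value is Status {
  return typeof value === "string" && (STATUS_VALUES as readonly string[]).indexOf(value) !== -1;
}

// Required means present and non-null: stop nulls here instead of passing them downstream.
function requireStatus(record: { status?: unknown }): Status {
  const value = record.status;
  if (value === null || value === undefined) {
    throw new Error("status: required field is missing or null");
  }
  if (!isStatus(value)) {
    throw new Error(`status: ${JSON.stringify(value)} is not one of ${STATUS_VALUES.join(", ")}`);
  }
  return value;
}
```

An error thrown here is the kind of validation failure that would feed a capped repair pass, and then the human or fallback route if the repair does not fix it.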