What is schema-based extraction?

Schema-based extraction is the technique of supplying an explicit schema (Pydantic, JSON Schema, Zod) to an LLM-driven extractor so the output is type-checked and predictable. The schema does two jobs at once: it tells the model what shape to produce, and it catches the cases where the model produced something that doesn't fit. It reduces hallucinations, eliminates parsing code, and turns "pull data from this text" from a fragile pipeline into a single typed call.
Before LLM-driven extraction, pulling structured data out of unstructured text was a multi-stage pipeline: a regex or selector to find the candidate, parsing logic to clean it, validation to catch mistakes. Each stage was a place to break, and "the model returned weird stuff this time" was a category of bug everyone shipped. Schema-based extraction is the realization that you can collapse the pipeline into one stage by handing the model a schema and asking it to fill the schema directly. The schema does two jobs at once — it tells the model what shape to produce, and it acts as the validator that catches the cases where it didn't.
What "schema" means in this context
The schema is whatever describes the desired output shape in a way the LLM provider can constrain on. The most common forms:
- Pydantic models (Python) — the most common in agent stacks, since Pydantic doubles as the runtime validation layer.
- JSON Schema (language-agnostic) — the underlying standard most LLM providers' "structured output" modes constrain on.
- Zod schemas (TypeScript / JavaScript) — the equivalent in JS/TS stacks.
- TypedDict, dataclass, attrs in Python; equivalents elsewhere — used as schema sources by libraries that translate them to JSON Schema.
The model providers (OpenAI's response_format, Anthropic's tool-use schemas, Google's controlled generation) all expose a way to constrain output to a schema. Schema-based extraction is using that constraint on the way out, not just generating free text and parsing it after.
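The bridge between these forms is mechanical: libraries emit JSON Schema from the native type. A minimal sketch with Pydantic (the `Product` model is a made-up example, not from the Notte docs):

```python
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price_usd: float

# model_json_schema() emits the JSON Schema that provider
# "structured output" modes constrain generation against.
schema = Product.model_json_schema()
print(schema["required"])                         # ['name', 'price_usd']
print(schema["properties"]["price_usd"]["type"])  # 'number'
```

The same schema object is what gets sent to the provider as the output constraint, so the Python type and the wire-level contract never drift apart.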
Why this changes the failure mode
Without a schema, an LLM extracting data fails in three creative ways:
- Hallucinated fields. Returns a key the schema didn't ask for. Code that expects a flat shape blows up.
- Hallucinated values. Invents a price, a date, a name that wasn't in the source. Plausible-looking, completely wrong.
- Format drift. Returns a date as "March 5, 2024" on one run and "2024-03-05" on the next. Downstream parsing breaks intermittently.
A schema closes all three. Constrained generation makes hallucinated fields impossible at the protocol layer; required-field validation catches missing values; type validation catches format drift. What's left — actual content errors, where the model picked the wrong number from the page — is a genuine accuracy problem schemas don't solve. That's a smaller surface than the unconstrained version.
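The validation half of that claim is easy to see in isolation. A hedged sketch using plain Pydantic (an invented `Invoice` model, independent of any SDK): a typed `date` field accepts the ISO form and rejects the drifted one.

```python
from datetime import date
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    total: float
    issued_at: date  # a typed field rejects non-ISO date strings

# The ISO 8601 string parses cleanly into a real date.
ok = Invoice.model_validate({"total": 12.5, "issued_at": "2024-03-05"})
print(ok.issued_at)  # 2024-03-05

# The drifted format is a loud ValidationError, not a silent bad row.
try:
    Invoice.model_validate({"total": 12.5, "issued_at": "March 5, 2024"})
except ValidationError:
    print("format drift caught")
```

The same mechanism covers missing required fields: omitting `total` fails validation instead of propagating a half-filled record downstream.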
Schema-based extraction vs. unconstrained extraction
| | Unconstrained | Schema-based |
|---|---|---|
| Output shape | Free text or unconstrained JSON | Schema-conformant |
| Hallucinated fields | Possible | Impossible (constraint layer) |
| Type safety in caller | Manual coercion | Automatic |
| Parse errors downstream | Common | Rare |
| Handles missing values | Silently or with a guess | Forces explicit Optional / null |
| Best for | Quick exploration, free-form summaries | Production extraction |
The empirical pattern: production extraction has converged on schema-based. Unconstrained extraction is the right answer for "summarize this page" or "what's the sentiment of this review" — outputs that aren't naturally typed. Anywhere the output is actually structured, the schema is the lower-effort path.
The Notte SDK shape
```python
from notte_sdk import NotteClient
from pydantic import BaseModel

client = NotteClient()

class Article(BaseModel):
    title: str
    author: str
    published_at: str  # ISO 8601
    word_count: int | None = None

result = client.scrape(
    url="https://news.example.com/article/123",
    response_format=Article,
)

article: Article = result.output
print(article.title, article.published_at)
```

The schema is the only extraction logic — there's no separate parsing step, no field-by-field selectors. The Pydantic model tells the LLM what to produce, and Pydantic itself validates the output. The same call works with `list[Article]` for collections, with JSON Schema dicts directly, and (in TypeScript) with Zod schemas.
The closest cousin is page-to-JSON extraction — the API surface that wraps schema-based extraction into a single URL→record call. Schema-based is the technique; page-to-JSON is the product.
Common pitfalls
- Over-specifying the schema. Forcing every field as required when the source rarely has all of them means validation fails on perfectly good rows. Use `Optional[...]` liberally — the schema should match what's available, not what's ideal.
- Schemas with vague type signatures. `str` instead of `Literal["pending", "shipped", "delivered"]` lets the model invent statuses. Constrain when the values are knowable.
- Skipping `description` / docstrings on schema fields. The model uses these as hints. A field called `revenue: float` extracts worse than `revenue_usd: float = Field(description="Annual revenue in USD; null if not disclosed.")`.
- Unbounded `dict[str, Any]`. Defeats the whole purpose. If you need flexibility, model the union explicitly.
Key takeaways
- Schema-based extraction supplies a typed schema (Pydantic, JSON Schema, Zod) to an LLM-driven extractor so the output is constrained and validated.
- The schema does two jobs: tells the model what to produce, and validates that it actually did.
- Eliminates two of the three classes of LLM extraction bugs (hallucinated fields, format drift); doesn't fix content accuracy errors.
- Closely related: page-to-JSON extraction is the API surface that wraps the technique; structured extraction is the broader category.