
What is schema-based extraction?

By Lucas Giordano · Co-founder, Notte
TL;DR

Schema-based extraction is the technique of supplying an explicit schema (Pydantic, JSON Schema, Zod) to an LLM-driven extractor so the output is type-checked and predictable. The schema does two jobs at once: it tells the model what shape to produce, and it catches the cases where the model produced something that doesn't fit. It reduces hallucinations, eliminates parsing code, and turns "pull data from this text" from a fragile pipeline into a single typed call.

What is schema-based extraction?

Before LLM-driven extraction, pulling structured data out of unstructured text was a multi-stage pipeline: a regex or selector to find the candidate, parsing logic to clean it, validation to catch mistakes. Each stage was a place to break, and "the model returned weird stuff this time" was a category of bug everyone shipped. Schema-based extraction is the realization that you can collapse the pipeline into one stage by handing the model a schema and asking it to fill the schema directly. The schema does two jobs at once — it tells the model what shape to produce, and it acts as the validator that catches the cases where it didn't.

What "schema" means in this context

The schema is whatever describes the desired output shape in a way the LLM provider can constrain on. The most common forms:

  • Pydantic models (Python) — the most common in agent stacks, since Pydantic doubles as the runtime validation layer.
  • JSON Schema (language-agnostic) — the underlying standard most LLM providers' "structured output" modes constrain on.
  • Zod schemas (TypeScript / JavaScript) — the equivalent in JS/TS stacks.
  • TypedDict, dataclass, attrs in Python; equivalents elsewhere — used as schema sources by libraries that translate them to JSON Schema.

The model providers (OpenAI's response_format, Anthropic's tool-use schemas, Google's controlled generation) all expose a way to constrain output to a schema. Schema-based extraction is using that constraint on the way out, not just generating free text and parsing it after.
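To make that translation step concrete, here's a minimal sketch assuming Pydantic v2 (the Product model is invented for the example): the Python class is just a convenient way to author the JSON Schema that the provider ultimately constrains on.

from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price_usd: float
    in_stock: bool = True

# Pydantic v2 emits the JSON Schema that providers' structured-output
# modes constrain generation against.
schema = Product.model_json_schema()
print(schema["required"])                         # ['name', 'price_usd']
print(schema["properties"]["price_usd"]["type"])  # 'number'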

Why this changes the failure mode

Without a schema, an LLM extracting data fails in three creative ways:

  • Hallucinated fields. The model returns keys you never asked for. Code that expects a fixed shape blows up.
  • Hallucinated values. Invents a price, a date, a name that wasn't in the source. Plausible-looking, completely wrong.
  • Format drift. Returns a date as "March 5, 2024" one run and "2024-03-05" the next. Downstream parsing breaks intermittently.

A schema closes the first and third outright and narrows the second. Constrained generation makes hallucinated fields impossible at the protocol layer; type validation catches format drift; required-field validation catches values that are silently missing. What's left — actual content errors, where the model picked the wrong number from the page — is a genuine accuracy problem schemas don't solve. That's a smaller surface than the unconstrained version.
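To make the validation half concrete, here's a minimal sketch assuming Pydantic v2 (the Invoice model and the drifted payload are invented for the example); the mechanical failures surface as a single ValidationError instead of a downstream parse bug.

from datetime import date

from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    invoice_id: str
    issued_on: date   # typed as a date, so "March 5, 2024" is rejected
    total_usd: float  # required, so a silently missing value fails loudly

# Output that drifted: a prose-style date and a missing total.
drifted = '{"invoice_id": "INV-204", "issued_on": "March 5, 2024"}'

try:
    Invoice.model_validate_json(drifted)
except ValidationError as exc:
    print(exc.error_count())  # 2: bad date format and missing total_usd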

Schema-based extraction vs. unconstrained extraction

|                          | Unconstrained                          | Schema-based                    |
| ------------------------ | -------------------------------------- | ------------------------------- |
| Output shape             | Free text or unconstrained JSON        | Schema-conformant               |
| Hallucinated fields      | Possible                               | Impossible (constraint layer)   |
| Type safety in caller    | Manual coercion                        | Automatic                       |
| Parse errors downstream  | Common                                 | Rare                            |
| Handles missing values   | Silently or with a guess               | Forces explicit Optional / null |
| Best for                 | Quick exploration, free-form summaries | Production extraction           |

The empirical pattern: production extraction has converged on schema-based. Unconstrained extraction is the right answer for "summarize this page" or "what's the sentiment of this review" — outputs that aren't naturally typed. Anywhere the output is actually structured, the schema is the lower-effort path.

The Notte SDK shape

main.py
from notte_sdk import NotteClient
from pydantic import BaseModel

client = NotteClient()

class Article(BaseModel):
    title: str
    author: str
    published_at: str  # ISO 8601
    word_count: int | None = None

result = client.scrape(
    url="https://news.example.com/article/123",
    response_format=Article,
)

article: Article = result.output
print(article.title, article.published_at)

The schema is the only extraction logic — there's no separate parsing step, no field-by-field selectors. The Pydantic model tells the LLM what to produce, and Pydantic itself validates the output. The same call works with list[Article] for collections, with JSON Schema dicts directly, and (in TypeScript) with Zod schemas.
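For collections, here's a hedged sketch of the variant described above, reusing the client and Article model from main.py; the archive URL is illustrative, and it assumes the output attribute behaves the same way for list schemas.

# Same call, but the schema is a list type.
results = client.scrape(
    url="https://news.example.com/archive",
    response_format=list[Article],
)

articles: list[Article] = results.output
for article in articles:
    print(article.title)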

The closest cousin is page-to-JSON extraction — the API surface that wraps schema-based extraction into a single URL→record call. Schema-based is the technique; page-to-JSON is the product.

Common pitfalls

  • Over-specifying the schema. Forcing every field as required when the source rarely has all of them. Validation fails on perfectly good rows. Use Optional[...] liberally — the schema should match what's available, not what's ideal.
  • Schemas with vague type signatures. str instead of Literal["pending", "shipped", "delivered"] lets the model invent statuses. Constrain when the values are knowable.
  • Skipping description / docstrings on schema fields. The model uses these as hints. A field called revenue: float extracts worse than revenue_usd: float = Field(description="Annual revenue in USD; null if not disclosed."). See the sketch after this list.
  • Unbounded dict[str, Any]. Defeats the whole purpose. If you need flexibility, model the union explicitly.
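Putting those together, here's a hedged sketch of a schema shaped to avoid the pitfalls above (the CompanyProfile model and its fields are hypothetical; assumes Pydantic v2):

from typing import Literal, Optional

from pydantic import BaseModel, Field

class CompanyProfile(BaseModel):
    name: str = Field(description="Legal company name as written on the page.")
    # Constrained where the values are knowable, instead of a bare str.
    status: Literal["private", "public", "acquired"] = Field(
        description="Ownership status; pick the closest match."
    )
    # Optional where the source often omits the value, instead of required.
    revenue_usd: Optional[float] = Field(
        default=None,
        description="Annual revenue in USD; null if not disclosed.",
    )
    employee_count: Optional[int] = Field(
        default=None,
        description="Headcount; null if not stated.",
    )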

Key takeaways

  • Schema-based extraction supplies a typed schema (Pydantic, JSON Schema, Zod) to an LLM-driven extractor so the output is constrained and validated.
  • The schema does two jobs: tells the model what to produce, and validates that it actually did.
  • Eliminates two of the three classes of LLM extraction bugs (hallucinated fields, format drift); doesn't fix content accuracy errors.
  • Closely related: page-to-JSON extraction is the API surface that wraps the technique; structured extraction is the broader category.
