What is structured extraction?
Structured extraction is the step that turns unstructured input — raw HTML, a PDF, a screenshot, a Markdown blob — into typed, schema-validated data: a row in a database, a Pydantic model, a JSON object that conforms to a JSON Schema. It's the bridge between the messy open web and the clean type-safe data your code wants to consume.
Two ideas have collided to make modern structured extraction work: large language models, which are exceptionally good at reading context-dependent natural language, and structured-output systems (Pydantic, Zod, JSON Schema, OpenAI's response format), which let you constrain the model's output to a specific shape. Together they let you ask "give me this URL as a list of Job records" and get back something a downstream system can validate and use.
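That request, "this URL as a list of Job records", is itself just a schema. A minimal sketch, assuming Pydantic and illustrative field names: a wrapper model gives the list a shape the runtime can validate, not just the individual items.

```python
# Hypothetical "list of Job records" schema. Field names are illustrative;
# the wrapper model lets the whole collection be validated, not just each item.
from pydantic import BaseModel

class Job(BaseModel):
    title: str
    company: str
    location: str | None = None

class JobListings(BaseModel):
    jobs: list[Job]
```

Passed as a response format (as in the Notte example below), this is that request expressed as a contract.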
The problem structured extraction solves
The classical extraction stack was a pipeline of brittle stages:
- Scrape the page.
- Find the right element with a CSS selector.
- Pull text out of it.
- Clean it with regexes.
- Hope the page never changes.
Every stage is a maintenance liability. Selectors break when the site redesigns. Regexes break on edge cases. The output isn't typed — you get back strings and have to coerce them. And if the data is laid out in a way that doesn't map cleanly to a selector (a free-form description that contains the price you want), you're stuck.
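For reference, here is roughly what that brittle pipeline looks like in code: a sketch assuming requests and BeautifulSoup, with a made-up selector and regex for a price field.

```python
# Classical pipeline sketch (selector + regex). The page structure is made up.
import re

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example-shop.com/product/widget").text
soup = BeautifulSoup(html, "html.parser")

# Breaks the moment the site renames or moves this element.
price_el = soup.select_one("div.product-info span.price")
if price_el is None:
    raise RuntimeError("selector no longer matches the page")

# Breaks on "1 299,00 EUR", sale banners, bundled prices, ...
match = re.search(r"(\d+\.\d{2})", price_el.get_text())
price = float(match.group(1)) if match else None  # still an untyped float-or-None, not a record
```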
Structured extraction collapses that pipeline. You hand the model the page and a schema; the model figures out which parts of the input correspond to which fields, returns the result in the right shape, and the runtime validates it before you see it. When the site changes, the model adapts — the schema doesn't.
How structured extraction works
The recipe in modern extraction systems looks like this:
- Get clean input. Convert the source (HTML, PDF, screenshot) to a form an LLM reads efficiently. For HTML, that usually means Markdown conversion with boilerplate stripped.
- Define the schema. Specify the shape of the output as a Pydantic model, JSON Schema, or Zod schema. This is the contract.
- Constrain the model's output. Use the LLM provider's structured-output mode (or a tool-calling pattern) so the model is forced to emit valid JSON that conforms to the schema.
- Validate. Parse the output through the schema. Schemas catch hallucinated fields, wrong types, and missing required values before they reach your code.
- Retry on failure. If validation fails, retry with the error message included in the next prompt — the model uses it to correct itself.
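If you were wiring those steps up yourself, a minimal sketch of steps 3-5 might look like the following, assuming the OpenAI Python SDK's JSON mode and Pydantic; the model name, prompts, and the already-cleaned Markdown input are placeholders.

```python
# Hand-rolled sketch of steps 3-5 (constrain, validate, retry), assuming the
# OpenAI Python SDK's JSON mode and Pydantic. Model name and prompts are placeholders.
import json

from openai import OpenAI
from pydantic import BaseModel, ValidationError

class Product(BaseModel):
    name: str
    price_eur: float
    in_stock: bool

client = OpenAI()
page_markdown = "..."  # step 1: cleaned page content (HTML converted to Markdown)

messages = [
    {
        "role": "system",
        "content": "Extract the product as JSON matching this schema:\n"
        + json.dumps(Product.model_json_schema()),
    },
    {"role": "user", "content": page_markdown},
]

product = None
for attempt in range(3):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        response_format={"type": "json_object"},  # step 3: force syntactically valid JSON
    )
    raw = response.choices[0].message.content
    try:
        product = Product.model_validate_json(raw)  # step 4: schema validation
        break
    except ValidationError as err:
        # Step 5: feed the validation errors back so the model can correct itself.
        messages.append({"role": "assistant", "content": raw})
        messages.append({"role": "user", "content": f"Fix these validation errors:\n{err}"})
```

Hosted runtimes wrap this loop, plus the rendering and cleanup of steps 1-2, behind a single call, which is what the Notte example below does.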
A minimal Notte example:
```python
from notte_sdk import NotteClient
from pydantic import BaseModel

client = NotteClient()

class Product(BaseModel):
    name: str
    price_eur: float
    in_stock: bool
    review_count: int | None = None

# Notte runs the whole stack: render, clean, extract, validate.
result = client.scrape(
    url="https://example-shop.com/product/widget",
    response_format=Product,
)

product: Product = result.output
print(product.price_eur, product.in_stock)
```

Notte handles the rendering, the cleanup, the LLM call with the schema, and the validation in a single API call. From your side, it's URL + schema → Product.
Structured extraction vs. selector-based vs. LLM-only
| | Selector-based | Structured extraction (LLM + schema) | LLM-only (no schema) |
|---|---|---|---|
| Survives layout changes | No | Yes | Yes |
| Output shape guaranteed | If selectors work | Yes (schema-validated) | No |
| Hallucination risk | None (it just breaks) | Low (schema catches it) | High |
| Cost per call | Lowest | Mid | Mid |
| Best for | Stable, high-volume sites | Most production cases | Quick exploration |
The middle option, LLM plus schema, is where almost all production extraction has converged. Pure selector-based scraping survives only on sites that never change. Pure LLM output without a schema is unsafe to feed into typed systems.
Common use cases
Structured extraction shows up wherever the open web meets typed code:
- Building agent tool inputs. A browser agent or an Anything API workflow needs to return typed data so callers can compose it.
- Lead enrichment. Turn LinkedIn / Crunchbase / company-website pages into a `Company` record with industry, size, and tech stack (see the sketch after this list).
- Product monitoring. Track e-commerce listings as `Product` records — price, availability, reviews — over time.
- Job and listing aggregation. Crawl job boards or real-estate sites into uniform records that can be filtered and de-duped.
- Document understanding. Extract structured data from PDFs, invoices, contracts.
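The schemas for these cases follow the same pattern as the Product example above. A hypothetical Company record for the lead-enrichment case, with field names chosen purely for illustration:

```python
# Hypothetical Company schema for lead enrichment; it plugs into the same
# scrape(url=..., response_format=...) call shown earlier.
from pydantic import BaseModel, Field

class Company(BaseModel):
    name: str
    industry: str
    employee_count: int | None = None
    tech_stack: list[str] = Field(default_factory=list)
```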
When to use structured extraction
Use it when:
- The shape of the data you want is stable, even if the source layout isn't.
- The output flows into a typed system (database, Pydantic-using backend, API response).
- You can specify the schema up front (you usually can).
- The source has enough natural-language context for an LLM to disambiguate fields.
Don't reach for it when:
- The data is already structured and accessible (a public API, a CSV download).
- You need millisecond response times — LLM calls add hundreds of ms.
- The schema is too vague to constrain the output meaningfully.
Key takeaways
- Structured extraction is the modern replacement for selectors + regexes: hand the LLM the page and a schema, get back typed, validated data.
- Schemas are the safety net that catches hallucinations before they reach your code.
- The LLM provides robustness to layout changes; the schema provides robustness to model errors.
- It's the right default for production extraction; selector-based scraping is reserved for stable, high-volume sites where the cost difference matters.
- Once your output is a typed model, downstream composition with the rest of your codebase is the same as any other API.
If you're working in Python, page-to-JSON extraction is the most direct next read. For broader format coverage, schema-based extraction covers JSON Schema and other type systems.