
What is page-to-JSON extraction?

By Lucas Giordano · Co-founder, Notte
TL;DR

Page-to-JSON extraction is the API shape: pass a URL and a schema (Pydantic, Zod, JSON Schema, or a plain dict), receive a validated structured record. It's the product surface that wraps the underlying schema-based extraction technique behind a single endpoint — no scraping pipeline to build, no parsing to maintain.

What is page-to-JSON extraction?

Most teams that need structured data from a webpage end up assembling the same stack: a scraper, a parser, a schema validator, retry logic, the glue between them. The work is undifferentiated; the bug surface is large; the maintenance never ends. Page-to-JSON extraction collapses that into one HTTP call. The provider does the scraping, runs the LLM, validates against your schema, and returns a typed record — or a clean error explaining why it couldn't.

How it differs from schema-based extraction

The two are easy to conflate, but they sit at different levels:

  • Schema-based extraction is the technique — using a schema (Pydantic, Zod, JSON Schema) to constrain an LLM's output so it conforms to your types. You can apply it to any text input.
  • Page-to-JSON extraction is the API surface — a single endpoint that takes a URL and a schema, fetches the page, applies schema-based extraction internally, and returns the result. It's "schema-based extraction, packaged."

Schema-based is what's happening under the hood. Page-to-JSON is the call you make. Same way "REST" is a technique and "the Stripe API" is a product.
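To make the two levels concrete, here is a minimal sketch of the technique level: schema-based extraction over any text input, with the model call stubbed out so the snippet runs offline. `extract_structured` and `stub_model` are hypothetical helpers for illustration, not part of any SDK.

```python
import json

def stub_model(prompt: str) -> str:
    # Stand-in for an LLM call. A real implementation would send `prompt`
    # to a model; here we return canned JSON so the sketch is runnable.
    return '{"name": "Widget", "price_eur": 19.99, "in_stock": true}'

def extract_structured(text: str, schema: dict) -> dict:
    # The technique: embed the schema in the prompt so the model's output
    # is constrained to your types, then parse and validate the result.
    prompt = (
        f"Extract data matching this JSON Schema:\n{json.dumps(schema)}\n\n"
        f"Text:\n{text}"
    )
    record = json.loads(stub_model(prompt))
    missing = [k for k in schema.get("required", []) if k not in record]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return record

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price_eur": {"type": "number"},
        "in_stock": {"type": "boolean"},
    },
    "required": ["name", "price_eur", "in_stock"],
}

# The technique applies to any text -- a scraped page, an email, a PDF dump.
record = extract_structured("Widget -- EUR 19.99, in stock", schema)
print(record["name"], record["price_eur"])
```

Page-to-JSON is this same loop with the "get the text" step handled for you: the endpoint fetches and renders the URL, then runs the extraction internally.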

The Notte SDK shape

main.py
from notte_sdk import NotteClient
from pydantic import BaseModel

client = NotteClient()

class Product(BaseModel):
    name: str
    price_eur: float
    in_stock: bool
    review_count: int | None = None

# URL in, validated Product out — no scraping pipeline to build.
result = client.scrape(
    url="https://example-shop.com/product/widget",
    response_format=Product,
)

product: Product = result.output
print(product.price_eur, product.in_stock)

The pattern accepts more than just Pydantic. The same call works with a JSON Schema dict, a TypedDict, or a list[Model] for collections. From the caller's point of view it's always URL in, validated record out.
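For instance, the `Product` model above has a plain JSON Schema dict equivalent. The dict below is a sketch of that shape; the SDK calls are shown as comments because they need a live client and API key, and the exact set of accepted schema formats is whatever your SDK version supports.

```python
# JSON Schema dict equivalent of the Product Pydantic model above.
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price_eur": {"type": "number"},
        "in_stock": {"type": "boolean"},
        # Optional field: allow null when the page has no review count.
        "review_count": {"type": ["integer", "null"]},
    },
    "required": ["name", "price_eur", "in_stock"],
}

# Same endpoint, different schema shapes (calls elided):
# result = client.scrape(url=..., response_format=product_schema)  # dict
# result = client.scrape(url=..., response_format=list[Product])   # collection
```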

What the API does for you

Behind the single call, several layers run:

  • Page rendering. A real browser executes the page so JavaScript-rendered content is captured (see JavaScript rendering for web scraping).
  • Cleaning. Boilerplate is stripped; the meaningful page content is converted into a form the LLM consumes efficiently.
  • Schema-conformant extraction. An LLM call constrained by your schema fills the fields. Schema validation runs at the model layer, not after.
  • Retry on failure. If the output doesn't validate, the call retries with the validation error included in the next prompt — the model self-corrects.
  • Single response. You get back either a valid record or a clean failure with a reason; nothing in between to parse.
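The retry step can be sketched as a loop that feeds the validation error back into the next prompt. Everything below is a stand-in written for illustration: `fake_model` plays the role of the LLM (it returns a bad record first, then corrects itself once the error appears in the prompt), and `validate` stands in for schema validation.

```python
import json

def fake_model(prompt: str) -> str:
    # Stand-in for the LLM: self-corrects once the prompt carries an error.
    if "failed validation" in prompt:
        return '{"name": "Widget", "price_eur": 19.99}'
    return '{"name": "Widget", "price_eur": "cheap"}'  # wrong type

def validate(record: dict) -> list[str]:
    # Stand-in for schema validation; returns a list of error messages.
    errors = []
    if not isinstance(record.get("price_eur"), (int, float)):
        errors.append("price_eur must be a number")
    return errors

def extract_with_retry(page_text: str, max_attempts: int = 3) -> dict:
    prompt = f"Extract name and price_eur as JSON from:\n{page_text}"
    for _ in range(max_attempts):
        record = json.loads(fake_model(prompt))
        errors = validate(record)
        if not errors:
            return record  # valid record: done
        # Self-correction: append the validation error to the next prompt.
        prompt += f"\nPrevious attempt failed validation: {errors}"
    raise ValueError(f"could not produce a valid record: {errors}")

print(extract_with_retry("Widget -- EUR 19.99"))
```

The caller never sees the intermediate failure: the loop either converges on a valid record or surfaces a single clean error after the final attempt.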

When to use page-to-JSON vs. building it yourself

Reach for page-to-JSON when:

  • The data shape fits a schema you can write down.
  • You don't want to maintain scraping infrastructure.
  • The latency budget allows for an LLM call (seconds, not milliseconds).

Build it yourself when:

  • You're scraping at extreme volume and per-call cost matters more than per-call simplicity.
  • The data shape changes too often to maintain a schema (rare in practice).
  • You need sub-second response times.

For most production use cases, the API call is the right shape. Maintaining a custom Playwright + parser stack is rarely the work that differentiates a product.

Common pitfalls

  • Over-specifying the schema. A 40-field Pydantic model from a page that only has 5 visible fields wastes tokens and lowers accuracy. Match schema to what's actually on the page.
  • Forgetting Optional fields. Real-world pages routinely omit fields. If your schema marks everything as required, validation will fail on perfectly good records.
  • Treating the response as raw text. The whole point is type-safety; pass response_format and consume result.output as a real model, not a JSON string.

Key takeaways

  • Page-to-JSON extraction is the API shape that wraps schema-based extraction into a single call: URL in, validated record out.
  • Notte's client.scrape(url=..., response_format=Schema) accepts Pydantic, JSON Schema, TypedDict, or list[Schema] — language-agnostic in shape, ergonomic in Python.
  • Use it whenever the data fits a writable schema and the latency budget allows an LLM call. Build it yourself only at extreme volume or sub-second SLAs.
  • The schema is doing two jobs at once: type-safety for your code and a guard against model hallucinations.

Build your AI agent on the open web with Notte

Cloud browsers, agent identities, and the Anything API — everything you need to ship reliable browser agents in production.