
What is page-to-JSON extraction?

By Lucas Giordano · Co-founder, Notte
TL;DR

Page-to-JSON extraction is the API shape: pass a URL and a schema (Pydantic, Zod, JSON Schema, or a plain dict), receive a validated structured record. It's the product surface that wraps the underlying schema-based extraction technique behind a single endpoint — no scraping pipeline to build, no parsing to maintain.

What is page-to-JSON extraction?

Most teams that need structured data from a webpage end up assembling the same stack: a scraper, a parser, a schema validator, retry logic, the glue between them. The work is undifferentiated; the bug surface is large; the maintenance never ends. Page-to-JSON extraction collapses that into one HTTP call. The provider does the scraping, runs the LLM, validates against your schema, and returns a typed record — or a clean error explaining why it couldn't.

How it differs from schema-based extraction

The two are easy to conflate, but they sit at different levels:

  • Schema-based extraction is the technique — using a schema (Pydantic, Zod, JSON Schema) to constrain an LLM's output so it conforms to your types. You can apply it to any text input.
  • Page-to-JSON extraction is the API surface — a single endpoint that takes a URL and a schema, fetches the page, applies schema-based extraction internally, and returns the result. It's "schema-based extraction, packaged."

Schema-based is what's happening under the hood. Page-to-JSON is the call you make. Same way "REST" is a technique and "the Stripe API" is a product.
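To make the two levels concrete, here is a minimal sketch of the technique level: schema-based extraction over any text input, with the model call stubbed out so the snippet runs offline. `extract_structured` and `stub_model` are hypothetical helpers for illustration, not part of any SDK.

```python
import json

def stub_model(prompt: str) -> str:
    # Stand-in for an LLM call. A real implementation would send `prompt`
    # to a model; here we return canned JSON so the sketch is runnable.
    return '{"name": "Widget", "price_eur": 19.99, "in_stock": true}'

def extract_structured(text: str, schema: dict) -> dict:
    # The technique: embed the schema in the prompt so the model's output
    # is constrained to your types, then parse and validate the result.
    prompt = (
        f"Extract data matching this JSON Schema:\n{json.dumps(schema)}\n\n"
        f"Text:\n{text}"
    )
    record = json.loads(stub_model(prompt))
    missing = [k for k in schema.get("required", []) if k not in record]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return record

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price_eur": {"type": "number"},
        "in_stock": {"type": "boolean"},
    },
    "required": ["name", "price_eur", "in_stock"],
}

# The technique applies to any text -- a scraped page, an email, a PDF dump.
record = extract_structured("Widget -- EUR 19.99, in stock", schema)
print(record["name"], record["price_eur"])
```

Page-to-JSON is this same loop with the "get the text" step handled for you: the endpoint fetches and renders the URL, then runs the extraction internally.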

The Notte SDK shape

main.py
from notte_sdk import NotteClient
from pydantic import BaseModel

client = NotteClient()

class Product(BaseModel):
    name: str
    price_eur: float
    in_stock: bool
    review_count: int | None = None

# URL in, validated Product out — no scraping pipeline to build.
result = client.scrape(
    url="https://example-shop.com/product/widget",
    response_format=Product,
)

product: Product = result.output
print(product.price_eur, product.in_stock)

The pattern accepts more than just Pydantic. The same call works with a JSON Schema dict, a TypedDict, or a list[Model] for collections. From the caller's point of view it's always URL in, validated record out.
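For instance, the `Product` model above has a plain JSON Schema dict equivalent. The dict below is a sketch of that shape; the SDK calls are shown as comments because they need a live client and API key, and the exact set of accepted schema formats is whatever your SDK version supports.

```python
# JSON Schema dict equivalent of the Product Pydantic model above.
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price_eur": {"type": "number"},
        "in_stock": {"type": "boolean"},
        # Optional field: allow null when the page has no review count.
        "review_count": {"type": ["integer", "null"]},
    },
    "required": ["name", "price_eur", "in_stock"],
}

# Same endpoint, different schema shapes (calls elided):
# result = client.scrape(url=..., response_format=product_schema)  # dict
# result = client.scrape(url=..., response_format=list[Product])   # collection
```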

What the API does for you

Behind the single call, several layers run:

  • Page rendering. A real browser executes the page so JavaScript-rendered content is captured (see JavaScript rendering for web scraping).
  • Cleaning. Boilerplate is stripped; the meaningful page content is converted into a form the LLM consumes efficiently.
  • Schema-conformant extraction. An LLM call constrained by your schema fills the fields. Schema validation runs at the model layer, not after.
  • Retry on failure. If the output doesn't validate, the call retries with the validation error included in the next prompt — the model self-corrects.
  • Single response. You get back either a valid record or a clean failure with a reason; nothing in between to parse.
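The retry step can be sketched as a loop that feeds the validation error back into the next prompt. Everything below is a stand-in written for illustration: `fake_model` plays the role of the LLM (it returns a bad record first, then corrects itself once the error appears in the prompt), and `validate` stands in for schema validation.

```python
import json

def fake_model(prompt: str) -> str:
    # Stand-in for the LLM: self-corrects once the prompt carries an error.
    if "failed validation" in prompt:
        return '{"name": "Widget", "price_eur": 19.99}'
    return '{"name": "Widget", "price_eur": "cheap"}'  # wrong type

def validate(record: dict) -> list[str]:
    # Stand-in for schema validation; returns a list of error messages.
    errors = []
    if not isinstance(record.get("price_eur"), (int, float)):
        errors.append("price_eur must be a number")
    return errors

def extract_with_retry(page_text: str, max_attempts: int = 3) -> dict:
    prompt = f"Extract name and price_eur as JSON from:\n{page_text}"
    for _ in range(max_attempts):
        record = json.loads(fake_model(prompt))
        errors = validate(record)
        if not errors:
            return record  # valid record: done
        # Self-correction: append the validation error to the next prompt.
        prompt += f"\nPrevious attempt failed validation: {errors}"
    raise ValueError(f"could not produce a valid record: {errors}")

print(extract_with_retry("Widget -- EUR 19.99"))
```

The caller never sees the intermediate failure: the loop either converges on a valid record or surfaces a single clean error after the final attempt.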

When to use page-to-JSON vs. building it yourself

Reach for page-to-JSON when:

  • The data shape fits a schema you can write down.
  • You don't want to maintain scraping infrastructure.
  • The latency budget allows for an LLM call (seconds, not milliseconds).

Build it yourself when:

  • You're scraping at extreme volume and per-call cost matters more than per-call simplicity.
  • The data shape changes too often to maintain a schema (rare in practice).
  • You need sub-second response times.

For most production use cases, the API call is the right shape. Maintaining a custom Playwright + parser stack is rarely the work that differentiates a product.

Common pitfalls

  • Over-specifying the schema. A 40-field Pydantic model from a page that only has 5 visible fields wastes tokens and lowers accuracy. Match schema to what's actually on the page.
  • Forgetting Optional fields. Real-world pages routinely omit fields. If your schema marks everything as required, validation will fail on perfectly good records.
  • Treating the response as raw text. The whole point is type-safety; pass response_format and consume result.output as a real model, not a JSON string.

Key takeaways

  • Page-to-JSON extraction is the API shape that wraps schema-based extraction into a single call: URL in, validated record out.
  • Notte's client.scrape(url=..., response_format=Schema) accepts Pydantic, JSON Schema, TypedDict, or list[Schema] — language-agnostic in shape, ergonomic in Python.
  • Use it whenever the data fits a writable schema and the latency budget allows an LLM call. Build it yourself only at extreme volume or sub-second SLAs.
  • The schema is doing two jobs at once: type-safety for your code and a guard against model hallucinations.

Build your AI agent on the open web with Notte

Cloud browsers, agent identities, and the Anything API — everything you need to ship reliable browser agents in production.