What is page-to-JSON extraction?

Page-to-JSON extraction is the API shape: pass a URL and a schema (Pydantic, Zod, JSON Schema, or a plain dict), receive a validated structured record. It's the product surface that wraps the underlying schema-based extraction technique behind a single endpoint — no scraping pipeline to build, no parsing to maintain.
Most teams that need structured data from a webpage end up assembling the same stack: a scraper, a parser, a schema validator, retry logic, the glue between them. The work is undifferentiated; the bug surface is large; the maintenance never ends. Page-to-JSON extraction collapses that into one HTTP call. The provider does the scraping, runs the LLM, validates against your schema, and returns a typed record — or a clean error explaining why it couldn't.
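To make the comparison concrete, here is a sketch of the stack that call replaces. Everything in it is illustrative: the URL and CSS selectors are hypothetical, and a real version would also need a headless browser for JavaScript-heavy pages.

```python
# A sketch of the DIY stack: fetch, parse, validate, retry.
# The selectors are hypothetical and break whenever the target
# markup changes; plain requests also runs no JavaScript.
import requests
from bs4 import BeautifulSoup
from pydantic import BaseModel, ValidationError


class Product(BaseModel):
    name: str
    price_eur: float
    in_stock: bool


def scrape_product(url: str, retries: int = 3) -> Product:
    last_error: Exception | None = None
    for _ in range(retries):
        try:
            html = requests.get(url, timeout=10).text
            soup = BeautifulSoup(html, "html.parser")
            return Product(
                name=soup.select_one(".product-name").get_text(strip=True),
                price_eur=float(
                    soup.select_one(".price").get_text(strip=True).removeprefix("€")
                ),
                in_stock=soup.select_one(".in-stock") is not None,
            )
        except (requests.RequestException, AttributeError, ValueError, ValidationError) as err:
            last_error = err
    raise RuntimeError(f"scrape failed after {retries} attempts") from last_error
```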
How it differs from schema-based extraction
The two are easy to conflate, but they sit at different levels:
- Schema-based extraction is the technique — using a schema (Pydantic, Zod, JSON Schema) to constrain an LLM's output so it conforms to your types. You can apply it to any text input.
- Page-to-JSON extraction is the API surface — a single endpoint that takes a URL and a schema, fetches the page, applies schema-based extraction internally, and returns the result. It's "schema-based extraction, packaged."
Schema-based is what's happening under the hood. Page-to-JSON is the call you make. Same way "REST" is a technique and "the Stripe API" is a product.
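To show the technique layer on its own, here is a minimal sketch of schema-based extraction over arbitrary text. The `complete` argument is a stand-in for any prompt-in, string-out LLM call, not a real client API; the schema does both the constraining and the validating:

```python
import json

from pydantic import BaseModel


def extract(text: str, schema: type[BaseModel], complete) -> BaseModel:
    """Schema-based extraction: constrain an LLM call with a schema,
    then validate the output against it. Works on any text input."""
    prompt = (
        "Return a JSON object that conforms to this JSON Schema:\n"
        f"{json.dumps(schema.model_json_schema())}\n\n"
        f"Extract it from this text:\n{text}"
    )
    raw = complete(prompt)  # any LLM client plugs in here
    return schema.model_validate_json(raw)  # raises if it doesn't conform
```

Page-to-JSON wraps exactly this: the provider fetches and cleans the page to produce the text, runs the constrained call, and hands back the validated record.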
The Notte SDK shape
```python
from notte_sdk import NotteClient
from pydantic import BaseModel

client = NotteClient()


class Product(BaseModel):
    name: str
    price_eur: float
    in_stock: bool
    review_count: int | None = None


# URL in, validated Product out — no scraping pipeline to build.
result = client.scrape(
    url="https://example-shop.com/product/widget",
    response_format=Product,
)

product: Product = result.output
print(product.price_eur, product.in_stock)
```

The pattern accepts more than just Pydantic. The same call works with a JSON Schema dict, a TypedDict, or a `list[Model]` for collections. From the caller's point of view it's always URL in, validated record out.
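For illustration, the same call with two of the other accepted shapes. These are sketches that assume the behavior described above; the URLs are placeholders and the exact return types may differ:

```python
# JSON Schema dict instead of a Pydantic model; the output is a plain
# dict validated against the schema rather than a typed object.
raw_result = client.scrape(
    url="https://example-shop.com/product/widget",
    response_format={
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "price_eur": {"type": "number"},
        },
        "required": ["name", "price_eur"],
    },
)

# list[Model] for collections: one validated record per item on the page.
listing = client.scrape(
    url="https://example-shop.com/category/widgets",
    response_format=list[Product],
)
```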
What the API does for you
Behind the single call, several layers run:
- Page rendering. A real browser executes the page so JavaScript-rendered content is captured (see JavaScript rendering for web scraping).
- Cleaning. Boilerplate is stripped; the meaningful page content is converted into a form the LLM consumes efficiently.
- Schema-conformant extraction. An LLM call constrained by your schema fills the fields. Schema validation runs at the model layer, not after.
- Retry on failure. If the output doesn't validate, the call retries with the validation error included in the next prompt — the model self-corrects (sketched after this list).
- Single response. You get back either a valid record or a clean failure with a reason; nothing in between to parse.
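The retry step is the least obvious layer, so here is a minimal sketch of it, reusing the stand-in `complete` function from earlier. The key move is feeding the validation error back into the next prompt:

```python
from pydantic import BaseModel, ValidationError


def extract_with_retry(
    text: str, schema: type[BaseModel], complete, max_attempts: int = 3
) -> BaseModel:
    prompt = (
        f"Return JSON conforming to:\n{schema.model_json_schema()}\n\n"
        f"Source text:\n{text}"
    )
    for _ in range(max_attempts):
        raw = complete(prompt)
        try:
            return schema.model_validate_json(raw)
        except ValidationError as err:
            # Append the validation error so the model can self-correct.
            prompt += f"\n\nYour previous output failed validation:\n{err}\nTry again."
    raise RuntimeError(f"no schema-valid output after {max_attempts} attempts")
```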
When to use page-to-JSON vs. building it yourself
Reach for page-to-JSON when:
- The data shape fits a schema you can write down.
- You don't want to maintain scraping infrastructure.
- The latency budget allows for an LLM call (seconds, not milliseconds).
Build it yourself when:
- You're scraping at extreme volume and per-call cost matters more than per-call simplicity.
- The data shape changes too often to maintain a schema (rare in practice).
- You need sub-second response times.
For most production use cases, the API call is the right shape. Maintaining a custom Playwright + parser stack is rarely the work that differentiates a product.
Common pitfalls
- Over-specifying the schema. A 40-field Pydantic model from a page that only has 5 visible fields wastes tokens and lowers accuracy. Match schema to what's actually on the page.
- Forgetting `Optional` fields. Real data has missing fields. If your schema says everything is required, validation will fail on perfectly good rows (see the sketch after this list).
- Treating the response as raw text. The whole point is type safety; pass `response_format` and consume `result.output` as a real model, not a JSON string.
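To see the `Optional` pitfall in action, a quick runnable illustration: the strict schema rejects a row that is missing one field, while the lenient one accepts it.

```python
from pydantic import BaseModel, ValidationError


class Strict(BaseModel):
    name: str
    review_count: int  # required: pages with no reviews fail validation


class Lenient(BaseModel):
    name: str
    review_count: int | None = None  # optional: missing data is fine


row = {"name": "Widget"}  # a perfectly good row, just no review count

try:
    Strict.model_validate(row)
except ValidationError as err:
    print(f"strict schema rejected it: {err.error_count()} error")

print(Lenient.model_validate(row))  # name='Widget' review_count=None
```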
Key takeaways
- Page-to-JSON extraction is the API shape that wraps schema-based extraction into a single call: URL in, validated record out.
- Notte's `client.scrape(url=..., response_format=Schema)` accepts Pydantic, JSON Schema, TypedDict, or `list[Schema]` — language-agnostic in shape, ergonomic in Python.
- Use it whenever the data fits a writable schema and the latency budget allows an LLM call. Build it yourself only at extreme volume or sub-second SLAs.
- The schema does two jobs at once: type safety for your code and a guard against model hallucinations.