Web Data for AI
Once you've fetched a webpage, the harder problem begins: turning messy HTML into something an AI system can actually use. This category defines the vocabulary of that transformation: structured extraction against a schema (Pydantic, JSON Schema), natural-language extraction, HTML-to-Markdown conversion, table and PDF parsing, and the broader idea of 'LLM-ready content.' The terms matter because the quality of the data you feed an LLM largely determines the quality of its output: a clean, semantically structured page generally yields better results than a noisy raw HTML dump. These definitions cover the patterns and formats that bridge the open web and the systems that learn from it.
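To make the transformation concrete, here is a minimal sketch using only the Python standard library. It strips noisy tags, emits Markdown for headings and paragraphs, and maps table rows onto a typed record. Everything named here (`MarkdownExtractor`, `Product`, `extract`, the sample page) is hypothetical and chosen for illustration; a production pipeline would more likely use a dedicated HTML-to-Markdown converter and a schema library such as Pydantic, and the dataclass below only stands in for that schema.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Tags whose content is treated as noise (an assumption for this sketch).
NOISE_TAGS = {"script", "style", "nav", "footer"}
HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}


class MarkdownExtractor(HTMLParser):
    """Collects headings, paragraphs, and table cells; ignores noisy tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished Markdown blocks
        self.rows = []       # table rows, each a list of cell strings
        self._skip = 0       # nesting depth inside noise tags
        self._prefix = None  # Markdown prefix of the open text block
        self._text = []      # text fragments of the open block or cell
        self._row = None     # cells of the open table row

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip += 1
        elif self._skip:
            return
        elif tag in HEADINGS or tag == "p":
            self._prefix, self._text = HEADINGS.get(tag, ""), []
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._text = []

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self._skip = max(0, self._skip - 1)
        elif self._skip:
            return
        elif tag in HEADINGS or tag == "p":
            if self._prefix is not None:
                self.blocks.append(self._prefix + self._joined())
            self._prefix = None
        elif tag in ("td", "th") and self._row is not None:
            self._row.append(self._joined())
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if not self._skip:
            self._text.append(data)

    def _joined(self):
        # Collapse runs of whitespace left over from the HTML layout.
        return " ".join("".join(self._text).split())


@dataclass
class Product:
    """Hypothetical target schema for the rows of the sample table."""
    name: str
    price_usd: float


def extract(html):
    """Return (markdown_text, products) for a page with one two-column table."""
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    body_rows = parser.rows[1:]  # the first row is assumed to be the header
    products = [
        Product(name=row[0], price_usd=float(row[1].lstrip("$")))
        for row in body_rows
        if len(row) == 2
    ]
    return "\n\n".join(parser.blocks), products


SAMPLE = """
<html><head><style>p { color: red; }</style></head><body>
<nav><p>Home | Pricing</p></nav>
<h1>Widgets</h1>
<p>Our <b>best</b> sellers.</p>
<table>
<tr><th>Name</th><th>Price</th></tr>
<tr><td>Basic</td><td>$9.50</td></tr>
<tr><td>Pro</td><td>$24</td></tr>
</table>
<script>track();</script>
</body></html>
"""

markdown, products = extract(SAMPLE)
print(markdown)
print(products)
```

The navigation text, stylesheet, and tracking script never reach the output, which is the point of 'LLM-ready content': the model sees two short Markdown blocks and two typed records instead of the full page source.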