Web Data for AI

Once you've fetched a webpage, the harder problem begins: turning messy HTML into something an AI system can actually use. This category defines the vocabulary of that transformation: structured extraction against schemas (Pydantic, JSON Schema), natural-language extraction, HTML-to-Markdown conversion, table and PDF parsing, and the broader idea of 'LLM-ready content.' The terms matter because the quality of the data you feed an LLM largely determines the quality of its output: a clean, semantically structured page generally yields better results than a noisy raw HTML dump. These definitions cover the patterns and formats that bridge the open web and the systems that learn from it.
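To make the HTML-to-Markdown idea concrete, here is a minimal sketch using only Python's standard-library html.parser. The class name MarkdownExtractor, the helper html_to_markdown, and the choice of which tags to skip or keep are illustrative assumptions, not part of any particular product or library; a production converter would also need to handle links, tables, nested lists, and malformed markup.

```python
from html.parser import HTMLParser


class MarkdownExtractor(HTMLParser):
    """Hypothetical helper: keeps headings, paragraphs, and list items."""

    SKIP = {"script", "style", "nav", "footer"}   # assumed page-chrome tags
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    BLOCKS = {"p", "li", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished Markdown blocks, in document order
        self._buf = []          # text fragments of the block being built
        self._prefix = ""       # Markdown marker for the current block
        self._skip_depth = 0    # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCKS:
            self._flush()
            self._prefix = self.HEADINGS.get(tag, "- " if tag == "li" else "")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCKS:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(data)

    def _flush(self):
        # Collapse runs of whitespace so source indentation does not leak through.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(self._prefix + text)
        self._buf = []
        self._prefix = ""


def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text outside a block element
    return "\n\n".join(parser.blocks)


if __name__ == "__main__":
    sample = (
        "<nav>Skip to main content</nav>"
        "<h1>Web Data for AI</h1>"
        "<p>Clean   input\n helps.</p>"
        "<ul><li>Tables</li><li>PDFs</li></ul>"
        "<script>var x = 1;</script>"
    )
    print(html_to_markdown(sample))
```

The sketch drops navigation and script content entirely and keeps only text-bearing blocks, which is the basic trade that 'LLM-ready content' makes: fewer tokens spent on page chrome, more on the material itself.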

5 terms in this category
