What is LLM-ready content?

LLM-ready content is web content that's been cleaned of boilerplate, converted to Markdown or a structured format, chunked at semantic boundaries, and tagged with the metadata an LLM stack actually uses (source URL, fetch date, author, headings). Ready to drop straight into a prompt or vector store with no further preprocessing — the difference between content that costs you tokens and content that costs you tokens *and* answers worse.
A naive scrape of a webpage hands the model an enormous string full of navigation menus, cookie banners, footer copy, ad slots, and the actual content. The model spends tokens on all of it. It also tends to hallucinate — once a banner phrase like "Subscribe for updates" is in the context, the model can use it as if it were part of the article. LLM-ready content is what you get after fixing this: the same content, with the noise stripped, the structure preserved, the metadata attached, and the chunking already done. It's the difference between dumping `requests.get(url).text` into a prompt and feeding the model something a human would actually want to read.
What "ready" actually requires
Five properties separate LLM-ready content from raw scraped HTML:
- Boilerplate removed. Nav, footer, sidebar, cookie banners, ad slots, related-articles widgets — gone. What's left is the article body, the documentation page, the product description.
- Converted to Markdown. HTML tags drop away; the structure they encoded (headings, lists, tables, code blocks, links) survives in a form ten times smaller. See HTML to Markdown conversion.
- Semantic chunking. Long pages split at heading boundaries, not at fixed character counts that cut sentences in half. Each chunk is a coherent unit of meaning.
- Citation metadata. Source URL, fetch timestamp, page title, author when known, language. RAG pipelines need this for attribution; without it the LLM can't tell the user where a fact came from.
- Stable identifiers. A way to refer to a specific chunk later — a hash, a path, an anchor — so updates can be diffed and stale chunks invalidated.
A scraper that returns "the page as text" without any of these is producing input you'll have to clean. A scraper that returns LLM-ready content has done the cleaning for you.
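A minimal sketch of the chunking, metadata, and identifier properties above, assuming Markdown input. The function name and metadata fields are illustrative, not a fixed schema:

```python
import hashlib
import re

def chunk_markdown(markdown: str, source_url: str, fetched_at: str) -> list[dict]:
    """Split Markdown at heading boundaries and attach citation metadata."""
    # Split before every heading line, so each chunk starts at a semantic boundary.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for part in parts:
        body = part.strip()
        if not body:
            continue
        chunks.append({
            "text": body,
            "source_url": source_url,
            "fetched_at": fetched_at,
            # Stable identifier: a hash of the content, so an unchanged chunk
            # keeps its ID across re-fetches and a changed one can be diffed.
            "chunk_id": hashlib.sha256(body.encode()).hexdigest()[:16],
        })
    return chunks

doc = "# Title\n\nIntro paragraph.\n\n## Pricing\n\n$29/month."
chunks = chunk_markdown(doc, "https://example.com/article", "2024-01-01")
```

Each chunk carries its own citation metadata, so a retriever can attribute any fact to a URL and a fetch date without consulting a separate index.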
Why it matters more than people think
Three concrete consequences of unclean input:
- Token cost. A typical article page is 70–90% boilerplate by character count. Sending the raw page costs 3–10× the tokens of sending the cleaned content. At scale this dominates the LLM bill.
- Answer quality. "What's the company's pricing?" against a page full of nav links can return the navigation item "Pricing" instead of the actual price. The boilerplate competes with the content for the model's attention.
- Citation accuracy. "According to the article…" is wrong if the model is quoting the footer. RAG pipelines that track sources by URL alone, not by chunk, get caught here.
LLM-ready preprocessing isn't a polish step. It's the difference between a RAG system that occasionally works and one that consistently does.
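The token math behind the first consequence above, using a rough four-characters-per-token heuristic. Both the heuristic and the page sizes are assumptions for illustration:

```python
def approx_tokens(char_count: int) -> int:
    # Rough heuristic: ~4 characters per token for English prose.
    return char_count // 4

raw_chars = 120_000           # a typical article page, HTML chrome included
clean_chars = raw_chars // 5  # 80% boilerplate (mid-range of 70-90%) stripped

savings = approx_tokens(raw_chars) / approx_tokens(clean_chars)
print(savings)  # 5.0: the raw page costs 5x the tokens of the cleaned content
```

At one page the difference is pennies; at a million pages per month it is the dominant line on the LLM bill.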
Notte's posture
Notte's `client.scrape(...)` returns LLM-ready content by default — Markdown-converted, boilerplate-stripped, with metadata attached. For most use cases there's no extra preprocessing step:

```python
from notte_sdk import NotteClient

client = NotteClient()

# Returns LLM-ready Markdown plus structured metadata.
result = client.scrape(url="https://example.com/article")

print(result.output.markdown)        # cleaned, structured Markdown
print(result.output.metadata.title)  # page title
print(result.output.metadata.url)    # canonical URL
```

For structured extraction (typed records, not Markdown), pass a `response_format` and the same pipeline routes through schema-based extraction instead.
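Where typed records are the goal, the schema route can be sketched like this. The `Product` model, its fields, and the exact shape `response_format` accepts are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class Product:
    # Hypothetical schema: only the fields you want extracted.
    name: str
    price: float

# Assumption: passing the schema switches the pipeline from Markdown
# conversion to schema-based extraction, returning typed records.
# result = client.scrape(url="https://example.com/product", response_format=Product)

sample = Product(name="Widget", price=29.0)  # the shape a typed record would take
```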
Markdown vs. plain text vs. structured records
Three output shapes all called "LLM-ready" in different contexts:
| | Markdown | Plain text | Structured records |
|---|---|---|---|
| Preserves structure | Yes | No | Schema-defined |
| Token cost | Low (small markup overhead) | Lowest | Lowest (only the fields) |
| LLM comprehension | Highest (models trained on Markdown) | Lower (no headings/lists) | Already typed; no extraction needed |
| Best for | RAG, summarization, agent context | Search snippets, brief excerpts | Database / API rows |
| Notte default | `client.scrape(url)` | n/a | `client.scrape(url, response_format=Schema)` |
Markdown is the right default for most LLM-ready use cases. Structured records win when the downstream system wants typed data; plain text is rarely the right choice unless you're squeezing out the last few tokens and can accept the loss of structure.
Common pitfalls
- Treating `BeautifulSoup.get_text()` as "LLM-ready." It strips tags but keeps the boilerplate. Token cost stays high; answer quality stays low.
- Fixed-character chunking that cuts mid-sentence. Always break at headings, paragraphs, or other semantic boundaries.
- Stripping links along with HTML. The links are part of the content's meaning. Keep them as Markdown link syntax.
- No metadata. RAG needs source attribution. Strip the boilerplate, keep the citation.
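A standard-library sketch of the first pitfall: tag stripping alone keeps every word of the page chrome. The toy page below is invented:

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Naive extractor: strips tags, keeps all text, boilerplate included."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

page = (
    "<nav>Home Pricing Blog</nav>"
    "<article>Our plan costs $29/month.</article>"
    "<footer>Subscribe for updates</footer>"
)
parser = TextOnly()
parser.feed(page)
text = " ".join(parser.parts)
# Nav and footer survive alongside the article:
# "Home Pricing Blog Our plan costs $29/month. Subscribe for updates"
```

Every word of the nav and footer reaches the model, which is exactly how "Subscribe for updates" ends up quoted as if it were part of the article.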
Key takeaways
- LLM-ready content is web content cleaned of boilerplate, converted to Markdown, semantically chunked, and tagged with citation metadata — drop-in input for prompts and vector stores.
- The five properties: boilerplate removed, Markdown-formatted, semantically chunked, metadata-tagged, identifier-stable.
- Token cost, answer quality, and citation accuracy all degrade noticeably on raw scraped HTML; LLM-ready preprocessing is essential, not optional.
- Notte's `client.scrape(url)` returns LLM-ready Markdown by default; pair it with structured extraction when the output should be typed records instead of Markdown.