What is HTML to Markdown conversion?

HTML to Markdown conversion turns raw HTML (tags, classes, inline styles, the noise) into clean Markdown that preserves the semantic structure (headings, lists, links, tables, code blocks) without the presentation overhead. It's the single most important preprocessing step in any RAG or web-data pipeline: it cuts token cost by 5–10× and improves LLM comprehension at the same time.
The browser renders HTML for humans; LLMs aren't humans, and HTML isn't a particularly useful format for them. Tags carry presentation information that's invisible to a text model; class names and inline styles are noise; the same content rendered as Markdown is five-to-ten times shorter and noticeably easier for the model to parse. HTML-to-Markdown conversion is the standard preprocessing step that bridges the two formats — and despite sounding like a small thing, it's where most of the practical RAG quality wins live.
What good conversion preserves
A useful conversion drops presentation, keeps meaning. The structures that matter:
- Headings (`<h1>`…`<h6>` → `#`…`######`). The LLM uses these as the document outline; chunking pipelines split on them.
- Lists (`<ul>`, `<ol>` → `-` and `1.`). Item boundaries matter for retrieval and reasoning.
- Links (`<a href="...">` → `[label](url)`). Often part of the meaning: references, citations, related concepts.
- Code blocks (`<pre><code>` → triple-backtick fences). Critical for technical content; the language hint helps too.
- Tables (`<table>` → pipe tables). Structured data the model can scan row by row.
- Emphasis and strong (`<em>`, `<strong>` → `*` and `**`). Subtle, but it's what authors used to mark important phrases.
- Quotes (`<blockquote>` → `>`). Signals "this is someone else's words."
What gets dropped: classes, IDs, inline styles, divs that exist only for layout, scripts, ads, navigation, footers, cookie banners, anything that wasn't actually content.
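To make the mapping concrete, here is a minimal sketch using the `markdownify` library (one converter among many; the sample HTML and the exact output whitespace are illustrative):

```python
# Minimal HTML-to-Markdown sketch with markdownify (pip install markdownify).
from markdownify import markdownify as md

html = """
<h2>Install</h2>
<ul>
  <li>Run <code>pip install notte-sdk</code></li>
  <li>See the <a href="https://docs.example.com">docs</a></li>
</ul>
"""

# heading_style="ATX" gives #-style headings; bullets="-" matches this article.
print(md(html, heading_style="ATX", bullets="-"))
# ## Install
#
# - Run `pip install notte-sdk`
# - See the [docs](https://docs.example.com)
```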
Why the token-count math matters
A typical SaaS page weighs roughly:
| Representation | Token count |
|---|---|
| Raw HTML | 8,000–25,000 |
| `BeautifulSoup.get_text()` output | 1,500–5,000 |
| Boilerplate-stripped Markdown | 500–2,500 |
Different starting points; the ratio holds across most modern pages. The Markdown form costs 5–10× fewer tokens than the raw HTML, but the bigger win is comprehension — the cleaned Markdown lets the model focus on actual content rather than parsing through navigation menus and footer copy. The two effects compound: cheaper and better answers.
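The ratio is easy to verify on your own pages. A sketch with `tiktoken` (the two strings are stand-ins for a real page and its converted form):

```python
# Count tokens before and after conversion; numbers vary per page.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

raw_html = "<div class='nav'>…</div><article><h1>Title</h1><p>Body</p></article>"
markdown = "# Title\n\nBody"

html_tokens = len(enc.encode(raw_html))
md_tokens = len(enc.encode(markdown))
print(f"HTML: {html_tokens} tokens, Markdown: {md_tokens} tokens, "
      f"ratio: {html_tokens / md_tokens:.1f}x")
```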
What "good" conversion looks like in practice
Three properties separate decent conversion from production-grade:
- Boilerplate removal first. The cleanest conversion in the world is wasted if the input includes the consent banner, the nav menu, and the footer. Boilerplate detection (readability-style algorithms, ML-based article extraction, hand-tuned site-specific rules) is part of the pipeline; see the sketch after this list.
- Structure-preserving table conversion. Many converters flatten tables into runs of plain text. Good ones produce pipe tables that survive the round-trip.
- Link-text preservation in the right places. Some links are essential (citations, definitions, products); some are noise ("Click here," "Read more"). The cleanest conversion keeps the meaningful ones.
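For teams assembling their own first pass, a minimal sketch of the boilerplate-first ordering, assuming the `readability-lxml` and `markdownify` packages (the function name is illustrative):

```python
# Boilerplate removal first, format conversion second.
from readability import Document
from markdownify import markdownify as md

def html_to_clean_markdown(raw_html: str) -> str:
    # readability keeps the main article body; nav, footer, and ads fall away
    article_html = Document(raw_html).summary()
    # then translate the surviving structure into Markdown
    return md(article_html, heading_style="ATX", bullets="-")
```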
The good news: most teams don't have to build this themselves anymore. Modern web-scraping APIs return Markdown by default (see *What is a web scraping API?*). The question is which API does it well, not whether you should write your own.
Notte's shape
```python
from notte_sdk import NotteClient

client = NotteClient()

# Default output is LLM-ready Markdown: boilerplate stripped, structure preserved.
result = client.scrape(url="https://example.com/article")
print(result.output.markdown)
# # Article Title
#
# ## Section
#
# - point one
# - point two
#
# Body text with [a link](https://...) and **emphasis**.
```

The default response of `client.scrape()` is Markdown, not raw HTML. For typed records instead of Markdown, supply a `response_format` (Pydantic model or JSON Schema) and the same pipeline routes through schema-based extraction.
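A sketch of that typed path (the `Article` schema is an illustrative assumption, not part of the SDK):

```python
from pydantic import BaseModel

class Article(BaseModel):
    # illustrative fields; define whatever your downstream step needs
    title: str
    summary: str

# Same endpoint, typed output: routes through schema-based extraction.
typed = client.scrape(
    url="https://example.com/article",
    response_format=Article,
)
```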
Conversion vs. extraction
These overlap and get conflated. The clean split:
- HTML to Markdown conversion is format translation: same content, friendlier shape. The output is unstructured Markdown text.
- Structured extraction is schema-fitting: same content, typed records. The output is a Pydantic model or a JSON object.
Pipelines often run them in sequence: HTML → Markdown for storage and RAG retrieval; Markdown → structured records when a typed answer is needed. Either step alone is useful; combining them is the standard pattern.
Common pitfalls
- Treating `BeautifulSoup.get_text()` output as "Markdown." It strips tags and produces plain text, which loses headings, lists, and links. The model gets a wall of text; comprehension drops (see the snippet after this list).
- Conversion before boilerplate removal. Converting a 25,000-token raw page produces 8,000-token raw Markdown. Strip the noise first, convert second.
- Dropping all links. Many pages encode meaning in links — references, citations, definitions. Indiscriminate stripping loses information.
- Hand-rolling the converter. Hundreds of edge cases (nested tables, code blocks inside lists, inline HTML in Markdown). Use an existing library or a managed API.
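The first pitfall in one snippet: `get_text()` flattens exactly the structure Markdown keeps (illustrative HTML; converter output approximate):

```python
from bs4 import BeautifulSoup
from markdownify import markdownify as md

html = "<h2>Pricing</h2><ul><li><a href='/pro'>Pro plan</a></li></ul>"

print(BeautifulSoup(html, "html.parser").get_text())
# PricingPro plan   <- heading, list, and link all gone

print(md(html, heading_style="ATX", bullets="-"))
# ## Pricing
#
# - [Pro plan](/pro)
```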
Key takeaways
- HTML-to-Markdown conversion preserves semantic structure (headings, lists, links, tables, code) while dropping presentation noise; the single most impactful preprocessing step for LLM input.
- Cuts token cost 5–10× and improves comprehension simultaneously.
- Combine with boilerplate removal before conversion for best results; converting raw pages with the noise still in is wasted work.
- Notte's `client.scrape(url)` returns LLM-ready Markdown by default; pair it with structured extraction when the downstream step needs typed records.