
What is HTML to Markdown conversion?

By Lucas Giordano · Co-founder, Notte
TL;DR

HTML to Markdown conversion turns raw HTML, with its tags, classes, inline styles, and other noise, into clean Markdown that preserves the semantic structure (headings, lists, links, tables, code blocks) without the presentation overhead. It is the single most important preprocessing step in any RAG or web-data pipeline: it cuts token cost by 5–10× and improves LLM comprehension at the same time.

What is HTML to Markdown conversion?

The browser renders HTML for humans; LLMs aren't humans, and HTML isn't a particularly useful format for them. Tags carry presentation information that's invisible to a text model; class names and inline styles are noise; the same content rendered as Markdown is five-to-ten times shorter and noticeably easier for the model to parse. HTML-to-Markdown conversion is the standard preprocessing step that bridges the two formats — and despite sounding like a small thing, it's where most of the practical RAG quality wins live.

What good conversion preserves

A useful conversion drops presentation, keeps meaning. The structures that matter:

  • Headings (<h1> through <h6> → # through ######). The LLM uses these as document outline; chunking pipelines split on them.
  • Lists (<ul>, <ol> → - and 1.). Item boundaries matter for retrieval and reasoning.
  • Links (<a href="..."> → [label](url)). Often part of the meaning: references, citations, related concepts.
  • Code blocks (<pre><code> → triple-backtick fences). Critical for technical content; the language hint helps too.
  • Tables (<table> → pipe tables). Structured data the model can scan row by row.
  • Emphasis and strong (<em>, <strong> → * and **). Subtle, but it's what authors used to mark important phrases.
  • Quotes (<blockquote> → >). Signals "this is someone else's words."

What gets dropped: classes, IDs, inline styles, divs that exist only for layout, scripts, ads, navigation, footers, cookie banners, anything that wasn't actually content.
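To make the mapping above concrete, here is a minimal sketch of a converter built on Python's standard-library html.parser. It is illustrative only: the MiniMarkdown class and html_to_markdown function are hypothetical names, and the sketch handles just headings, paragraphs, lists, links, and emphasis while discarding scripts, styles, and all attributes other than href. Tables, code blocks, blockquotes, and malformed markup are out of scope here.

```python
import re
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class MiniMarkdown(HTMLParser):
    """Tiny HTML-to-Markdown sketch (hypothetical helper, not a full converter)."""

    SKIP = {"script", "style"}  # subtrees dropped entirely
    INLINE = {"em": "*", "i": "*", "strong": "**", "b": "**"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # output fragments, joined at the end
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.lists = []      # stack of [list_tag, item_counter]
        self.href = None     # href of the currently open <a>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        if tag in HEADINGS:
            self.out.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag in ("ul", "ol"):
            if not self.lists:  # blank line before a top-level list
                self.out.append("\n")
            self.lists.append([tag, 0])
        elif tag == "li" and self.lists:
            current = self.lists[-1]
            current[1] += 1
            marker = "-" if current[0] == "ul" else f"{current[1]}."
            indent = "  " * (len(self.lists) - 1)
            self.out.append(f"\n{indent}{marker} ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])
        # Every other tag (div, span, ...) and every class/id/style is ignored.

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        if tag in ("ul", "ol") and self.lists:
            self.lists.pop()
            if not self.lists:
                self.out.append("\n")
        elif tag == "a":
            self.out.append(f"]({self.href or ''})")
            self.href = None
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])

    def handle_data(self, data):
        if not self.skip_depth:
            # Collapse whitespace runs, roughly as a browser would.
            self.out.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html_text: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html_text)
    parser.close()
    text = "".join(parser.out)
    text = re.sub(r"[ \t]+\n", "\n", text)  # trim trailing spaces
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line
    return text.strip()


SAMPLE = """
<div class="hero" style="color:red">
  <script>track()</script>
  <h1 id="t">Title</h1>
  <p>Read the <a href="https://example.com/docs" class="btn">docs</a>, they are <strong>short</strong>.</p>
  <ul><li>one</li><li>two</li></ul>
</div>
"""

print(html_to_markdown(SAMPLE))
```

The layout div, the class, id, and style attributes, and the script body all vanish, while the heading, link, emphasis, and list items survive in Markdown form.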

Why the token-count math matters

A typical SaaS page weighs roughly:

| Representation | Token count |
| --- | --- |
| Raw HTML | 8,000–25,000 |
| BeautifulSoup.get_text() | 1,500–5,000 |
| Boilerplate-stripped Markdown | 500–2,500 |

Absolute counts vary from page to page, but the ratio holds across most modern pages. The Markdown form costs 5–10× fewer tokens than the raw HTML, but the bigger win is comprehension: the cleaned Markdown lets the model focus on actual content rather than parsing through navigation menus and footer copy. The two effects compound, giving answers that are both cheaper and better.
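A quick way to sanity-check this on your own pages is to compare rough token estimates before and after conversion. The sketch below assumes a crude rule of thumb of about four characters per token; that figure is an assumption, real tokenizers differ by model and by content, and the HTML snippet is invented for illustration rather than measured from a real page.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate (assumed heuristic, not a real tokenizer)."""
    return max(1, round(len(text) / chars_per_token))


# Invented example: the same content as markup-heavy HTML and as Markdown.
raw_html = (
    '<nav class="top"><a href="/">Home</a><a href="/pricing">Pricing</a></nav>'
    '<div class="card card--lg" style="margin:0 auto;padding:24px">'
    '<h2 class="card__title">Plans</h2>'
    '<ul class="card__list"><li class="card__item">Free tier</li>'
    '<li class="card__item">Team tier</li></ul></div>'
    '<footer class="site-footer">Copyright notice</footer>'
)
markdown = "## Plans\n\n- Free tier\n- Team tier"

html_tokens = estimate_tokens(raw_html)
md_tokens = estimate_tokens(markdown)
print(
    f"HTML ~{html_tokens} tokens, Markdown ~{md_tokens} tokens, "
    f"ratio ~{html_tokens / md_tokens:.1f}x"
)
```

For real measurements, swap estimate_tokens for the tokenizer of the model you actually call; the ratio on a toy snippet like this says nothing about the ratio on your pages.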

What "good" conversion looks like in practice

Three properties separate decent conversion from production-grade:

  • Boilerplate removal first. The cleanest conversion in the world is wasted if the input includes the consent banner, the nav menu, and the footer. Boilerplate detection (readability-style algorithms, ML-based article extraction, hand-tuned site-specific rules) is part of the pipeline.
  • Structure-preserving table conversion. Many converters flatten tables into runs of plain text. Good ones produce pipe tables that survive the round-trip.
  • Link-text preservation in the right places. Some links are essential (citations, definitions, products); some are noise ("Click here," "Read more"). The cleanest conversion keeps the meaningful ones.
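The table property is the easiest of the three to illustrate. Below is a minimal sketch, using only the standard library, of turning a simple HTML table into a pipe table. The TableRows class and table_to_markdown function are hypothetical names, and the sketch assumes the first row is the header and that there are no nested tables, colspan, or rowspan.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collects cell text row by row (sketch: flat tables only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []    # list of rows, each a list of cell strings
        self.cell = None  # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            self.cell = []

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self.cell is not None:
            text = " ".join("".join(self.cell).split())
            # A literal pipe would break the column layout, so escape it.
            self.rows[-1].append(text.replace("|", "\\|"))
            self.cell = None

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)


def table_to_markdown(html_text: str) -> str:
    parser = TableRows()
    parser.feed(html_text)
    parser.close()
    rows = [row for row in parser.rows if row]
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]  # pad ragged rows
    lines = [
        "| " + " | ".join(rows[0]) + " |",         # assumed header row
        "| " + " | ".join(["---"] * width) + " |",  # separator row
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
    return "\n".join(lines)


TABLE = (
    '<table class="pricing"><tr><th>Plan</th><th>Seats</th></tr>'
    "<tr><td>Free</td><td>1</td></tr><tr><td>Team</td><td>10</td></tr></table>"
)
print(table_to_markdown(TABLE))
```

Row and column boundaries survive, which is exactly what a flattening converter loses.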

The good news: most teams don't have to build this themselves anymore. Modern web-scraping APIs return Markdown by default (see the "what is a web scraping API" entry). The question is which API does it well, not whether you should write your own.

Notte's shape

main.py

```python
from notte_sdk import NotteClient

client = NotteClient()

# Default output is LLM-ready Markdown: boilerplate stripped, structure preserved.
result = client.scrape(url="https://example.com/article")

print(result.output.markdown)
# # Article Title
# ## Section
# - point one
# - point two
# Body text with [a link](https://...) and **emphasis**.
```
The default response of client.scrape() is Markdown, not raw HTML. For typed records instead of Markdown, supply a response_format (Pydantic model or JSON Schema) and the same pipeline routes through schema-based extraction.

Conversion vs. extraction

These overlap and get conflated. The clean split:

  • HTML to Markdown conversion is format translation: same content, friendlier shape. The output is unstructured Markdown text.
  • Structured extraction is schema-fitting: same content, typed records. The output is a Pydantic model or a JSON object.

Pipelines often run them in sequence: HTML → Markdown for storage and RAG retrieval; Markdown → structured records when a typed answer is needed. Either step alone is useful; combining them is the standard pattern.
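As a rough illustration of the second hop, the sketch below splits stored Markdown into typed records, one per heading. It is a deterministic stand-in, not schema-based extraction as an LLM-backed service would do it: the Section dataclass and split_sections function are hypothetical names, and any text before the first heading is dropped.

```python
import re
from dataclasses import dataclass


@dataclass
class Section:
    level: int  # 1 for "#", 2 for "##", ...
    title: str
    body: str


HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_sections(markdown: str) -> list[Section]:
    """Split Markdown into heading-delimited records (sketch)."""
    sections: list[Section] = []
    in_fence = False
    for line in markdown.splitlines():
        # Lines inside a fenced code block are never treated as headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            sections.append(Section(len(match.group(1)), match.group(2).strip(), ""))
        elif sections:
            sections[-1].body += line + "\n"
        # Text before the first heading is intentionally dropped here.
    for section in sections:
        section.body = section.body.strip()
    return sections


doc = "# Guide\n\nIntro text.\n\n## Install\n\n```\n# not a heading\n```\n"
for section in split_sections(doc):
    print(section.level, section.title, repr(section.body))
```

The same heading boundaries are what chunking pipelines split on, so a record shape like this tends to fall out of the Markdown step almost for free.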

Common pitfalls

  • BeautifulSoup.get_text() as "Markdown." It strips tags and produces plain text, which loses headings, lists, and links. The model gets a wall of text; comprehension drops.
  • Conversion before boilerplate removal. Converting a 25,000-token raw page produces 8,000-token raw Markdown. Strip the noise first, convert second.
  • Dropping all links. Many pages encode meaning in links — references, citations, definitions. Indiscriminate stripping loses information.
  • Hand-rolling the converter. Hundreds of edge cases (nested tables, code blocks inside lists, inline HTML in Markdown). Use an existing library or a managed API.
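The "strip the noise first, convert second" ordering can be sketched with a small pre-filter that removes obvious non-content subtrees before any converter sees the page. This is a deliberately naive assumption: the NOISE tag list and the NoiseStripper and strip_noise names are hypothetical, and real boilerplate detection (readability-style scoring, ML extraction, site-specific rules) goes well beyond tag names. It will miss, for example, a cookie banner built from plain divs, and an unclosed noise tag would swallow the rest of the page.

```python
import html
from html.parser import HTMLParser

# Assumed list of elements that rarely hold article content.
NOISE = {"script", "style", "noscript", "nav", "footer", "aside", "form"}


class NoiseStripper(HTMLParser):
    """Re-emits HTML while skipping everything inside NOISE elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []   # kept HTML fragments
        self.depth = 0  # nesting depth inside noise elements

    def handle_starttag(self, tag, attrs):
        if tag in NOISE:
            self.depth += 1
        elif not self.depth:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag to track.
        if tag not in NOISE and not self.depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in NOISE:
            self.depth = max(0, self.depth - 1)
        elif not self.depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.depth:
            # Character references were decoded by the parser; re-escape them.
            self.out.append(html.escape(data, quote=False))


def strip_noise(html_text: str) -> str:
    parser = NoiseStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


PAGE = (
    '<nav><a href="/">Home</a></nav>'
    "<main><h1>Post</h1><p>Body &amp; more.</p></main>"
    "<footer>Site footer</footer><script>x()</script>"
)
print(strip_noise(PAGE))
```

Running the converter on the output of strip_noise rather than on the raw page is the whole point: the navigation, footer, and script never reach the Markdown at all.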

Key takeaways

  • HTML-to-Markdown conversion preserves semantic structure (headings, lists, links, tables, code) while dropping presentation noise. It is the single most impactful preprocessing step for LLM input.
  • It cuts token cost 5–10× and improves comprehension at the same time.
  • Combine with boilerplate removal before conversion for best results; converting raw pages with the noise still in is wasted work.
  • Notte's client.scrape(url) returns LLM-ready Markdown by default; pair with structured extraction when the downstream needs typed records.
