What is HTML to Markdown conversion?

HTML to Markdown conversion turns raw HTML (tags, classes, inline styles, the noise) into clean Markdown that preserves the semantic structure (headings, lists, links, tables, code blocks) without the presentation overhead. It's the single most important preprocessing step in any RAG or web-data pipeline: it cuts token cost by 5–10× and improves LLM comprehension at the same time.
The browser renders HTML for humans; LLMs aren't humans, and HTML isn't a particularly useful format for them. Tags carry presentation information that's invisible to a text model; class names and inline styles are noise; the same content rendered as Markdown is five-to-ten times shorter and noticeably easier for the model to parse. HTML-to-Markdown conversion is the standard preprocessing step that bridges the two formats — and despite sounding like a small thing, it's where most of the practical RAG quality wins live.
What good conversion preserves
A useful conversion drops presentation, keeps meaning. The structures that matter:
- Headings (`<h1>`…`<h6>` → `#`…`######`). The LLM uses these as the document outline; chunking pipelines split on them.
- Lists (`<ul>`, `<ol>` → `-` and `1.`). Item boundaries matter for retrieval and reasoning.
- Links (`<a href="...">` → `[label](url)`). Often part of the meaning: references, citations, related concepts.
- Code blocks (`<pre><code>` → triple-backtick fences). Critical for technical content; the language hint helps too.
- Tables (`<table>` → pipe tables). Structured data the model can scan row by row.
- Emphasis and strong (`<em>`, `<strong>` → `*` and `**`). Subtle, but it's what authors used to mark important phrases.
- Quotes (`<blockquote>` → `>`). Signals "this is someone else's words."
What gets dropped: classes, IDs, inline styles, divs that exist only for layout, scripts, ads, navigation, footers, cookie banners, anything that wasn't actually content.
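To make the mapping concrete, here is a minimal sketch using the `markdownify` library (one converter among many; the sample HTML and the exact output whitespace are illustrative):

```python
# Minimal HTML-to-Markdown sketch with markdownify (pip install markdownify).
from markdownify import markdownify as md

html = """
<h2>Install</h2>
<ul>
  <li>Run <code>pip install notte-sdk</code></li>
  <li>See the <a href="https://docs.example.com">docs</a></li>
</ul>
"""

# heading_style="ATX" gives #-style headings; bullets="-" matches this article.
print(md(html, heading_style="ATX", bullets="-"))
# ## Install
#
# - Run `pip install notte-sdk`
# - See the [docs](https://docs.example.com)
```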
Why the token-count math matters
A typical SaaS page weighs roughly:
| Representation | Token count |
|---|---|
| Raw HTML | 8,000–25,000 |
| `BeautifulSoup.get_text()` output | 1,500–5,000 |
| Boilerplate-stripped Markdown | 500–2,500 |
Different starting points; the ratio holds across most modern pages. The Markdown form costs 5–10× fewer tokens than the raw HTML, but the bigger win is comprehension — the cleaned Markdown lets the model focus on actual content rather than parsing through navigation menus and footer copy. The two effects compound: cheaper and better answers.
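The ratio is easy to verify on your own pages. A sketch with `tiktoken` (the two strings are stand-ins for a real page and its converted form):

```python
# Count tokens before and after conversion; numbers vary per page.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

raw_html = "<div class='nav'>…</div><article><h1>Title</h1><p>Body</p></article>"
markdown = "# Title\n\nBody"

html_tokens = len(enc.encode(raw_html))
md_tokens = len(enc.encode(markdown))
print(f"HTML: {html_tokens} tokens, Markdown: {md_tokens} tokens, "
      f"ratio: {html_tokens / md_tokens:.1f}x")
```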
What "good" conversion looks like in practice
Three properties separate decent conversion from production-grade:
- Boilerplate removal first. The cleanest conversion in the world is wasted if the input includes the consent banner, the nav menu, and the footer. Boilerplate detection (readability-style algorithms, ML-based article extraction, hand-tuned site-specific rules) is part of the pipeline; see the sketch after this list.
- Structure-preserving table conversion. Many converters flatten tables into runs of plain text. Good ones produce pipe tables that survive the round-trip.
- Link-text preservation in the right places. Some links are essential (citations, definitions, products); some are noise ("Click here," "Read more"). The cleanest conversion keeps the meaningful ones.
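For teams assembling their own first pass, a minimal sketch of the boilerplate-first ordering, assuming the `readability-lxml` and `markdownify` packages (the function name is illustrative):

```python
# Boilerplate removal first, format conversion second.
from readability import Document
from markdownify import markdownify as md

def html_to_clean_markdown(raw_html: str) -> str:
    # readability keeps the main article body; nav, footer, and ads fall away
    article_html = Document(raw_html).summary()
    # then translate the surviving structure into Markdown
    return md(article_html, heading_style="ATX", bullets="-")
```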
The good news: most teams don't have to build this themselves anymore. Modern web-scraping APIs return Markdown by default (see *What is a web scraping API?*). The question is which API does it well, not whether you should write your own.
Notte's shape
```python
from notte_sdk import NotteClient

client = NotteClient()

# Default output is LLM-ready Markdown: boilerplate stripped, structure preserved.
result = client.scrape(url="https://example.com/article")
print(result.output.markdown)
# # Article Title
#
# ## Section
#
# - point one
# - point two
#
# Body text with [a link](https://...) and **emphasis**.
```

The default response of `client.scrape()` is Markdown, not raw HTML. For typed records instead of Markdown, supply a `response_format` (Pydantic model or JSON Schema) and the same pipeline routes through schema-based extraction.
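A sketch of that typed path (the `Article` schema is an illustrative assumption, not part of the SDK):

```python
from pydantic import BaseModel

class Article(BaseModel):
    # illustrative fields; define whatever your downstream step needs
    title: str
    summary: str

# Same endpoint, typed output: routes through schema-based extraction.
typed = client.scrape(
    url="https://example.com/article",
    response_format=Article,
)
```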
Conversion vs. extraction
These overlap and get conflated. The clean split:
- HTML to Markdown conversion is format translation: same content, friendlier shape. The output is unstructured Markdown text.
- Structured extraction is schema-fitting: same content, typed records. The output is a Pydantic model or a JSON object.
Pipelines often run them in sequence: HTML → Markdown for storage and RAG retrieval; Markdown → structured records when a typed answer is needed. Either step alone is useful; combining them is the standard pattern.
Common pitfalls
- Treating `BeautifulSoup.get_text()` output as "Markdown." It strips tags and produces plain text, which loses headings, lists, and links. The model gets a wall of text; comprehension drops (see the snippet after this list).
- Conversion before boilerplate removal. Converting a 25,000-token raw page produces 8,000-token raw Markdown. Strip the noise first, convert second.
- Dropping all links. Many pages encode meaning in links — references, citations, definitions. Indiscriminate stripping loses information.
- Hand-rolling the converter. Hundreds of edge cases (nested tables, code blocks inside lists, inline HTML in Markdown). Use an existing library or a managed API.
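The first pitfall in one snippet: `get_text()` flattens exactly the structure Markdown keeps (illustrative HTML; converter output approximate):

```python
from bs4 import BeautifulSoup
from markdownify import markdownify as md

html = "<h2>Pricing</h2><ul><li><a href='/pro'>Pro plan</a></li></ul>"

print(BeautifulSoup(html, "html.parser").get_text())
# PricingPro plan   <- heading, list, and link all gone

print(md(html, heading_style="ATX", bullets="-"))
# ## Pricing
#
# - [Pro plan](/pro)
```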
Key takeaways
- HTML-to-Markdown conversion preserves semantic structure (headings, lists, links, tables, code) while dropping presentation noise; the single most impactful preprocessing step for LLM input.
- Cuts token cost 5–10× and improves comprehension simultaneously.
- Combine with boilerplate removal before conversion for best results; converting raw pages with the noise still in is wasted work.
- Notte's `client.scrape(url)` returns LLM-ready Markdown by default; pair it with structured extraction when the downstream step needs typed records.