What is scraping for RAG?
By Lucas Giordano · Co-founder, Notte
TL;DR

Scraping for RAG (retrieval-augmented generation) is scraping with one specific downstream consumer in mind: a vector store that an LLM will search and ground its answers on. The job isn't 'extract data into a database'; it's 'produce LLM-ready chunks of content with sources attached, in a shape the retrieval-and-grounding loop can use.' Different goal, different output requirements, different pipeline.

What is scraping for RAG?

Scraping for analytics — feeding a database, a dashboard, a price-monitoring system — has straightforward goals: get the right values into the right columns. Scraping for retrieval-augmented generation has a different goal entirely. The output is going into a vector store; an LLM will search that store at query time, retrieve some chunks, and ground its answers on them. What "good output" means is different. The values matter less than the form: clean prose, semantic chunks, preserved structure, citation metadata, stable identifiers. Scraping for RAG is the discipline of producing that shape, on every page, at scale.

What RAG actually consumes

A RAG pipeline at runtime takes a user query, embeds it, searches the vector store for the closest-matching chunks, and feeds those chunks plus the query back to an LLM. The chunks have to be:

  • Self-contained. A chunk should make sense out of context. "It supports up to 50 concurrent users" is useless without "it" being something specific.
  • Semantically coherent. A chunk should be one idea or one section, not a fragment cut mid-sentence by a fixed-character splitter.
  • Citable. Every chunk needs metadata: source URL, page title, fetch date, heading hierarchy. The LLM uses these to attribute facts.
  • Right-sized. Too small, and the chunk lacks context. Too large, and one chunk dilutes the signal of multiple ideas. Typical range: 200–800 tokens depending on the embedding model.

Output that hits these properties is retrieval-ready. Output that doesn't makes the LLM either hallucinate (no good source matched) or confuse itself (the matched chunk was incoherent).
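
Put together, a retrieval-ready chunk might look like this sketch (the field names are illustrative, not a fixed schema):

chunk = {
    # Self-contained and coherent: one idea, with its subject named.
    "text": "## Pricing\n\nThe Pro plan is $199/month and supports up to 50 concurrent users.",
    # Citable: chunk-level metadata the LLM can attribute facts to.
    "metadata": {
        "source_url": "https://docs.example.com/pricing",
        "title": "Pricing",
        "headings": ["Docs", "Pricing"],
        "fetched_at": "2025-01-01",
    },
}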

How RAG-targeted scraping differs from regular scraping

| | Regular scraping (analytics) | Scraping for RAG |
| --- | --- | --- |
| Output shape | Typed records (rows in a table) | Markdown chunks + metadata |
| Granularity | Fields | Content |
| Schema | Defined upfront | None; output is mostly prose |
| Chunking required | No | Yes (semantic boundaries) |
| Citation metadata | Optional | Mandatory |
| Used for | Database, dashboard, monitoring | Vector store, LLM retrieval |
| Consumer | Other code | Other LLM calls |

Regular scraping wants {"price_usd": 199, "in_stock": true}. RAG scraping wants ## Pricing\n\nThe Pro plan is $199/month and includes ... with {"source_url": "...", "section": "Pricing"} attached. Different problems; same scraping infrastructure underneath.

The Notte SDK shape

The default response of client.scrape(...) is already RAG-friendly: LLM-ready Markdown with metadata. Chunking is the one step that often needs to be added, depending on which embedding/vector-store stack you're using:

main.py
from notte_sdk import NotteClient

client = NotteClient()

# 1. Scrape — returns LLM-ready Markdown + metadata.
result = client.scrape(url="https://docs.example.com/pricing")

# 2. Chunk on heading boundaries (typical for documentation).
def split_by_heading(markdown: str) -> list[str]:
    chunks = []
    current = []
    for line in markdown.splitlines():
        if line.startswith("## ") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

chunks = split_by_heading(result.output.markdown)

# 3. Embed + write to your vector store, with metadata attached per chunk.
for chunk in chunks:
    metadata = {
        "source_url": result.output.metadata.url,
        "title": result.output.metadata.title,
        "fetched_at": result.output.metadata.fetched_at,
    }
    # vector_store.add(text=chunk, metadata=metadata)

The scraping side returns LLM-ready content; the RAG side decides how to chunk and what to attach. The boundary is intentional — different RAG stacks chunk differently, and the scraping API stays neutral.
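
To make step 3 concrete, here's one possible ending for the pipeline above, using Chroma as the vector store. The chromadb calls are an assumption about your stack; any store that accepts per-item metadata follows the same shape:

import chromadb

chroma = chromadb.Client()
collection = chroma.get_or_create_collection("docs")

for i, chunk in enumerate(chunks):
    collection.add(
        ids=[f"{result.output.metadata.url}#chunk-{i}"],
        documents=[chunk],  # Chroma embeds with its default embedding function
        metadatas=[{
            "source_url": result.output.metadata.url,
            "title": result.output.metadata.title,
            "fetched_at": result.output.metadata.fetched_at,
        }],
    )

Positional IDs like these are the simplest option, but they aren't stable across re-scrapes; see the stable-ID pitfall below.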

Patterns for production RAG ingestion

Three patterns most teams converge on:

  • Documentation-style content → chunk by heading. Each ## Section becomes one chunk. Works for docs sites, how-to guides, knowledge bases.
  • Article-style content → chunk by paragraph or by a fixed-token window with overlap (a sketch follows at the end of this section). Long articles where the structure is less hierarchical.
  • Reference content (definitions, glossaries, FAQs) → chunk per item. One Q-and-A is one chunk, including the question.

For all three, the consistent pattern is: scrape returns Markdown + metadata; chunking happens in the ingestion pipeline; chunks are tagged with the source metadata at the chunk level (not just the document level).
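
For the article-style pattern, here is a minimal sketch of fixed-token windowing with overlap. Word counts stand in for tokens; in practice, count with your embedding model's tokenizer:

def split_with_overlap(text: str, window: int = 400, overlap: int = 50) -> list[str]:
    # Slide a window of `window` words, stepping forward by `window - overlap`
    # so consecutive chunks share `overlap` words. Assumes window > overlap.
    words = text.split()
    chunks = []
    step = window - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break
    return chunks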

Common pitfalls

  • Chunking before cleaning. Splitting raw HTML produces chunks full of nav menu and footer text. Strip boilerplate first; chunk second.
  • Fixed-character chunking that cuts mid-sentence. Always break at semantic boundaries (heading, paragraph) when possible.
  • No citation metadata at the chunk level. Document-level metadata isn't enough — RAG needs to attribute which chunk, not just which document, especially for grounding.
  • Re-embedding everything on every update. Stable chunk IDs (a hash of the normalized chunk text) let you re-embed only what changed; see the sketch after this list.
  • Treating analytics scraping output as RAG-ready. It almost never is. Different output shape; different downstream.
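
One way to get stable chunk IDs is to hash the normalized chunk text. A sketch — here "normalized" means whitespace-collapsed and lowercased, but pick a normalization that fits your content:

import hashlib
import re

def chunk_id(text: str) -> str:
    # Normalize so trivial whitespace or case changes don't change the ID.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]

# Re-embed only chunks whose IDs aren't already in the store.
existing_ids: set[str] = set()  # in practice, load these from your vector store
new_chunks = [c for c in chunks if chunk_id(c) not in existing_ids]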

Key takeaways

  • Scraping for RAG produces LLM-ready chunks plus citation metadata for a vector store, not typed records for a database.
  • The output requirements are content-level (Markdown, semantic chunks, metadata), not field-level.
  • Notte's client.scrape(...) returns LLM-ready Markdown by default; chunking happens in the RAG ingestion pipeline.
  • Three chunking patterns by content shape: docs (heading), articles (paragraph + overlap), reference content (per item).

Build your AI agent on the open web with Notte

Cloud browsers, agent identities, and the Anything API — everything you need to ship reliable browser agents in production.