What is scraping for RAG?
By Lucas Giordano · Co-founder, Notte
TL;DR

Scraping for RAG (retrieval-augmented generation) is scraping with one specific downstream consumer in mind: a vector store that an LLM will search and ground its answers on. The job isn't 'extract data into a database'; it's 'produce LLM-ready chunks of content with sources attached, in a shape the retrieval-and-grounding loop can use.' Different goal, different output requirements, different pipeline.

What is scraping for RAG?

Scraping for analytics — feeding a database, a dashboard, a price-monitoring system — has straightforward goals: get the right values into the right columns. Scraping for retrieval-augmented generation has a different goal entirely. The output is going into a vector store; an LLM will search that store at query time, retrieve some chunks, and ground its answers on them. What "good output" means is different. The values matter less than the form: clean prose, semantic chunks, preserved structure, citation metadata, stable identifiers. Scraping for RAG is the discipline of producing that shape, on every page, at scale.

What RAG actually consumes

A RAG pipeline at runtime takes a user query, embeds it, searches the vector store for the closest-matching chunks, and feeds those chunks plus the query back to an LLM. The chunks have to be:

  • Self-contained. A chunk should make sense out of context. "It supports up to 50 concurrent users" is useless without "it" being something specific.
  • Semantically coherent. A chunk should be one idea or one section, not a fragment cut mid-sentence by a fixed-character splitter.
  • Citable. Every chunk needs metadata: source URL, page title, fetch date, heading hierarchy. The LLM uses these to attribute facts.
  • Right-sized. Too small, and the chunk lacks context. Too large, and one chunk dilutes the signal of multiple ideas. Typical range: 200–800 tokens depending on the embedding model.

Output that hits these properties is retrieval-ready. Output that doesn't makes the LLM either hallucinate (no good source matched) or confuse itself (the matched chunk was incoherent).
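
Put together, a retrieval-ready chunk might look like this sketch (the field names are illustrative, not a fixed schema):

chunk = {
    # Self-contained and coherent: one idea, with its subject named.
    "text": "## Pricing\n\nThe Pro plan is $199/month and supports up to 50 concurrent users.",
    # Citable: chunk-level metadata the LLM can attribute facts to.
    "metadata": {
        "source_url": "https://docs.example.com/pricing",
        "title": "Pricing",
        "headings": ["Docs", "Pricing"],
        "fetched_at": "2025-01-01",
    },
}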

How RAG-targeted scraping differs from regular scraping

| | Regular scraping (analytics) | Scraping for RAG |
| --- | --- | --- |
| Output shape | Typed records (rows in a table) | Markdown chunks + metadata |
| Granularity | Fields | Content |
| Schema | Defined upfront | None; output is mostly prose |
| Chunking required | No | Yes (semantic boundaries) |
| Citation metadata | Optional | Mandatory |
| Used for | Database, dashboard, monitoring | Vector store, LLM retrieval |
| Consumer | Other code | Other LLM calls |

Regular scraping wants {"price_usd": 199, "in_stock": true}. RAG scraping wants ## Pricing\n\nThe Pro plan is $199/month and includes ... with {"source_url": "...", "section": "Pricing"} attached. Different problems; same scraping infrastructure underneath.

The Notte SDK shape

The default response of client.scrape(...) is already RAG-friendly: LLM-ready Markdown with metadata. Chunking is the one step that often needs to be added, depending on which embedding/vector-store stack you're using:

main.py
from notte_sdk import NotteClient

client = NotteClient()

# 1. Scrape — returns LLM-ready Markdown + metadata.
result = client.scrape(url="https://docs.example.com/pricing")

# 2. Chunk on heading boundaries (typical for documentation).
def split_by_heading(markdown: str) -> list[str]:
    chunks = []
    current = []
    for line in markdown.splitlines():
        if line.startswith("## ") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

chunks = split_by_heading(result.output.markdown)

# 3. Embed + write to your vector store, with metadata attached per chunk.
for chunk in chunks:
    metadata = {
        "source_url": result.output.metadata.url,
        "title": result.output.metadata.title,
        "fetched_at": result.output.metadata.fetched_at,
    }
    # vector_store.add(text=chunk, metadata=metadata)

The scraping side returns LLM-ready content; the RAG side decides how to chunk and what to attach. The boundary is intentional — different RAG stacks chunk differently, and the scraping API stays neutral.
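
To make step 3 concrete, here's one possible ending for the pipeline above, using Chroma as the vector store. The chromadb calls are an assumption about your stack; any store that accepts per-item metadata follows the same shape:

import chromadb

chroma = chromadb.Client()
collection = chroma.get_or_create_collection("docs")

for i, chunk in enumerate(chunks):
    collection.add(
        ids=[f"{result.output.metadata.url}#chunk-{i}"],
        documents=[chunk],  # Chroma embeds with its default embedding function
        metadatas=[{
            "source_url": result.output.metadata.url,
            "title": result.output.metadata.title,
            "fetched_at": result.output.metadata.fetched_at,
        }],
    )

Positional IDs like these are the simplest option, but they aren't stable across re-scrapes; see the stable-ID pitfall below.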

Patterns for production RAG ingestion

Three patterns most teams converge on:

  • Documentation-style content → chunk by heading. Each ## Section becomes one chunk. Works for docs sites, how-to guides, knowledge bases.
  • Article-style content → chunk by paragraph or by a fixed-token window with overlap (a sketch follows at the end of this section). Long articles where the structure is less hierarchical.
  • Reference content (definitions, glossaries, FAQs) → chunk per item. One Q-and-A is one chunk, including the question.

For all three, the consistent pattern is: scrape returns Markdown + metadata; chunking happens in the ingestion pipeline; chunks are tagged with the source metadata at the chunk level (not just the document level).
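
For the article-style pattern, here is a minimal sketch of fixed-token windowing with overlap. Word counts stand in for tokens; in practice, count with your embedding model's tokenizer:

def split_with_overlap(text: str, window: int = 400, overlap: int = 50) -> list[str]:
    # Slide a window of `window` words, stepping forward by `window - overlap`
    # so consecutive chunks share `overlap` words. Assumes window > overlap.
    words = text.split()
    chunks = []
    step = window - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break
    return chunks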

Common pitfalls

  • Chunking before cleaning. Splitting raw HTML produces chunks full of nav menu and footer text. Strip boilerplate first; chunk second.
  • Fixed-character chunking that cuts mid-sentence. Always break at semantic boundaries (heading, paragraph) when possible.
  • No citation metadata at the chunk level. Document-level metadata isn't enough — RAG needs to attribute which chunk, not just which document, especially for grounding.
  • Re-embedding everything on every update. Stable chunk IDs (a hash of the normalized chunk text) let you re-embed only what changed; see the sketch after this list.
  • Treating analytics scraping output as RAG-ready. It almost never is. Different output shape; different downstream.
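
One way to get stable chunk IDs is to hash the normalized chunk text. A sketch — here "normalized" means whitespace-collapsed and lowercased, but pick a normalization that fits your content:

import hashlib
import re

def chunk_id(text: str) -> str:
    # Normalize so trivial whitespace or case changes don't change the ID.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]

# Re-embed only chunks whose IDs aren't already in the store.
existing_ids: set[str] = set()  # in practice, load these from your vector store
new_chunks = [c for c in chunks if chunk_id(c) not in existing_ids]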

Key takeaways

  • Scraping for RAG produces LLM-ready chunks plus citation metadata for a vector store, not typed records for a database.
  • The output requirements are content-level (Markdown, semantic chunks, metadata), not field-level.
  • Notte's client.scrape(...) returns LLM-ready Markdown by default; chunking happens in the RAG ingestion pipeline.
  • Three chunking patterns by content shape: docs (heading), articles (paragraph + overlap), reference content (per item).

Build your AI agent on the open web with Notte

Cloud browsers, agent identities, and the Anything API — everything you need to ship reliable browser agents in production.