What is scraping for RAG?

Scraping for RAG (retrieval-augmented generation) is scraping with one specific downstream consumer in mind: a vector store that an LLM will search and ground its answers on. The job isn't 'extract data into a database'; it's 'produce LLM-ready chunks of content with sources attached, in a shape the retrieval-and-grounding loop can use.' Different goal, different output requirements, different pipeline.
Scraping for analytics — feeding a database, a dashboard, a price-monitoring system — has straightforward goals: get the right values into the right columns. Scraping for retrieval-augmented generation has a different goal entirely. The output is going into a vector store; an LLM will search that store at query time, retrieve some chunks, and ground its answers on them. What "good output" means is different. The values matter less than the form: clean prose, semantic chunks, preserved structure, citation metadata, stable identifiers. Scraping for RAG is the discipline of producing that shape, on every page, at scale.
What RAG actually consumes
A RAG pipeline at runtime takes a user query, embeds it, searches the vector store for the closest-matching chunks, and feeds those chunks plus the query back to an LLM. The chunks have to be:
- Self-contained. A chunk should make sense out of context. "It supports up to 50 concurrent users" is useless without "it" being something specific.
- Semantically coherent. A chunk should be one idea or one section, not a fragment cut mid-sentence by a fixed-character splitter.
- Citable. Every chunk needs metadata: source URL, page title, fetch date, heading hierarchy. The LLM uses these to attribute facts.
- Right-sized. Too small, and the chunk lacks context. Too large, and one chunk dilutes the signal of multiple ideas. Typical range: 200–800 tokens depending on the embedding model.
Output that hits these properties is retrieval-ready. Output that doesn't makes the LLM either hallucinate (no good source matched) or confuse itself (the matched chunk was incoherent).
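To make the target concrete, here is what one retrieval-ready chunk might look like as a plain Python record. The field names and values are illustrative, not a required schema; the point is that each chunk carries its own citation metadata and names its own subject:

```python
# One retrieval-ready chunk: self-contained text, one idea, chunk-level citation metadata.
chunk = {
    "text": (
        "## Concurrency limits\n\n"
        "The Pro plan supports up to 50 concurrent users per workspace. "
        "Higher limits are available on the Enterprise plan."
    ),
    "metadata": {
        "source_url": "https://docs.example.com/limits",
        "title": "Plan limits",
        "headings": ["Plan limits", "Concurrency limits"],
        "fetched_at": "2024-05-01",
    },
}
```

Because the text says "The Pro plan" rather than "it", the chunk still reads correctly when retrieved on its own.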
How RAG-targeted scraping differs from regular scraping
| | Regular scraping (analytics) | Scraping for RAG |
|---|---|---|
| Output shape | Typed records (rows in a table) | Markdown chunks + metadata |
| Extraction level | Field-level values | Content-level prose |
| Schema | Defined upfront | Minimal; output is mostly prose |
| Chunking required | No | Yes — semantic boundaries |
| Citation metadata | Optional | Mandatory |
| Used for | Database, dashboard, monitoring | Vector store, LLM retrieval |
| Consumer | Other code | Other LLM calls |
Regular scraping wants `{"price_usd": 199, "in_stock": true}`. RAG scraping wants `## Pricing\n\nThe Pro plan is $199/month and includes ...` with `{"source_url": "...", "section": "Pricing"}` attached. Different problems; same scraping infrastructure underneath.
The Notte SDK shape
The default response of `client.scrape(...)` is already RAG-friendly: LLM-ready Markdown with metadata. Chunking is the one step that often needs to be added, depending on which embedding/vector-store stack you're using:
```python
from notte_sdk import NotteClient

client = NotteClient()

# 1. Scrape — returns LLM-ready Markdown + metadata.
result = client.scrape(url="https://docs.example.com/pricing")

# 2. Chunk on heading boundaries (typical for documentation).
def split_by_heading(markdown: str) -> list[str]:
    chunks = []
    current = []
    for line in markdown.splitlines():
        if line.startswith("## ") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

chunks = split_by_heading(result.output.markdown)

# 3. Embed + write to your vector store, with metadata attached per chunk.
for chunk in chunks:
    metadata = {
        "source_url": result.output.metadata.url,
        "title": result.output.metadata.title,
        "fetched_at": result.output.metadata.fetched_at,
    }
    # vector_store.add(text=chunk, metadata=metadata)
```

The scraping side returns LLM-ready content; the RAG side decides how to chunk and what to attach. The boundary is intentional — different RAG stacks chunk differently, and the scraping API stays neutral.
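As one concrete version of the write in step 3, here is a minimal sketch assuming Chroma as the vector store (its default embedding function handles the embedding). The collection name and the hash-based chunk IDs are illustrative choices, not part of the Notte SDK; it reuses `chunks` and `result` from the snippet above.

```python
import hashlib

import chromadb  # assumption: Chroma as the store; any vector store with a similar add() works

collection = chromadb.Client().get_or_create_collection("docs")

doc_meta = {
    "source_url": result.output.metadata.url,
    "title": result.output.metadata.title,
    "fetched_at": result.output.metadata.fetched_at,
}

collection.add(
    # Stable, content-derived chunk IDs (hash of the chunk text).
    ids=[hashlib.sha256(c.encode("utf-8")).hexdigest() for c in chunks],
    documents=chunks,
    # Chunk-level metadata: the document fields plus each chunk's position.
    metadatas=[doc_meta | {"chunk_index": i} for i in range(len(chunks))],
)
```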
Patterns for production RAG ingestion
Three patterns most teams converge on:
- Documentation-style content → chunk by heading. Each `## Section` becomes one chunk. Works for docs sites, how-to guides, knowledge bases.
- Article-style content → chunk by paragraph or fixed-token window with overlap (sketched below). Long articles where the structure is less hierarchical.
- Reference content (definitions, glossaries, FAQs) → chunk per item. One Q-and-A is one chunk, including the question.
For all three, the consistent pattern is: scrape returns Markdown + metadata; chunking happens in the ingestion pipeline; chunks are tagged with the source metadata at the chunk level (not just the document level).
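For the article pattern, here is a minimal sketch of fixed-token windows with overlap. The window and overlap sizes are illustrative, and tokens are approximated by whitespace-separated words; swap in your embedding model's tokenizer for real counts.

```python
def split_by_window(text: str, window: int = 400, overlap: int = 80) -> list[str]:
    """Fixed-size windows with overlap, counted in whitespace-separated words."""
    words = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = " ".join(words[start : start + window])
        if piece:
            chunks.append(piece)
    return chunks
```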
Common pitfalls
- Chunking before cleaning. Splitting raw HTML produces chunks full of nav menu and footer text. Strip boilerplate first; chunk second.
- Fixed-character chunking that cuts mid-sentence. Always break at semantic boundaries (heading, paragraph) when possible.
- No citation metadata at the chunk level. Document-level metadata isn't enough — RAG needs to attribute which chunk, not just which document, especially for grounding.
- Re-embedding everything on every update. Stable chunk IDs (hash of normalized chunk text) let you re-embed only what changed (sketched after this list).
- Treating analytics scraping output as RAG-ready. It almost never is. Different output shape; different downstream.
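On the re-embedding pitfall, a minimal, store-agnostic sketch of stable chunk IDs: derive each ID from the normalized chunk text, then embed only chunks whose ID you haven't stored yet. The `already_embedded` set is a stand-in for whatever your pipeline actually tracks.

```python
import hashlib

def chunk_id(chunk: str) -> str:
    # Normalize whitespace so trivial reformatting doesn't produce a new ID.
    normalized = " ".join(chunk.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def chunks_to_embed(chunks: list[str], already_embedded: set[str]) -> list[tuple[str, str]]:
    """Return (id, text) pairs only for chunks that are new or whose content changed."""
    return [(cid, c) for c in chunks if (cid := chunk_id(c)) not in already_embedded]
```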
Key takeaways
- Scraping for RAG produces LLM-ready chunks plus citation metadata for a vector store, not typed records for a database.
- The output requirements are content-level (Markdown, semantic chunks, metadata), not field-level.
- Notte's `client.scrape(...)` returns LLM-ready Markdown by default; chunking happens in the RAG ingestion pipeline.
- Three chunking patterns by content shape: docs (heading), articles (paragraph + overlap), reference content (per item).