
What is dynamic content scraping?

By Lucas Giordano · Co-founder, Notte
TL;DR

Dynamic content scraping is the discipline of extracting data from pages where the meaningful content shows up *after* the initial HTML response — single-page apps, infinite-scroll feeds, lazy-loaded sections, content fetched via XHR after first paint. The fix isn't a smarter parser; it's executing the JavaScript in a real browser and waiting until the content actually appears before extracting.

What is dynamic content scraping?

A requests.get(url) returns whatever the server sent in the initial HTML response. On a 2010-era server-rendered site, that's the full content. On a 2026 SaaS site, that's an empty React shell with <div id="root"></div> and a few hundred KB of JavaScript that builds the actual page after page load. Naive scraping returns the empty shell. Dynamic content scraping is the bundle of techniques for getting the post-render state — execute the JavaScript, wait for the content to appear, then extract.
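
To see the empty-shell case concretely, one quick heuristic is to count the visible text a plain HTML fetch actually carries. Below is a minimal sketch using only the standard library; the 200-character threshold and the sample markup are arbitrary assumptions for the example, not a production rule:

```python
from html.parser import HTMLParser

class _TextCounter(HTMLParser):
    """Counts visible text characters, ignoring <script> and <style> bodies."""
    def __init__(self):
        super().__init__()
        self.chars = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chars += len(data.strip())

def looks_like_spa_shell(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: markup carrying almost no visible text is probably a
    client-rendered shell that needs a real browser to scrape."""
    counter = _TextCounter()
    counter.feed(html)
    return counter.chars < min_text_chars

shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
article = "<html><body><article>" + "Server-rendered paragraph text. " * 20 + "</article></body></html>"
print(looks_like_spa_shell(shell))    # → True
print(looks_like_spa_shell(article))  # → False
```

A check like this is useful as a pre-flight step: only pages that fail it need to be routed through the full browser pipeline.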

Why "naive scraping" fails on modern sites

Three architectural patterns of modern web apps that break HTML-only scrapers:

  • Single-page apps (SPAs). React, Vue, Svelte, Solid. The server returns a near-empty HTML shell; the client builds the DOM from JavaScript. The HTML the scraper sees is approximately empty.
  • Lazy-loaded sections. Content below the fold doesn't load until the page is scrolled. Scrapers that grab the initial DOM see only the visible-on-load section.
  • XHR-fetched content. The page makes background API calls after first paint and inserts the results. The data you want lives in those XHR responses, not the original HTML.

In all three, the scraper needs to be the browser — execute the JavaScript, wait for the right state, then read.

What "dynamic content scraping" actually requires

Four capabilities that separate static-page scraping from dynamic:

  1. JavaScript execution. A real browser (or a JS-runtime equivalent — see JavaScript rendering for web scraping) running the page's JavaScript before extraction.
  2. Wait strategies. A way to know when the content is ready. Three common signals: a specific element appearing, a network-idle state, an explicit time delay (last resort — fragile).
  3. Scroll-and-wait for lazy-loaded sections. Programmatically scrolling to trigger lazy-load handlers, waiting for new content to render.
  4. XHR interception (sometimes). When the data lives in API responses rather than rendered DOM, capturing the network calls is more reliable than re-reading the rendered output.
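
The XHR-interception idea in point 4 can be sketched without committing to a particular browser library. Assume the automation layer has recorded each response as a `(url, content_type, body)` tuple — a hypothetical log format invented for this example — and filter for the JSON API calls:

```python
import json

# Hypothetical captured network log, as a browser automation layer might
# record it during page load. URLs and payloads are invented for the example.
captured = [
    ("https://shop.example.com/app.js", "application/javascript", "..."),
    ("https://shop.example.com/api/products?page=1", "application/json",
     '{"items": [{"name": "Widget", "price_eur": 19.99}]}'),
    ("https://shop.example.com/styles.css", "text/css", "..."),
]

def json_api_responses(log, path_hint="/api/"):
    """Keep only JSON responses whose URL looks like an API endpoint."""
    for url, ctype, body in log:
        if "application/json" in ctype and path_hint in url:
            yield url, json.loads(body)

for url, payload in json_api_responses(captured):
    print(url, payload["items"][0]["name"])
    # prints: https://shop.example.com/api/products?page=1 Widget
```

The payoff is that the API response is already structured data — no selector maintenance, no re-parsing of rendered markup.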

Wait strategies, ranked

The single biggest source of flaky dynamic scraping is bad waits. Three patterns, weakest to strongest:

  • Fixed time delay (time.sleep(5)). Almost always wrong — too short and the content isn't there; too long and you waste seconds per page. Use only as a last resort.
  • Network-idle wait. Wait until the page has gone quiet on the network for some period (e.g., 500 ms with no in-flight requests). Catches most XHR-rendered content; misses content that re-fetches periodically.
  • Element-visible wait. Wait for a specific selector to be present, visible, or to have content. The most reliable when you know what you're waiting for; requires knowing the target's DOM structure.

Real production pipelines often combine the two: network-idle as the default, element-visible for the specific cases where it matters.
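
All three strategies reduce to the same primitive: poll a readiness check until it passes or a deadline expires. A minimal, browser-agnostic sketch of that primitive — the `check` callable and the toy "DOM" below are stand-ins for a real DOM query, not any particular library's API:

```python
import time

def wait_for(check, timeout=10.0, interval=0.25):
    """Poll check() until it returns a truthy value or `timeout` seconds
    elapse. Returns the truthy value, or raises TimeoutError. This is the
    core of an element-visible wait: in a real scraper, `check` would query
    the live DOM for the target selector."""
    deadline = time.monotonic() + timeout
    while True:
        result = check()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1fs" % timeout)
        time.sleep(interval)

# Toy usage: a "DOM" that only populates after a short delay.
started = time.monotonic()
fake_dom = lambda: "loaded" if time.monotonic() - started > 0.5 else None
print(wait_for(fake_dom, timeout=5.0))  # → loaded
```

Note the asymmetry versus a fixed delay: the poll returns as soon as the content exists (no wasted seconds) and fails loudly with a timeout when it never does (no silent empty scrapes).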

The Notte SDK shape

The agent layer handles wait strategy implicitly — it observes the page, decides whether the content is present, and waits if not before proceeding. For ad-hoc dynamic scraping, the session layer exposes the primitives directly:

main.py
from notte_sdk import NotteClient
from pydantic import BaseModel

client = NotteClient()

class Product(BaseModel):
    name: str
    price_eur: float
    review_count: int

# The default scrape handles the dynamic-content case — JS executes, content
# renders, then extraction runs. No explicit wait config needed for most sites.
result = client.scrape(
    url="https://spa-shop.example.com/product/widget",
    response_format=Product,
)

For pages that need explicit interaction — scroll, click "Load more," wait for a specific element — drop down to the session API:

main.py
from notte_sdk import NotteClient

client = NotteClient()

with client.Session() as session:
    session.execute(type="goto", url="https://feed.example.com")
    # Scroll a few times to trigger lazy load, then observe.
    for _ in range(5):
        session.execute(type="scroll", direction="down", amount=1000)
    snapshot = session.observe()
    # snapshot.markdown now includes the lazy-loaded content

What stays static (and what you can scrape without all this)

Three categories of page where dynamic scraping is overkill:

  • Government and academic open-data portals. Mostly server-rendered HTML; requests + BeautifulSoup is fine.
  • News sites and most blogs. Server-rendered for SEO reasons. Dynamic content is in the comments and recommendations, not the article.
  • Documentation sites built with static-site generators. Hugo, Jekyll, Docusaurus output. Static at request time.
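
For these server-rendered categories the data really is in the first response, so standard-library parsing is enough. A minimal sketch — the sample markup is invented for the example; in practice you would feed the parser the body of a plain HTTP GET (or use BeautifulSoup for anything less trivial):

```python
from html.parser import HTMLParser

# Sample server-rendered markup, as a static-site generator might emit it.
page = """
<html><body>
  <article><h1>Budget report 2026</h1>
  <p>The full text is in the HTML itself.</p></article>
</body></html>
"""

class HeadlineParser(HTMLParser):
    """Grabs the text of the first <h1> — no JavaScript execution needed."""
    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.headline = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and self.headline is None:
            self._in_h1 = True

    def handle_data(self, data):
        if self._in_h1:
            self.headline = data.strip()
            self._in_h1 = False

parser = HeadlineParser()
parser.feed(page)
print(parser.headline)  # → Budget report 2026
```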

Everywhere else — modern SaaS dashboards, e-commerce, social platforms, anything that calls itself an "app" — assume dynamic. You'll be right 80% of the time and paying for an unnecessary browser the other 20%, which is the right direction to err.

Common pitfalls

  • time.sleep(N) everywhere. Either too short (flaky scrapes) or too long (wasted seconds per page). Use element-visible or network-idle waits.
  • Reading the DOM before JS finishes. Easy to forget; produces empty results that look as if the page were empty rather than not yet rendered.
  • Treating one wait strategy as universal. Network-idle works for XHR pages; element-visible works for known-DOM pages; time delay is a last resort. Match the strategy to the site.
  • Not handling lazy-load. Infinite scroll feeds need scroll triggers, not just initial-load waits.
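
The lazy-load pitfall has a standard counter: keep scrolling until the content stops growing, with a hard cap as a safety net. A browser-agnostic sketch — the `scroll_step` and `item_count` callables and the toy state are stand-ins for real browser calls:

```python
def scroll_until_stable(scroll_step, item_count, max_rounds=20):
    """Generic lazy-load loop: trigger a scroll, then stop once the number
    of loaded items stops growing (or after max_rounds as a safety cap).
    In a real scraper, a wait (element-visible or network-idle) belongs
    between the scroll and the count so new content has time to render."""
    last = item_count()
    for _ in range(max_rounds):
        scroll_step()
        current = item_count()
        if current == last:  # nothing new appeared: feed is exhausted
            break
        last = current
    return last

# Toy simulation: each scroll loads 10 more items until 35 are loaded.
state = {"items": 10}
def fake_scroll():
    state["items"] = min(state["items"] + 10, 35)

print(scroll_until_stable(fake_scroll, lambda: state["items"]))  # → 35
```

The stop condition matters: counting items (or comparing page height) terminates cleanly on finite feeds, while a fixed number of scrolls either under-fetches or wastes rounds.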

Key takeaways

  • Dynamic content scraping handles pages where content is rendered after the initial HTML response — SPAs, lazy-loaded sections, XHR-fetched content.
  • Requires a real browser executing the JavaScript, plus the right wait strategy for "the content is ready."
  • Three wait strategies, weakest to strongest: fixed delay, network-idle, element-visible. Most production pipelines combine them.
  • Notte's client.scrape(...) handles the common case; drop to client.Session(...) for pages that need explicit scroll, click, or interaction before extraction.

Build your AI agent on the open web with Notte

Cloud browsers, agent identities, and the Anything API — everything you need to ship reliable browser agents in production.