
What is anti-scraping?
By Lucas Giordano · Co-founder, Notte
TL;DR

Anti-scraping is the umbrella term for the technical and legal defenses websites use to deter scraping. It includes rate limiting, IP-reputation filtering, browser fingerprinting, CAPTCHAs, signed-URL tokens, JavaScript challenges, ML-based behavioral detection, and terms-of-service prohibitions. Each layer is bypassable in isolation; the combined posture is what shapes how modern scraping has to be done.

What is anti-scraping?

Almost every site of any commercial value runs some form of anti-scraping. The motivations vary — protecting proprietary content, controlling load, blocking competitive intelligence, complying with the terms-of-service expectations of their own customers — but the toolkit has converged. There's a relatively short list of techniques that get deployed, with weights tuned by site type. Understanding what's in the kit is what lets you (a) know which targets are reasonable to scrape and which aren't, and (b) build scraping infrastructure that handles them without re-inventing solutions for each one.

What's in the anti-scraping toolkit

Most modern anti-scraping deployments are some subset of:

  • IP reputation and rate limiting. Datacenter IP ranges get flagged at the edge; rate limits apply per IP, per account, and per session. The cheapest first line of defense; it catches casual scraping outright (a minimal limiter sketch follows this list).
  • Browser fingerprinting. Canvas hash, fonts, JS quirks, TLS handshake signatures. Default headless Chromium gets caught; coherent stealth setups can pass. See browser fingerprinting.
  • JavaScript challenges. A small JS payload runs in the visitor's browser, computes a token, returns it. Plain HTTP libraries that don't execute JS fail this stage.
  • CAPTCHAs and managed challenges. Visible challenges (image grids, hCaptcha) or invisible ones (Cloudflare Turnstile). See how do AI agents handle CAPTCHAs.
  • Signed-URL tokens. Content URLs include a time-limited signature. Hot-linking the URL elsewhere fails because the signature expires or is bound to a session (see the signing sketch below).
  • Behavioral analysis. Mouse-movement entropy, scroll cadence, key-press timing distributions over the first few seconds of interaction. Hardest to fake well.
  • ML-based traffic classification. Off-the-shelf models that score traffic on combined feature vectors. Output: pass / challenge / block routing.
  • Legal and terms-of-service prohibitions. Many sites' terms forbid automated access. Enforcement varies; the legal posture is part of the deterrent.

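Of the layers above, rate limiting is the simplest to picture in code. Here is a minimal sketch of a per-IP token bucket, roughly the shape of what runs at the edge; the class and parameters are illustrative, not any vendor's actual implementation:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-IP token bucket: each request costs one token; tokens refill
    at a fixed rate. Illustrative only -- real edge limiters also key on
    account and session, as described above."""

    def __init__(self, capacity: int = 10, refill_per_sec: float = 1.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = defaultdict(lambda: float(capacity))
        self.last_seen: dict[str, float] = {}

    def allow(self, ip: str) -> bool:
        now = time.monotonic()
        # Refill tokens based on time elapsed since this IP's last request.
        elapsed = now - self.last_seen.get(ip, now)
        self.last_seen[ip] = now
        self.tokens[ip] = min(self.capacity,
                              self.tokens[ip] + elapsed * self.refill_per_sec)
        if self.tokens[ip] >= 1.0:
            self.tokens[ip] -= 1.0
            return True
        return False  # over the limit: challenge or block this request

limiter = TokenBucket(capacity=10, refill_per_sec=1.0)
print(limiter.allow("203.0.113.7"))  # True until the bucket drains
```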
In practice, most commercially protected sites run 2–5 of these together. A bank dashboard runs all eight; an open-data government portal often runs none.
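Signed-URL tokens are also easy to see concretely. Here is a minimal sketch using HMAC; the parameter names, TTL, and URL format are assumptions rather than any specific CDN's scheme:

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"  # hypothetical signing key

def sign_url(path: str, ttl_seconds: int = 300) -> str:
    """Append an expiry timestamp and an HMAC signature to a content path."""
    expires = int(time.time()) + ttl_seconds
    payload = f"{path}?expires={expires}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}&sig={sig}"

def verify_url(signed: str) -> bool:
    """Reject if the signature doesn't match or the timestamp has passed."""
    payload, _, sig = signed.rpartition("&sig=")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    expires = int(payload.rsplit("expires=", 1)[1])
    return time.time() < expires

url = sign_url("/videos/clip.mp4")
print(verify_url(url))               # True within the TTL
print(verify_url(url + "tampered"))  # False: signature mismatch
```

This is why hot-linking fails: the copied URL carries a signature that expires, and forging a new one requires the server-side secret.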

Anti-scraping vs. anti-bot detection

Closely related; not the same:

  • Anti-bot detection is the broader category — every form of automation, hostile and benign, gets evaluated. The framing is "is this a bot?"
  • Anti-scraping is a specific subset focused on data-extraction motives. The framing is "is this a scraper, and do we want it here?"

Most modern systems collapse them into one product (Cloudflare, Akamai, and DataDome all ship both); the distinction still matters for thinking about what each layer is for. Anti-bot tries to filter all automation; anti-scraping is willing to allow some automation (search-engine crawlers, monitoring bots, paid API customers) and block only scrapers.

How modern scraping responds to each layer

Each anti-scraping technique has a corresponding scraper-side response. The arms race has been going for a decade; the responses are mostly settled:

| Anti-scraping technique | What gets caught | Scraper-side response |
| --- | --- | --- |
| IP reputation | Bare datacenter IPs | Residential proxies |
| Rate limit per IP | One IP, many requests | Rotation across an IP pool, sticky sessions |
| Fingerprinting | Default headless browsers | Coherent stealth fingerprint |
| JS challenge | Plain requests.get() | Real browser execution |
| CAPTCHA | Unattended image grids, Turnstile | Automatic solver + identity-aware avoidance |
| Signed URLs | Bare hot-links | Capture the signed URL within the session that issued it |
| Behavioral analysis | Purely programmatic actions | Behavioral simulation; Web Bot Auth for authorized agents |
| ML classification | Any mismatch across signals | Coherence across all signals (one mismatch and the score spikes) |
| Legal / ToS | Account-abuse posture | Stay within the target's stated automation policy |

The honest summary: each layer is bypassable in isolation; defeating one without defeating the others creates contradictions the next layer detects. Coherence across layers is what actually works.
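To make the first two rows of the table concrete, here is a minimal sketch of proxy rotation with sticky sessions built on requests. The proxy endpoints are placeholders; real residential providers usually encode session stickiness in the proxy username, and the exact format varies by vendor:

```python
import itertools
import requests

# Hypothetical residential proxy endpoints -- substitute your provider's.
PROXY_POOL = [
    "http://user-session-a:pass@proxy.example.com:8000",
    "http://user-session-b:pass@proxy.example.com:8000",
    "http://user-session-c:pass@proxy.example.com:8000",
]
rotation = itertools.cycle(PROXY_POOL)

def fetch(url: str, session: requests.Session, proxy: str) -> requests.Response:
    # Sticky session: reuse the same proxy (and cookies) across one logical
    # visit, so the target sees a consistent visitor, not a new IP per hit.
    return session.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)

# One logical visit per proxy; rotate between visits, not between requests.
for _ in range(3):
    proxy = next(rotation)
    with requests.Session() as s:
        resp = fetch("https://example.com/listing", s, proxy)
        print(proxy.split("@")[-1], resp.status_code)
```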

What this means for your toolkit

For most production scraping in 2026, the answer is "use a web scraping API that handles the layers for you." Hand-rolling the full stack (proxies, stealth fingerprints, CAPTCHA solving, behavioral simulation against ML classifiers) is now a serious, ongoing engineering project, not a weekend script. Notte's client.Session(proxies=True, solve_captchas=True) collapses 4–5 of the layers into two flags; the rest are configured below the SDK surface.
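As a sketch of what the managed path looks like in context: the Session flags below are the ones quoted above, while the import path and the scrape call are assumptions to verify against the current Notte SDK documentation:

```python
from notte_sdk import NotteClient  # assumed import path -- check the SDK docs

client = NotteClient(api_key="YOUR_NOTTE_API_KEY")

# proxies=True routes traffic through managed residential proxies;
# solve_captchas=True turns on the managed challenge solver.
# Fingerprinting, JS execution, and behavioral layers are handled
# below the SDK surface, per the paragraph above.
with client.Session(proxies=True, solve_captchas=True) as session:
    result = session.scrape(url="https://example.com/products")  # assumed method name
    print(result)
```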

Skip the managed approach when you're scraping unprotected targets where none of the layers fire — government open-data, simple static sites, internal tools. There the simple requests + BeautifulSoup path is genuinely fine.
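For those unprotected targets, the traditional path really is a few lines (the URL and selector here are hypothetical):

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical open-data page: no JS challenge, no fingerprinting, no CAPTCHA.
resp = requests.get("https://data.example.gov/reports", timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
for link in soup.select("a[href$='.csv']"):
    print(link["href"])  # collect the dataset links
```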

Common pitfalls

  • Treating one bypass as the whole job. "I have residential proxies, so I'm good," until the fingerprint flags you. "I have stealth Chromium," until the IP gets flagged. You need coherence across every layer, or you effectively have none.
  • Ignoring the terms of service. Technical bypass and legal compliance are different problems. Even if you can scrape a site, the terms may forbid it; check before scaling.
  • Assuming today's bypass works tomorrow. Anti-scraping vendors ship updates monthly. The right response is a managed API that updates with them, not a hand-rolled bypass that goes stale.

Key takeaways

  • Anti-scraping is the bundle of technical and legal defenses sites use to deter scraping — rate limits, fingerprinting, CAPTCHAs, signed URLs, behavioral analysis, ML classification, terms.
  • Each layer is bypassable; the combined posture requires coherence across every layer simultaneously.
  • The 2026 answer is usually a managed web scraping API — Notte collapses 4–5 layers behind two SDK flags.
  • For genuinely unprotected targets, traditional scrapers are still fine; the heavy stack is for protected commercial targets.

Build your AI agent on the open web with Notte

Cloud browsers, agent identities, and the Anything API — everything you need to ship reliable browser agents in production.