
What is anti-scraping?
By Lucas Giordano · Co-founder, Notte
TL;DR

Anti-scraping is the umbrella term for the technical and legal defenses websites use to deter scraping. It includes rate limiting, IP-reputation filtering, browser fingerprinting, CAPTCHAs, signed-URL tokens, JavaScript challenges, ML-based behavioral detection, and terms-of-service prohibitions. Each layer is bypassable in isolation; the combined posture is what shapes how modern scraping has to be done.

What is anti-scraping?

Almost every site of any commercial value runs some form of anti-scraping. The motivations vary — protecting proprietary content, controlling load, blocking competitive intelligence, complying with the terms-of-service expectations of their own customers — but the toolkit has converged. There's a relatively short list of techniques that get deployed, with weights tuned by site type. Understanding what's in the kit is what lets you (a) know which targets are reasonable to scrape and which aren't, and (b) build scraping infrastructure that handles them without re-inventing solutions for each one.

What's in the anti-scraping toolkit

Most modern anti-scraping deployments are some subset of:

  • IP reputation and rate limiting. Datacenter IP ranges get flagged at the edge; rate limits apply per IP, per account, and per session. The cheapest first line of defense; it catches casual scraping outright (a minimal limiter sketch follows this list).
  • Browser fingerprinting. Canvas hash, fonts, JS quirks, TLS handshake signatures. Default headless Chromium gets caught; coherent stealth setups can pass. See browser fingerprinting.
  • JavaScript challenges. A small JS payload runs in the visitor's browser, computes a token, returns it. Plain HTTP libraries that don't execute JS fail this stage.
  • CAPTCHAs and managed challenges. Visible challenges (image grids, hCaptcha) or invisible ones (Cloudflare Turnstile). See how do AI agents handle CAPTCHAs.
  • Signed-URL tokens. Content URLs include a time-limited signature. Hot-linking the URL elsewhere fails because the signature expires or is bound to a session (see the signing sketch below).
  • Behavioral analysis. Mouse-movement entropy, scroll cadence, key-press timing distributions over the first few seconds of interaction. Hardest to fake well.
  • ML-based traffic classification. Off-the-shelf models that score traffic on combined feature vectors. Output: pass / challenge / block routing.
  • Legal and terms-of-service prohibitions. Many sites' terms forbid automated access. Enforcement varies; the legal posture is part of the deterrent.

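Of the layers above, rate limiting is the simplest to picture in code. Here is a minimal sketch of a per-IP token bucket, roughly the shape of what runs at the edge; the class and parameters are illustrative, not any vendor's actual implementation:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-IP token bucket: each request costs one token; tokens refill
    at a fixed rate. Illustrative only -- real edge limiters also key on
    account and session, as described above."""

    def __init__(self, capacity: int = 10, refill_per_sec: float = 1.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = defaultdict(lambda: float(capacity))
        self.last_seen: dict[str, float] = {}

    def allow(self, ip: str) -> bool:
        now = time.monotonic()
        # Refill tokens based on time elapsed since this IP's last request.
        elapsed = now - self.last_seen.get(ip, now)
        self.last_seen[ip] = now
        self.tokens[ip] = min(self.capacity,
                              self.tokens[ip] + elapsed * self.refill_per_sec)
        if self.tokens[ip] >= 1.0:
            self.tokens[ip] -= 1.0
            return True
        return False  # over the limit: challenge or block this request

limiter = TokenBucket(capacity=10, refill_per_sec=1.0)
print(limiter.allow("203.0.113.7"))  # True until the bucket drains
```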
In practice, most commercially protected sites run 2–5 of these together. A bank dashboard runs all eight; an open-data government portal often runs none.
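Signed-URL tokens are also easy to see concretely. Here is a minimal sketch using HMAC; the parameter names, TTL, and URL format are assumptions rather than any specific CDN's scheme:

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"  # hypothetical signing key

def sign_url(path: str, ttl_seconds: int = 300) -> str:
    """Append an expiry timestamp and an HMAC signature to a content path."""
    expires = int(time.time()) + ttl_seconds
    payload = f"{path}?expires={expires}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}&sig={sig}"

def verify_url(signed: str) -> bool:
    """Reject if the signature doesn't match or the timestamp has passed."""
    payload, _, sig = signed.rpartition("&sig=")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    expires = int(payload.rsplit("expires=", 1)[1])
    return time.time() < expires

url = sign_url("/videos/clip.mp4")
print(verify_url(url))               # True within the TTL
print(verify_url(url + "tampered"))  # False: signature mismatch
```

This is why hot-linking fails: the copied URL carries a signature that expires, and forging a new one requires the server-side secret.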

Anti-scraping vs. anti-bot detection

Closely related; not the same:

  • Anti-bot detection is the broader category — every form of automation, hostile and benign, gets evaluated. The framing is "is this a bot?"
  • Anti-scraping is a specific subset focused on data-extraction motives. The framing is "is this a scraper, and do we want it here?"

Most modern systems collapse them into one product (Cloudflare, Akamai, and DataDome all ship both); the distinction still matters for thinking about what each layer is for. Anti-bot tries to filter all automation; anti-scraping is willing to allow some automation (search-engine crawlers, monitoring bots, paid API customers) and block only scrapers.

How modern scraping responds to each layer

Each anti-scraping technique has a corresponding scraper-side response. The arms race has been going for a decade; the responses are mostly settled:

| Anti-scraping technique | What gets caught | Scraper-side response |
| --- | --- | --- |
| IP reputation | Bare datacenter IPs | Residential proxies |
| Rate limit per IP | One IP, many requests | Rotation across an IP pool, sticky sessions |
| Fingerprinting | Default headless browsers | Coherent stealth fingerprint |
| JS challenge | Plain requests.get() | Real browser execution |
| CAPTCHA | Unattended image grids, Turnstile | Automatic solver + identity-aware avoidance |
| Signed URLs | Bare hot-links | Capture the signed URL within the session that issued it |
| Behavioral analysis | Purely programmatic actions | Behavioral simulation; Web Bot Auth for authorized agents |
| ML classification | Any mismatch across signals | Coherence across all signals (one mismatch and the score spikes) |
| Legal / ToS | Account-abuse posture | Stay within the target's stated automation policy |

The honest summary: each layer is bypassable in isolation; defeating one without defeating the others creates contradictions the next layer detects. Coherence across layers is what actually works.
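To make the first two rows of the table concrete, here is a minimal sketch of proxy rotation with sticky sessions built on requests. The proxy endpoints are placeholders; real residential providers usually encode session stickiness in the proxy username, and the exact format varies by vendor:

```python
import itertools
import requests

# Hypothetical residential proxy endpoints -- substitute your provider's.
PROXY_POOL = [
    "http://user-session-a:pass@proxy.example.com:8000",
    "http://user-session-b:pass@proxy.example.com:8000",
    "http://user-session-c:pass@proxy.example.com:8000",
]
rotation = itertools.cycle(PROXY_POOL)

def fetch(url: str, session: requests.Session, proxy: str) -> requests.Response:
    # Sticky session: reuse the same proxy (and cookies) across one logical
    # visit, so the target sees a consistent visitor, not a new IP per hit.
    return session.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)

# One logical visit per proxy; rotate between visits, not between requests.
for _ in range(3):
    proxy = next(rotation)
    with requests.Session() as s:
        resp = fetch("https://example.com/listing", s, proxy)
        print(proxy.split("@")[-1], resp.status_code)
```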

What this means for your toolkit

For most production scraping in 2026, the answer is "use a web scraping API that handles the layers for you." Hand-rolling the full stack (proxies, stealth fingerprints, CAPTCHA solving, behavioral simulation against ML classifiers) is now a serious, ongoing engineering project, not a weekend script. Notte's client.Session(proxies=True, solve_captchas=True) collapses 4–5 of the layers into two flags; the rest are configured below the SDK surface.
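As a sketch of what the managed path looks like in context: the Session flags below are the ones quoted above, while the import path and the scrape call are assumptions to verify against the current Notte SDK documentation:

```python
from notte_sdk import NotteClient  # assumed import path -- check the SDK docs

client = NotteClient(api_key="YOUR_NOTTE_API_KEY")

# proxies=True routes traffic through managed residential proxies;
# solve_captchas=True turns on the managed challenge solver.
# Fingerprinting, JS execution, and behavioral layers are handled
# below the SDK surface, per the paragraph above.
with client.Session(proxies=True, solve_captchas=True) as session:
    result = session.scrape(url="https://example.com/products")  # assumed method name
    print(result)
```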

Skip the managed approach when you're scraping unprotected targets where none of the layers fire — government open-data, simple static sites, internal tools. There the simple requests + BeautifulSoup path is genuinely fine.
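For those unprotected targets, the traditional path really is a few lines (the URL and selector here are hypothetical):

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical open-data page: no JS challenge, no fingerprinting, no CAPTCHA.
resp = requests.get("https://data.example.gov/reports", timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
for link in soup.select("a[href$='.csv']"):
    print(link["href"])  # collect the dataset links
```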

Common pitfalls

  • Treating one bypass as the whole job. "I have residential proxies, so I'm good," until the fingerprint flags you. "I have stealth Chromium," until the IP gets flagged. You need coherence across every layer, or you effectively have none.
  • Ignoring the terms of service. Technical bypass and legal compliance are different problems. Even if you can scrape a site, the terms may forbid it; check before scaling.
  • Assuming today's bypass works tomorrow. Anti-scraping vendors ship updates monthly. The right response is a managed API that updates with them, not a hand-rolled bypass that goes stale.

Key takeaways

  • Anti-scraping is the bundle of technical and legal defenses sites use to deter scraping — rate limits, fingerprinting, CAPTCHAs, signed URLs, behavioral analysis, ML classification, terms.
  • Each layer is bypassable; the combined posture requires coherence across every layer simultaneously.
  • The 2026 answer is usually a managed web scraping API — Notte collapses 4–5 layers behind two SDK flags.
  • For genuinely unprotected targets, traditional scrapers are still fine; the heavy stack is for protected commercial targets.

Build your AI agent on the open web with Notte

Cloud browsers, agent identities, and the Anything API — everything you need to ship reliable browser agents in production.