How do websites detect scrapers?

Websites detect scrapers by combining several signals at different layers: IP reputation and ASN, TLS and HTTP fingerprints, JavaScript challenge results, browser-environment probes, behavioral patterns (mouse, scroll, timing), and request-rate / sequence anomalies over time. Each signal is bypassable alone; the verdict is the combined risk score, and most systems treat coherence across layers as the strongest signal.
Detection isn't one trick; it's a pipeline of weak signals fused into a strong verdict. Within the first ten seconds of browsing, the fast signals (IP reputation, TLS and HTTP fingerprints, JavaScript probes) have already been collected, scored, and combined; over the following minutes, the slower ones (behavior, rate and sequence, cross-request consistency) accumulate on top. Each signal in isolation is easy to fool; the combined verdict is hard, because faking one signal cleanly often creates a contradiction with another. The whole architecture is designed around catching the contradictions.
Six classes of signal
Production detection systems read all of these on every visit:
- IP reputation. ASN, geolocation, history of past abuse, presence on commercial-proxy lists. The cheapest layer; runs before the page renders.
- TLS and HTTP fingerprints. TLS handshake signature (JA3/JA4), HTTP/2 frame ordering, header order. Each browser and each scraping library has a distinctive signature here.
- JavaScript-environment probes. Run after the page loads. Read navigator.webdriver, the canvas-rendering hash, the audio-context output, font enumeration, and dozens of headless-mode tells (see the sketch after this list).
- Behavioral patterns. Mouse-movement entropy, scroll cadence, key-press intervals over the first 5–10 seconds. Real users move imperfectly; bots move too perfectly.
- Rate and sequence anomalies. How fast the scraper navigates, what order pages are visited in, how long it dwells on each. Real users explore irregularly; bots tend toward optimal-path patterns.
- Cross-request consistency. The same scraper hitting twenty pages should look like the same browser doing it. Subtle drift in fingerprint or behavior across requests is detection-grade signal.
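To see the most basic of those JavaScript tells concretely, here's a minimal sketch (assuming Playwright's Python package and its bundled Chromium are installed): a default automated launch typically leaves navigator.webdriver set to true, which is exactly the property a probe script reads first.

```python
from playwright.sync_api import sync_playwright

# Launch Chromium the way a naive scraper would and read the same property a
# detection script reads. In a default automated launch this typically prints
# True, the canonical automation tell.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("about:blank")
    print(page.evaluate("navigator.webdriver"))
    browser.close()
```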
The output is a risk score. Common thresholds: < 30 → pass, 30–70 → CAPTCHA challenge, > 70 → block.
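As a rough illustration of how those thresholds might sit on top of the six signal classes, here is a toy fusion function. The signal names, weights, and boolean inputs are assumptions made up for the sketch; production systems use learned models and continuous sub-scores, not hand-tuned weights.

```python
# Toy signal-fusion model. Names, weights, and thresholds are illustrative
# assumptions, not any vendor's actual scoring logic.
SIGNAL_WEIGHTS = {
    "ip_reputation": 25,         # datacenter ASN, known proxy range
    "tls_http_fingerprint": 25,  # JA3/JA4 or header order doesn't match the claimed browser
    "js_probe": 20,              # navigator.webdriver, canvas/audio headless tells
    "behavioral": 15,            # too-perfect mouse, scroll, and timing
    "rate_sequence": 10,         # optimal-path navigation, even pacing
    "cross_request_drift": 5,    # fingerprint changes mid-session
}

def risk_score(fired: dict) -> int:
    """Sum the weight of every signal that fired."""
    return sum(weight for name, weight in SIGNAL_WEIGHTS.items() if fired.get(name))

def verdict(score: int) -> str:
    if score < 30:
        return "pass"
    if score <= 70:
        return "captcha"
    return "block"

# A clean residential IP, but a headless fingerprint and robotic pacing.
score = risk_score({"js_probe": True, "behavioral": True, "rate_sequence": True})
print(score, verdict(score))  # 45 captcha
```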
What each layer actually catches
| Layer | Catches | Misses |
|---|---|---|
| IP reputation | Naive scraping from datacenter IPs | Anyone using residential proxies |
| TLS / HTTP | requests, curl, httpx, Go stdlib clients | Real-browser traffic |
| JS probes | Default Playwright/Puppeteer, vanilla headless | Properly aligned stealth setups |
| Behavioral | Programmatic action sequences | Sophisticated behavioral simulation |
| Rate / sequence | Optimized scraper paths, even pacing | Scrapers that mimic human exploration |
| Cross-request | Drift across many requests | Stable identity-aware sessions |
Most scraping that gets blocked in the wild is caught at layers 1–2 (cheapest, highest precision). Layers 3–6 catch the long tail of stealth-equipped scrapers.
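The cheapness of layers 1–2 is easy to demonstrate. A stock requests client announces itself before any page even renders; the snippet below simply echoes the request back via httpbin.org (assuming it's reachable), and the TLS handshake underneath is just as distinctive even when the headers are spoofed.

```python
import requests

# What a stock requests client announces about itself. The python-requests
# User-Agent alone is a layer-2 giveaway, and the TLS handshake is
# fingerprintable (JA3/JA4) even if these headers are rewritten.
r = requests.get("https://httpbin.org/headers", timeout=10)
print(r.json()["headers"])
# Typically something like:
# {"Accept": "*/*", "Accept-Encoding": "gzip, deflate",
#  "Host": "httpbin.org", "User-Agent": "python-requests/2.x.x", ...}
```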
Where coherence comes in
A residential IP from Brazil with Accept-Language: en-US and macOS-only fonts is more suspicious than a coherent datacenter request. Each individual signal is plausible; the combination is not. Modern detection systems weight coherence heavily — they're not just looking for known-bad signatures, they're looking for stories that don't add up.
This is what makes hand-rolled scraping fragile. Defeating one layer in isolation often creates a contradiction the next layer detects. The whole reason managed scraping APIs and cloud browsers exist is to keep the layers coherent without requiring you to coordinate them yourself.
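To make "stories that don't add up" concrete, here is a toy coherence check. The field names, rules, and the ja3_class label are invented for this sketch; real systems learn these correlations rather than hard-coding them. Each rule fires only on a combination of signals that are individually plausible.

```python
# Toy coherence check: every rule targets a contradiction between signals,
# not a single bad value. All fields and rules are illustrative assumptions.
def coherence_flags(profile: dict) -> list:
    flags = []
    # Residential Brazilian IP claiming a US-English-only browser.
    if profile.get("ip_country") == "BR" and profile.get("accept_language", "").startswith("en-US"):
        flags.append("geo/language mismatch")
    # User-Agent claims Windows, but the enumerated fonts are macOS-only.
    if "Windows" in profile.get("user_agent", "") and "Helvetica Neue" in profile.get("fonts", []):
        flags.append("platform/font mismatch")
    # User-Agent claims Chrome, but the TLS handshake hashes like a script client.
    if "Chrome" in profile.get("user_agent", "") and profile.get("ja3_class") != "chrome":
        flags.append("UA/TLS mismatch")
    return flags

print(coherence_flags({
    "ip_country": "BR",
    "accept_language": "en-US,en;q=0.9",
    "user_agent": "Mozilla/5.0 (Macintosh) Chrome/120.0",
    "fonts": ["Helvetica Neue", "Menlo"],
    "ja3_class": "chrome",
}))  # ['geo/language mismatch']
```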
Detection over time
Detection isn't a one-shot decision at the first request. Three temporal patterns:
- Single-request scoring. Most signals score on the first request. Caught here, no further action.
- Within-session aggregation. Behavioral signals accumulate over the first seconds-to-minutes of the session. A scraper might pass the first request and fail at second 8.
- Cross-session pattern detection. Multi-request and multi-session patterns get evaluated on a longer horizon. A scraper might pass any individual request but get flagged after 50 requests show the same behavioral signature.
The third class is the hardest to evade — and the reason long-running scrapes against the same target need to vary their patterns deliberately, not just rotate IPs.
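On the scraper side, "vary your patterns deliberately" looks something like the sketch below. The fetch callable, target list, and filler URLs are placeholders; the point is jittered dwell times, non-optimal ordering, and occasional exploratory visits so fifty requests don't share one behavioral signature.

```python
import random
import time

# Sketch of deliberate pattern variation. fetch, target_urls, and filler_urls
# are placeholders supplied by the caller.
def scrape_with_variation(target_urls: list, fetch, filler_urls: list):
    order = list(target_urls)
    random.shuffle(order)                     # don't walk the optimal path
    for url in order:
        fetch(url)
        time.sleep(random.uniform(2.0, 9.0))  # irregular dwell, not a fixed delay
        if random.random() < 0.2:             # occasionally wander like a real user
            fetch(random.choice(filler_urls))
            time.sleep(random.uniform(1.0, 4.0))
```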
What this means for your scraping setup
The honest summary: detection has won the arms race against hand-rolled scrapers for any commercial target. The cases where simple requests + BeautifulSoup still works are unprotected open-data sites and internal tools you have permission to access. For everything else, you need either:
- A managed scraping API that keeps the layers coherent for you, or
- A verified-agent identity that bypasses the whole detection pipeline by being legitimately authenticated.
Hand-rolling all six layers (residential proxy + stealth fingerprint + JS-passing + behavioral simulation + rate shaping + cross-session consistency) is now a mature engineering project. Most teams find it isn't differentiating for their product.
Common pitfalls
- Treating one bypass as the whole job. Residential proxies don't help if your fingerprint screams "headless Chromium." Stealth Chromium doesn't help if your IP is on every reputation list.
- Coherent first request, drifting later requests. Easy to set up the first request well; hard to maintain coherence across 1000 requests with the same identity. Sessions and profiles fix this (see the persistent-profile sketch after this list).
- Optimal-path scraping. Hitting only the URLs you need, in the order you need them, with no exploration. Real users don't do that. Add some realistic noise.
- Ignoring the temporal axis. Detection isn't single-request — patterns over time get evaluated separately.
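For the session-coherence pitfall above, one common approach, assuming a Playwright-based setup (the profile path and URLs are placeholders), is a persistent browser profile: it reuses the same cookies, storage, and fingerprint surface for every page in a run instead of rebuilding a subtly different browser per request.

```python
from playwright.sync_api import sync_playwright

# Keep one stable identity across a run by reusing a persistent profile
# directory. Paths and URLs are placeholders for this sketch.
with sync_playwright() as p:
    context = p.chromium.launch_persistent_context(
        user_data_dir="./profiles/scraper-01",  # same profile directory every run
        headless=True,
    )
    page = context.new_page()
    for url in ["https://example.com/a", "https://example.com/b"]:
        page.goto(url)
        # ... extract what you need from page.content() ...
    context.close()
```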
Key takeaways
- Websites detect scrapers by combining six classes of signal — IP, TLS/HTTP fingerprint, JS probes, behavior, rate patterns, cross-request consistency — into a single risk score.
- Each layer is bypassable alone; coherence across layers is what's actually being tested.
- The arms race has settled in detection's favor for hand-rolled scrapers against commercial targets; managed APIs or verified-agent identities are the production answers.
- Detection runs both within a request and over time; the temporal axis catches scrapers that pass single-request checks.