How does bot detection work?

By Lucas Giordano · Co-founder, Notte

TL;DR

Bot detection runs as a pipeline of cheap-to-expensive checks: passive signals (IP reputation, TLS fingerprint) before the page renders, active JavaScript probes once the browser executes, and behavioral analysis over the first few seconds of interaction. Each stage filters traffic to the next, and the cheap early stages weed out the bulk of unsophisticated automation. The combined output is a risk score that drives pass / challenge / block routing.

How does bot detection work?

Detection is a pipeline, not a single algorithm. By the time a visitor reaches the page itself, three classes of signal have already been collected, weighted, and combined; by the time they've spent five seconds on the page, two more classes have been added. The whole architecture is designed so the cheapest signals run first (because they catch most automation cheaply) and the expensive ones only run on traffic that survives the cheap layers. Knowing what runs at each stage is what makes stealth automation work — you have to align all the signals, not just one.
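
To make that ordering concrete, here is a minimal TypeScript sketch of how the cheap stages might gate the expensive ones. The check functions, IP ranges, hashes, and scores are invented for illustration and don't come from any real product.

```typescript
// Illustrative stage ordering: cheap checks run first, and each stage only
// sees traffic the previous stages let through. Names, ranges, and scores
// are invented for the example.

interface Visit {
  ip: string;
  tlsFingerprint: string; // e.g. a JA3/JA4-style hash of the TLS handshake
}

// Placeholder stage checks returning a risk contribution in [0, 100].
const ipReputationScore = (v: Visit): number =>
  v.ip.startsWith("203.0.113.") ? 60 : 5;            // pretend this is a datacenter range
const tlsFingerprintScore = (v: Visit): number =>
  v.tlsFingerprint === "known-chrome-hash" ? 0 : 50; // unknown handshake => suspicious

function cheapStages(visit: Visit): number | "early-block" {
  let risk = ipReputationScore(visit);   // Stage 1: runs on every request
  if (risk >= 80) return "early-block";  // clearly bad traffic never reaches later stages

  risk += tlsFingerprintScore(visit);    // Stage 2: still before any JavaScript executes
  if (risk >= 80) return "early-block";

  return risk; // survivors carry their score forward into stages 3-5
}
```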

The pipeline, stage by stage

A modern detection system (Cloudflare, Akamai, DataDome, Imperva, in-house variants) runs its checks in roughly this order:

Stage 1 — IP / ASN reputation

Before any HTML or JavaScript loads:

  • The connecting IP is looked up in a reputation database — known commercial proxies, datacenter ranges, TOR exits, abuse-history records.
  • The ASN's history of abusive traffic is factored in.
  • Geolocation gets compared to expected traffic patterns for the site.

The cheapest layer; catches most casual scraping outright. Datacenter IPs from AWS / GCP / Hetzner / DigitalOcean get heavily penalized by default. Residential IPs from clean providers pass.
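
As a rough sketch of what a stage-1 lookup involves, the snippet below scores an IP against a tiny in-memory reputation table. The ASNs listed are real cloud providers, but the weights, abuse entries, and IPs are made up for the example.

```typescript
// Toy stage-1 check: score an IP by ASN, abuse history, and geolocation fit.
// The ASNs are real cloud providers; the weights and abuse entries are invented.

const datacenterAsns = new Set([16509, 15169, 24940, 14061]); // AWS, GCP, Hetzner, DigitalOcean
const abusiveIps = new Set(["198.51.100.7"]);                 // example abuse-history entry

interface IpInfo {
  asn: number;
  country: string;
}

function ipReputationRisk(ip: string, info: IpInfo, expectedCountries: Set<string>): number {
  let risk = 0;
  if (abusiveIps.has(ip)) risk += 60;                    // known abuse history
  if (datacenterAsns.has(info.asn)) risk += 40;          // datacenter ranges penalized by default
  if (!expectedCountries.has(info.country)) risk += 10;  // geo doesn't match the site's usual traffic
  return Math.min(risk, 100);
}

// Example: a DigitalOcean IP hitting a site that mostly serves US and FR traffic.
console.log(ipReputationRisk("203.0.113.9", { asn: 14061, country: "NL" }, new Set(["US", "FR"])));
```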

Stage 2 — TLS and HTTP fingerprint

Still before any JS runs:

  • The TLS handshake (cipher suite ordering, extensions, JA3/JA4 hash) gets compared to a database of known clients. Real Chrome's handshake differs from that of requests, curl, Python's httpx, and Go's stdlib client; each leaves a distinctive signature.
  • HTTP/2 frame ordering, header order, and pseudo-header sequencing also leave fingerprints.

This catches almost all naive scraping libraries. Even "headless Chromium" has subtle differences from headful Chrome at this layer. Mitigated by using a real browser, not by spoofing headers in requests.
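
To make the fingerprinting idea concrete, here is a sketch of how a JA3-style hash is derived from the ClientHello fields and matched against known clients. The hash values in the lookup table are placeholders, not real signatures.

```typescript
import { createHash } from "node:crypto";

// JA3-style fingerprint: join the ClientHello fields in a fixed order,
// dash-separate the values inside each field, then hash the result.
interface ClientHello {
  tlsVersion: number;
  ciphers: number[];
  extensions: number[];
  ellipticCurves: number[];
  pointFormats: number[];
}

function ja3(hello: ClientHello): string {
  const fields = [
    String(hello.tlsVersion),
    hello.ciphers.join("-"),
    hello.extensions.join("-"),
    hello.ellipticCurves.join("-"),
    hello.pointFormats.join("-"),
  ].join(",");
  return createHash("md5").update(fields).digest("hex");
}

// Placeholder lookup table; a real system maps observed hashes to client labels
// ("Chrome 131", "python-requests", "Go net/http", ...) built from telemetry.
const knownClients: Record<string, string> = {
  "placeholder-hash-for-chrome": "Chrome",
  "placeholder-hash-for-python-requests": "python-requests",
};

const classify = (hello: ClientHello): string =>
  knownClients[ja3(hello)] ?? "unknown client";
```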

Stage 3 — JavaScript challenge

The page returns a small piece of JavaScript that runs in the visitor's browser:

  • Reads navigator.webdriver, the user-agent, the canvas-rendering hash, the audio-context output.
  • Tests for headless tells (presence/absence of certain Chrome APIs, missing fonts, default screen sizes).
  • Computes the browser fingerprint and submits it.
  • Optionally returns a token the page validates server-side before serving real content.

This is where vanilla Playwright and Puppeteer get caught. Real stealth setups patch dozens of these tells.
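
A few of those probes are simple enough to sketch directly. The browser-side snippet below collects a handful of the tells listed above; real challenge scripts gather far more signals and are heavily obfuscated.

```typescript
// Browser-side sketch of a few stage-3 probes (this runs inside the page).
// Real challenge scripts collect far more signals and obfuscate heavily.

function collectFingerprint(): Record<string, unknown> {
  // Canvas probe: draw fixed text, then take a slice of the encoded pixels
  // as a crude stand-in for a canvas-rendering hash.
  const canvas = document.createElement("canvas");
  const ctx = canvas.getContext("2d");
  let canvasSample = "";
  if (ctx) {
    ctx.textBaseline = "top";
    ctx.font = "14px Arial";
    ctx.fillText("fingerprint-probe", 2, 2);
    canvasSample = canvas.toDataURL().slice(-32);
  }

  return {
    webdriver: navigator.webdriver === true,     // classic automation tell
    userAgent: navigator.userAgent,
    languages: navigator.languages,              // empty in some headless setups
    screen: `${screen.width}x${screen.height}`,  // default headless sizes stand out
    hasChromeObject: "chrome" in window,         // absent in some headless builds
    canvasSample,
  };
}

// A real challenge script would POST this object to the detector and receive
// a token the server validates before serving real content.
console.log(collectFingerprint());
```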

Stage 4 — Behavioral analysis

Once the page is interactive:

  • Mouse-movement entropy is measured (real users move in slightly-curved paths with imperfect timing; bots move in straight lines or perfect curves).
  • Scroll cadence and acceleration get logged.
  • Key-press intervals get tested against expected statistical distributions.
  • Time between page interactions is compared to human-population baselines.

The hardest layer to fake well. Sophisticated stealth setups ship behavioral simulation; cheap ones don't.
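
As a toy example of one such check, the snippet below flags mouse traces that are too straight and too evenly timed. The thresholds are invented for illustration; real systems model many signals statistically.

```typescript
// Toy stage-4 check: flag mouse traces that are suspiciously straight and
// metronomically timed. Thresholds are invented for the example.

interface MouseSample {
  x: number;
  y: number;
  t: number; // timestamp in ms
}

function looksScripted(trace: MouseSample[]): boolean {
  if (trace.length < 3) return false;

  // Path linearity: straight-line distance vs. distance travelled along the path.
  let pathLen = 0;
  for (let i = 1; i < trace.length; i++) {
    pathLen += Math.hypot(trace[i].x - trace[i - 1].x, trace[i].y - trace[i - 1].y);
  }
  const direct = Math.hypot(
    trace[trace.length - 1].x - trace[0].x,
    trace[trace.length - 1].y - trace[0].y,
  );
  const straightness = direct / Math.max(pathLen, 1); // ~1.0 means a perfectly straight line

  // Timing regularity: variance of the intervals between samples.
  const gaps = trace.slice(1).map((s, i) => s.t - trace[i].t);
  const mean = gaps.reduce((a, b) => a + b, 0) / gaps.length;
  const variance = gaps.reduce((a, b) => a + (b - mean) ** 2, 0) / gaps.length;

  return straightness > 0.99 && variance < 1; // too straight and too evenly paced
}
```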

Stage 5 — Long-tail signals

Across multiple requests in a session:

  • Session-cookie consistency over time.
  • Sequence of pages visited (real users explore; bots take direct paths).
  • Geolocation drift.
  • Device-fingerprint stability across requests.

This layer exists to catch whatever slipped past the first four.
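
One way to picture a stage-5 check is a per-session consistency test over the device fingerprints and geolocations observed so far. The sketch below uses invented weights.

```typescript
// Toy stage-5 check: accumulate per-session observations and flag sessions
// whose device fingerprint or geolocation drifts mid-session. Weights are invented.

interface Observation {
  deviceFingerprint: string; // stage-3 fingerprint seen on this request
  country: string;           // geolocation of the requesting IP
}

class SessionHistory {
  private observations: Observation[] = [];

  record(obs: Observation): void {
    this.observations.push(obs);
  }

  longTailRisk(): number {
    const fingerprints = new Set(this.observations.map((o) => o.deviceFingerprint));
    const countries = new Set(this.observations.map((o) => o.country));
    let risk = 0;
    if (fingerprints.size > 1) risk += 50; // fingerprint should stay stable within a session
    if (countries.size > 1) risk += 30;    // geolocation drift mid-session
    return risk;
  }
}

// Example: same session, two different fingerprints => suspicious.
const session = new SessionHistory();
session.record({ deviceFingerprint: "fp-a", country: "US" });
session.record({ deviceFingerprint: "fp-b", country: "US" });
console.log(session.longTailRisk()); // 50
```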

What each layer actually catches

  • Stage 1 (IP). Catches: casual scraping, anyone using bare datacenter IPs. Misses: anyone with residential proxies.
  • Stage 2 (TLS / HTTP). Catches: naive HTTP libraries, scripted requests. Misses: real-browser-driven traffic.
  • Stage 3 (JS / fingerprint). Catches: vanilla Playwright/Puppeteer, default headless. Misses: properly aligned stealth setups.
  • Stage 4 (behavioral). Catches: programmatic action sequences, bot fleets. Misses: carefully simulated behavioral patterns.
  • Stage 5 (long-tail). Catches: anything that doesn't behave like a real user across requests. Misses: coherent, identity-aware stealth.

How the verdict gets routed

The output of the pipeline is a risk score, not a yes/no. Common routing:

  • Score < 30 — pass. Real content gets served, no challenge.
  • Score 30–70 — challenge. A CAPTCHA, a managed-challenge interstitial, sometimes a rate limit.
  • Score > 70 — block. An outright denial (often a 403) or a tarpit (fake content served slowly to waste the bot's time).

The thresholds are tunable per site; high-friction sites (banks, government portals) bias toward false positives (blocking some real users to catch more bots), while low-friction sites bias toward false negatives.
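
Expressed as code, the routing itself is just a threshold function over the combined score; the default cut-offs below mirror the example numbers above and would be per-site knobs in practice.

```typescript
type Route = "pass" | "challenge" | "block";

// Threshold routing over the combined risk score; the cut-offs are per-site knobs.
function route(riskScore: number, challengeFloor = 30, blockFloor = 70): Route {
  if (riskScore > blockFloor) return "block";          // deny, 403, or tarpit
  if (riskScore >= challengeFloor) return "challenge"; // CAPTCHA, managed challenge, rate limit
  return "pass";                                       // serve real content, no challenge
}
```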

Common pitfalls

  • Treating "I made it past the IP check" as "I passed detection." Stage 1 is the easiest stage. Most blocks happen at stages 2–4.
  • Optimizing for stage 3 only. Defeating the JS fingerprint without realistic behavior just means getting caught at stage 4.
  • Forgetting stage 5. A coherent first-request profile that drifts across requests gets caught downstream.

Key takeaways

  • Bot detection runs as a pipeline of stages, cheap-to-expensive: IP reputation → TLS/HTTP fingerprint → JS-challenge fingerprint → behavioral analysis → long-tail consistency.
  • Each stage catches a different tier of automation; the combined verdict is a risk score, not a binary.
  • Defeating one stage in isolation creates contradictions the next stage detects; coherent stealth across every stage is the only durable strategy.
  • For legitimate agents, the medium-term answer is cryptographic identity via Web Bot Auth — a separate path that bypasses the whole anti-bot stack.

Build your AI agent on the open web with Notte

Cloud browsers, agent identities, and the Anything API — everything you need to ship reliable browser agents in production.