
What is scraping behind authentication?

By Lucas Giordano · Co-founder, Notte
TL;DR

Scraping behind authentication is the practice of extracting data from logged-in pages — dashboards, internal tools, supplier portals, anything gated by a login. The actually-valuable web data lives there. Doing it cleanly requires authentication that survives 2FA, session state that persists across runs, and the discipline to stay within the bounds of the target's terms. The whole thing collapses to a one-call shape with the right primitives.

What is scraping behind authentication?

The cleanest scraping work is on public pages — the open web, server-rendered HTML, no login wall. The valuable scraping work is rarely on those pages. Internal SaaS dashboards have the operational metrics. Supplier portals have the invoices. Banking interfaces have the transaction history. None of it is on the open web; all of it requires you to log in. Scraping behind authentication is the bundle of techniques that makes that surface reachable from automation — and the difference between a research project and a workflow that actually ships in production.

Why this used to be hard

Two architectural realities of authenticated sites broke every old scraping playbook:

  • Sessions expire. Cookie-based logins last hours-to-days; cookie-paste-into-script wears out fast and requires a human to refresh.
  • 2FA is everywhere now. Almost any meaningful site issues an SMS or email code on login; a script with no phone number and no inbox can't get past it.

The standard pre-2024 workaround was: a human logs in once, exports cookies, pastes them into a script, the script runs against those cookies until they expire (typically days, sometimes hours), then the human re-authenticates. Works for a single workflow run by a single team. Breaks every multi-user product. Breaks every async workflow. Breaks everything that should run unattended.

What actually works in 2026

Three primitives make authenticated scraping clean. Pick by the shape of the auth flow:

  1. Vault + agent login. The agent fills a standard email-and-password form. The credential lives in a vault and never enters the LLM context. Best for sites with predictable login flows and no 2FA (or 2FA the agent can solve via identity).
  2. Digital identity. A persistent identity attached to the agent — its own email, SMS-capable phone, fingerprint. Solves 2FA end-to-end without a human in the loop. Best for sign-up flows and sites that re-challenge.
  3. Browser profile. A human (or you, the developer) completes the login once via a viewer with persist=True. Subsequent agent runs attach the profile and start authenticated. Best for SSO, sticky-device flows, and anything an agent can't reliably solve from scratch.

The full decision tree lives at how do AI agents log into authenticated websites. Authenticated scraping is that decision tree applied to the specific case of "the agent's job is to extract data."
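As a rough restatement of that decision tree in code — the three boolean flags are illustrative simplifications of the flow shapes above, not SDK parameters:

```python
def pick_auth_primitive(has_2fa: bool, uses_sso: bool, predictable_form: bool) -> str:
    """Toy version of the decision tree: map the shape of a login flow
    to one of the three primitives."""
    if uses_sso or not predictable_form:
        return "browser profile"    # a human completes login once; agents reuse it
    if has_2fa:
        return "digital identity"   # the agent owns the inbox/phone the challenge hits
    return "vault + agent login"    # the agent fills the form from vaulted credentials

print(pick_auth_primitive(has_2fa=False, uses_sso=False, predictable_form=True))
```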

The Notte SDK shape

The simplest case — a vault-backed agent extracting data from a known login flow:

main.py
from notte_sdk import NotteClient
from pydantic import BaseModel

client = NotteClient()
vault = client.Vault(vault_id="my_vault")
vault.add_credentials(
    url="https://supplier-portal.com",
    email="agent@example.com",
    password="...",
)

class Invoice(BaseModel):
    invoice_id: str
    amount_eur: float
    due_date: str

with client.Session(proxies=True) as session:
    agent = client.Agent(vault=vault, session=session, max_steps=15)
    response = agent.run(
        task="Log into the supplier portal, navigate to invoices, "
             "and return the unpaid invoices from this quarter.",
        response_format=list[Invoice],
    )

for invoice in response.output:
    print(invoice.invoice_id, invoice.amount_eur)

One call. Auth, navigation, extraction, schema validation — all of it inside agent.run(...). Replace the vault with a digital identity for sign-up flows; replace the session with a profile-attached session for SSO.

What this isn't

Two things authenticated scraping is not, both common confusions:

  • Scraping public pages while logged in. If the data is reachable without a login, you don't need any of this — use a regular scraping API. Authenticated scraping is for data only available to logged-in users.
  • Brute-forcing accounts. Authenticated scraping uses credentials you legitimately hold (or the user gave you, or they belong to your test account). Anything else is account abuse, not scraping.

When it's the right tool

Reach for authenticated scraping when all of these are true:

  • The data you need is genuinely behind a login.
  • You hold valid credentials (your own account, or your customer's account they enrolled with you).
  • The target site's terms permit automated access for your account.
  • The workflow needs to run repeatedly or unattended.

The terms-of-service question is real. Many SaaS terms permit automated access for the account holder; some don't. Check the specific target's terms before deploying at scale.

Common pitfalls

  • Persisting cookies by hand. Modern apps store auth in localStorage that simple cookie-pickling misses. Use a profile instead.
  • Single shared corporate account for multi-user work. Per-user data needs per-user identities or per-user profiles. One shared account is a compliance and isolation liability.
  • No verifier on the post-login state. "The agent reported it scraped" — but did it land on the dashboard, or on a 2FA prompt? Verify the post-login state before trusting the output.
  • Hammering rate limits. Authenticated APIs (and authenticated UIs) have rate limits. Coordinate timing with the target's tolerance — see how do websites detect scrapers.
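The verifier pitfall in particular is cheap to guard against. A minimal, stack-agnostic sketch of a post-login check — the marker strings are illustrative and should be adapted to the target, and nothing here is part of any SDK:

```python
def verify_logged_in(landed_url: str, page_text: str) -> bool:
    """Cheap post-login check: did we land on the dashboard, or on a
    login/2FA challenge page? Heuristics only."""
    challenge_markers = ("verification code", "two-factor", "/login", "/mfa")
    haystack = (landed_url + " " + page_text).lower()
    if any(marker in haystack for marker in challenge_markers):
        return False
    # Require a positive signal, not merely the absence of a challenge.
    return "/dashboard" in landed_url.lower()
```

Run a check like this before trusting any extracted output; an agent that reports success from a 2FA prompt is the most common silent failure mode.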

Key takeaways

  • Scraping behind authentication is extracting data from logged-in pages — the surface where valuable web data actually lives.
  • Three primitives cover most cases: vault + agent login (predictable forms), digital identity (sign-ups + 2FA), browser profile (SSO + sticky-device flows).
  • Notte's client.Agent(vault=..., session=...) collapses auth + navigation + extraction + validation into one call.
  • Honest scope: this is for data your account can legitimately access. Check the target's terms; this isn't a tool for account abuse.

Build your AI agent on the open web with Notte

Cloud browsers, agent identities, and the Anything API — everything you need to ship reliable browser agents in production.