What is a browser agent?

A browser agent is an AI system that perceives a webpage, decides what to do, and executes the action through a real browser — repeating the loop until a task is complete. Browser agents combine an LLM (the planner), a perception layer (DOM and/or vision), and a controlled browser session, and they are the primary way LLMs interact with websites that don't expose APIs.
In practice, the agent uses a real web browser as its tool of action. Given a high-level goal in natural language — "book me a flight from Paris to Tokyo next Tuesday under €800" — it perceives the current state of the browser, decides what to do next, executes the action (a click, a keystroke, a scroll), and repeats the loop until the goal is reached or it gives up.
The "agent" part is what separates a browser agent from a script: there is no fixed sequence of steps. The agent re-plans on every iteration based on what the page now shows. The "browser" part is what separates a browser agent from a generic LLM: instead of talking to a server-side API that may or may not exist, it manipulates the same UI a human would.
Browser agents have become the dominant way to give large language models the ability to act on the web. Most of the long tail of useful services — internal dashboards, government portals, niche SaaS, anything behind a login — has no public API. A browser agent is how an LLM reaches them.
The problem browser agents solve
The common framing is "only 15% of the web is reachable through APIs." The other 85% lives behind login walls, dynamic SPAs, anti-bot defenses, and UIs that change every quarter. Three older approaches all struggle with this surface:
- Traditional scrapers (request-and-parse) break on JavaScript-rendered SPAs and don't handle authentication, multi-step flows, or layout changes.
- RPA tools (record-and-replay) work on stable internal apps but shatter the moment a button moves or a class name changes.
- Custom Playwright/Puppeteer scripts are flexible but expensive to write and even more expensive to maintain — every site change is a maintenance ticket.
Browser agents solve this with a different bet: instead of hard-coding the path through a UI, let an LLM look at the page on every step and decide what to do. When the page changes, the agent's plan adapts automatically. When the workflow is a one-off, you don't write code at all — you describe the goal.
How browser agents work
Every browser agent — whether it's Notte, OpenAI's or Anthropic's computer-use agents, or a homegrown Stagehand stack — implements some variant of the same observe-think-act loop:
- Observe. Capture the current state of the browser. This might be a screenshot, a structured DOM extract, the accessibility tree, or all three. The richer the observation, the better the agent's decisions, but also the higher the per-step cost.
- Think. Feed the observation, the user's goal, and the recent action history into an LLM. The model outputs a structured action — click element id 42, type "Tokyo" into the destination input, scroll down 600px — usually as JSON via tool-calling.
- Act. Translate the structured action into a browser primitive (mouse click at coordinate, type into input, key press) and execute it through the Chrome DevTools Protocol or an equivalent.
- Repeat until the agent decides the goal is reached, a verifier signals success or failure, or the step budget runs out.
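
In code, the loop is only a few lines. The sketch below is a minimal illustration rather than any framework's actual API: the `browser` driver, the `llm_decide` call, and the JSON action schema are hypothetical stand-ins. It shows how the three phases connect and why the step budget lives inside the loop itself.

```python
import json

def run_agent(goal: str, browser, llm_decide, max_steps: int = 12):
    """Minimal observe-think-act loop. `browser` and `llm_decide` are
    hypothetical stand-ins for a browser driver and an LLM call."""
    history = []  # short-term memory: what has been tried so far
    for _ in range(max_steps):
        # Observe: capture page state (screenshot, DOM extract, a11y tree, ...)
        observation = browser.observe()

        # Think: the LLM returns a structured action as JSON via tool-calling,
        # e.g. {"action": "click", "element_id": 42}
        #      {"action": "type", "element_id": 17, "text": "Tokyo"}
        #      {"action": "done", "output": "<final answer>"}
        action = json.loads(llm_decide(goal, observation, history))

        if action["action"] == "done":
            return action["output"]  # the agent believes the goal is reached

        # Act: translate the action into a browser primitive (click, type,
        # scroll) and execute it over CDP or a driver such as Playwright
        result = browser.execute(action)
        history.append((action, result))

    return None  # step budget exhausted before the task finished
```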
Around this core loop, production agents add several layers:
- Memory — short-term state across the run, plus optional long-term episodic memory that persists across runs of the same task.
- Recovery — retries, backtracking to a known-good state, and falling back to vision when DOM grounding fails.
- Verifiers — a separate model or deterministic check that decides whether the goal was actually reached (a sketch follows this list).
- Identity — a persistent digital identity with credentials, email, and phone so the agent can log in to authenticated sites without human help.
- Observability — per-step screenshots and action logs that make agent runs replayable and debuggable.
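
To make the verifier idea concrete, here is a minimal sketch in which a separate model call grades the final page state against the original goal. `llm_judge` is a hypothetical helper, not any specific product's API; a deterministic check (for example, asserting that a confirmation number appears in the output) slots into the same place.

```python
def verify(goal: str, final_observation: str, claimed_output: str, llm_judge) -> bool:
    """Ask a separate model whether the goal was actually reached."""
    verdict = llm_judge(
        f"Goal: {goal}\n"
        f"Final page state: {final_observation}\n"
        f"Agent's claimed result: {claimed_output}\n"
        "Was the goal actually achieved? Answer yes or no."
    )
    # Only accept an explicit "yes"; anything else fails the run
    return verdict.strip().lower().startswith("yes")
```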
A minimal Notte agent looks like this:
```python
from notte_sdk import NotteClient

client = NotteClient()

with client.Session() as session:
    agent = client.Agent(session=session, max_steps=12)
    response = agent.run(
        task="Find the latest pricing on the Anthropic API page and return it as JSON.",
        url="https://www.anthropic.com",
    )

print(response.output)
```

Behind that one call, Notte spins up an isolated cloud browser session, attaches a digital identity if the task needs authentication, runs the observe-think-act loop, captures a session recording for debugging, and returns the structured output.
Each run progresses through four explicit states — `running` while it iterates, `completed` on success, `failed` on an unrecoverable error, and `stopped` when the step budget is exhausted before the task finishes. The step budget (`max_steps`) is the safety belt that bounds cost and latency; without it, a confused agent can loop forever.
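
That lifecycle is easy to model as a small state machine. The enum and handler below illustrate the four states; they are not the Notte SDK's actual types.

```python
from enum import Enum

class RunStatus(Enum):
    RUNNING = "running"      # the loop is still iterating
    COMPLETED = "completed"  # the task finished successfully
    FAILED = "failed"        # an unrecoverable error ended the run
    STOPPED = "stopped"      # max_steps was exhausted before the task finished

def handle(status: RunStatus, output):
    if status is RunStatus.COMPLETED:
        return output  # use the structured result
    if status is RunStatus.STOPPED:
        raise TimeoutError("step budget exhausted; raise max_steps or split the task")
    if status is RunStatus.FAILED:
        raise RuntimeError("unrecoverable error; inspect the session recording")
```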
Browser agent vs. computer use vs. RPA
These three terms get used interchangeably, but they're not the same.
| | Browser agent | Computer use (CUA) | RPA |
|---|---|---|---|
| Surface | Browser only | Whole desktop OS | Browser + desktop apps |
| Perception | DOM + vision | Vision (screenshots) | Selectors / coordinates |
| Decision making | LLM, re-plans every step | LLM, re-plans every step | Hard-coded script |
| Adapts to UI changes | Yes | Yes | No |
| Best for | Web automation, AI workflows | Anything on a screen | Stable internal tools |
| Cost per run | $$ | $$$ | $ |
A browser agent is the right shape for almost everything that lives in a browser, which is most modern SaaS. Computer use is the superset — slower and more expensive, but works on native apps. RPA is what you reach for when the target UI never changes and you need to run thousands of times a day at the lowest possible cost.
Common use cases
The use cases that have stuck across the first two years of agentic browsers cluster around five patterns:
- Authenticated data extraction. Pulling reports, invoices, or transaction history from dashboards that have no API. The agent logs in, navigates to the right page, and structures the output. See scraping behind authentication.
- Cross-system workflows. Reading data from one tool and entering it into another — moving leads from a webinar platform into a CRM, reconciling supplier portals against an ERP. The agent is glue between systems that don't talk to each other.
- Account creation and KYC. Signing up for services, filling onboarding forms, handling 2FA. Identity-aware agents do this without human intervention.
- AI assistant tool calls. A consumer AI assistant ("book me a table at 8") that invokes a browser agent to actually carry out the action.
- Continuous monitoring. Daily runs that check competitor pricing, watch a regulator's filings page, or sweep job boards — packaged as a scheduled workflow or an API endpoint.
When to use a browser agent
Reach for a browser agent when at least one of these is true:
- The target site has no usable API (or the API is too limited).
- The workflow has many steps and may branch based on what the page shows.
- The site changes often enough that a hand-written scraper would be a maintenance burden.
- The flow requires authentication, 2FA, or other identity-aware steps.
- You're writing the workflow in natural language and don't want to maintain selectors.
Don't reach for a browser agent when:
- A public API exists and is sufficient — call the API directly. Agents add latency and cost.
- The page is a stable, never-changing internal tool — hand-written automation is cheaper and more predictable.
- You need millisecond response times — agents trade latency for flexibility.
Key takeaways
- A browser agent is an LLM running an observe-think-act loop against a real browser. It re-plans on every step, which is what makes it adaptive.
- Browser agents reach the 85% of the web that doesn't have a clean API — authenticated apps, JavaScript SPAs, sites that change often.
- Production agents are the loop plus a handful of wrappers: memory, recovery, verifiers, identity, and observability.
- Browser agents differ from computer use (broader surface, higher cost) and from RPA (no LLM, no adaptation).
- Use them when adaptability matters; skip them when a stable API or a fixed UI is on offer.
If your workflow involves logged-in pages, the next read is what is agent identity. For the perception trade-offs (DOM vs. vision), see how does a browser agent perceive a page.