How does a browser agent perceive a page (vision vs DOM)?
By Lucas Giordano · Co-founder, Notte
TL;DR

Browser agents perceive a page in one of three ways: vision-first (screenshot in, multimodal model out), DOM-first (accessibility tree or filtered DOM in, text-only model out), or hybrid (DOM as the primary signal, vision as fallback). Vision is universal but slow and expensive; DOM is fast and precise but breaks on canvas, video, and shadow-heavy UIs. Production stacks have converged on hybrid.

How does a browser agent perceive a page?

Every browser agent has to answer one question on every iteration: what does the page look like right now? The answer is what gets fed to the LLM, and the choice of representation drives almost every property of the agent — latency per step, cost per step, what kinds of sites it handles, how fragile it is when the UI changes. There's no neutral choice; each path costs something. The interesting work in 2026 has been figuring out where each one wins and how to compose them.

Three perception modes

There are three real options; a minimal sketch of what each one sends to the model follows the list:

  1. Vision-first. A screenshot is sent to a multimodal model (GPT-4o-class, Claude with computer use, Gemini Vision). The model reasons about the image, picks a target by pixel coordinates, and emits an action like click(x=420, y=275). Universal — works on anything that renders pixels — but expensive and slower per step. This is the architecture computer-use models are built on.
  2. DOM-first. The agent extracts the accessibility tree (or a filtered DOM, or both), serializes it as structured text, and feeds it to a text-only model. Actions target elements by ID, role, or selector. Fast, cheap, accurate on text-heavy UIs. Falls apart on canvas content, video players, and pages with heavy shadow DOM.
  3. Hybrid. DOM-first as the default; vision as a fallback when the DOM doesn't carry enough signal. The page comes in as both — structured text for the easy parts, a screenshot for anything the text representation can't disambiguate. Production agents are converging on this shape.
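
To make the three shapes concrete, here is a minimal sketch of the observation payload each mode builds, assuming Playwright's sync Python API. Mode and observe are illustrative names for this post, not any particular framework's interface.

```python
# A minimal sketch of the three observation payloads, assuming Playwright's
# sync Python API. `Mode` and `observe` are illustrative names only.
from enum import Enum

from playwright.sync_api import Page


class Mode(Enum):
    VISION = "vision"   # screenshot in, multimodal model out
    DOM = "dom"         # accessibility tree in, text-only model out
    HYBRID = "hybrid"   # text as primary signal, pixels as fallback


def observe(page: Page, mode: Mode) -> dict:
    """Build the observation payload the model sees on this step."""
    obs: dict = {}
    if mode in (Mode.DOM, Mode.HYBRID):
        # Role-based tree, close to what a screen reader consumes.
        # (Deprecated in recent Playwright releases but still available;
        # a CDP Accessibility.getFullAXTree call is the modern route.)
        obs["ax_tree"] = page.accessibility.snapshot()
    if mode in (Mode.VISION, Mode.HYBRID):
        obs["screenshot"] = page.screenshot()  # PNG bytes for a vision model
    return obs
```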

Trade-offs in practice

|                           | Vision-first              | DOM-first                 | Hybrid                       |
|---------------------------|---------------------------|---------------------------|------------------------------|
| Latency per step          | High (vision inference)   | Low (text-only inference) | Medium                       |
| Cost per step             | High (vision tokens)      | Low                       | Medium                       |
| Text-heavy SaaS           | Slower, sometimes wrong   | Fast, accurate            | Fast, accurate               |
| Canvas / video / WebGL    | Works                     | Fails                     | Works (via vision fallback)  |
| UI scaling and theming    | Sensitive (pixels move)   | Robust (text doesn't)     | Robust                       |
| Cross-platform reach      | Anything rendered         | Browser only              | Browser only                 |
| Implementation complexity | Lowest                    | Medium                    | Highest                      |

The summary: pick vision-first when you need universality and can absorb the cost. Pick DOM-first when you're confident every page you'll touch has a useful DOM. Pick hybrid when the cost-vs-coverage curve actually matters — which is most production deployments.

What "the DOM" actually means in this context

A common confusion: there's no single "the DOM" representation an agent uses. Three flavors get mixed up:

  • Raw DOM. Every node, every attribute, every script. Far too large to send to a model — a typical page is hundreds of KB to several MB serialized.
  • Filtered DOM. Just the interactive elements (buttons, inputs, links) plus surrounding text. Much smaller; easier for the model to reason over.
  • Accessibility tree. What screen readers consume — a clean, role-based representation that maps closely to how a human perceives the page.

Most "DOM-first" agents in 2026 actually use a filtered DOM or the accessibility tree, not the raw DOM. Sending the raw DOM would blow the context budget on a single observation step.

Why hybrid is winning

Two pressures push production agents toward hybrid:

  • Coverage. A pure DOM agent trips over the long tail: image-heavy product pages, charts, drag-and-drop interfaces, anything in an iframe with cross-origin restrictions. Pure vision is overkill for the 80% of pages that have a clean DOM.
  • Cost shape. Vision inference is cheap when used selectively, expensive when used on every step. Hybrid pushes vision to the rare cases where DOM grounding fails, keeping per-run cost close to DOM-only.

Notte's perception layer leans hybrid. The agent prefers structured DOM observation for speed and accuracy and falls back to vision when an action target can't be resolved from text — for example, when the click needs to land on a chart element or an image-rendered button label.
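
The routing decision itself can be small. Below is a hedged sketch of one way to wire DOM-first with a vision fallback, reusing serialize_filtered_dom from the earlier sketch; ask_text_model and ask_vision_model are hypothetical stand-ins for your own LLM client, and none of this is Notte's actual API.

```python
# One way to route between DOM grounding and a vision fallback.
# `ask_text_model` and `ask_vision_model` are hypothetical stand-ins
# for your own LLM client; this is not Notte's actual API.
from typing import Optional


def ask_text_model(goal: str, dom_text: str) -> Optional[dict]:
    """Text-only call. Returns an action like {"click": element_id},
    or None when no target resolves from the text representation."""
    raise NotImplementedError  # wire up your own client here


def ask_vision_model(goal: str, screenshot: bytes) -> dict:
    """Multimodal call that grounds by pixels, e.g. {"click_xy": (420, 275)}."""
    raise NotImplementedError


def next_action(page, goal: str) -> dict:
    dom_text = serialize_filtered_dom(page)   # cheap path, most steps
    action = ask_text_model(goal, dom_text)
    if action is not None:
        return action
    # The DOM didn't carry enough signal (canvas, chart, image-rendered
    # label): escalate to the expensive path with a screenshot.
    return ask_vision_model(goal, page.screenshot())
```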

Common pitfalls

  • Benchmarking on screenshots-only tasks. Some computer-use evals (parts of OSWorld, some red-team probes) score DOM-first agents poorly because the DOM input isn't allowed by the benchmark. Real-world workflows usually have a DOM.
  • Sending the raw DOM to a model. Context-window exhaustion in one step. Filter aggressively or use the accessibility tree (a quick size guard is sketched after this list).
  • Pure vision on text-heavy SaaS. Slow, expensive, and you'll still get OCR errors on small fonts. DOM input is faster and more accurate where it's available.
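
The raw-DOM pitfall is cheap to guard against: measure the serialized page before sending anything. A minimal sketch, reusing serialize_filtered_dom from earlier; the 4-characters-per-token ratio and the budget are rough illustrative numbers, not a real tokenizer.

```python
# Guard against the raw-DOM pitfall: check size before sending anything.
# ~4 chars/token is a rough heuristic, not a tokenizer; the budget is an
# arbitrary example. Reuses serialize_filtered_dom from the earlier sketch.
def choose_observation(page, token_budget: int = 50_000) -> str:
    html = page.content()               # full serialized DOM
    if len(html) // 4 > token_budget:   # would blow the context window
        return serialize_filtered_dom(page)
    return html
```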

Key takeaways

  • Browser agents perceive pages three ways: vision-first (universal, expensive), DOM-first (fast, breaks on canvas), or hybrid (DOM-primary, vision-as-fallback).
  • "DOM-first" almost always means filtered DOM or accessibility tree, not raw DOM — raw is far too large for a model context.
  • Hybrid has won for production: it keeps per-run cost near DOM-only on most pages while preserving universality on the long tail.
  • The closest cousin to vision-first is computer use (CUA); the design pattern is the same, just at full-OS scope.

Build your AI agent on the open web with Notte

Cloud browsers, agent identities, and the Anything API — everything you need to ship reliable browser agents in production.