What is computer use (CUA)?

By Lucas Giordano · Co-founder, Notte
TL;DR

Computer use models, often abbreviated CUA (for computer-using agent), are AI systems trained to look at a screenshot, decide where to click or what to type, and emit pixel-level actions the way a human would. Anthropic's Claude with computer use, OpenAI's Operator, and similar offerings are the canonical examples. CUA wins on framework-agnostic flexibility: it works on anything that renders pixels. It loses on speed and cost everywhere, and on accuracy on text-heavy UIs, where DOM-aware browser agents still dominate.

What is computer use (CUA)?

Most browser agents in production read the page through the DOM. They know what's a button, what's an input, what the labels say, because the browser hands that information over directly. Computer use is the bet that you don't actually need any of that — you can give a model the same screenshot a human would see, ask it to click somewhere, and get the right answer. It's the same observe-think-act loop as a DOM-based browser agent, but the observation is pixels and the actions are coordinates instead of element targets.

What CUA models actually do

A CUA model is a vision-language model fine-tuned (or, in some cases, trained from scratch) to:

  • Take a screenshot as input.
  • Reason about it relative to the user's goal.
  • Emit structured actions in pixel-space: click(x=420, y=275), type("hello"), key("Tab"), scroll(direction="down", amount=300), drag(from=…, to=…).

The model doesn't see the DOM. It doesn't read the accessibility tree. It looks at an image the same way a human does, and clicks where it thinks the right thing is. That's the whole idea, and the reason the approach generalizes across surfaces.
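To make that contract concrete, here is a minimal sketch of the observe-think-act loop in Python. The `CuaModel` and `Desktop` interfaces and the action types are illustrative assumptions for this article, not any vendor's actual SDK (`drag` is omitted for brevity):

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Union

# Pixel-space actions: every target is a coordinate, never an element ID.
@dataclass
class Click:
    x: int
    y: int

@dataclass
class Type:
    text: str

@dataclass
class Key:
    key: str  # e.g. "Tab", "Enter"

@dataclass
class Scroll:
    direction: str  # "up" or "down"
    amount: int     # pixels

Action = Union[Click, Type, Key, Scroll]

class CuaModel(Protocol):  # hypothetical model interface
    def next_action(self, goal: str, screenshot: bytes) -> Optional[Action]: ...

class Desktop(Protocol):   # hypothetical environment interface
    def screenshot(self) -> bytes: ...
    def execute(self, action: Action) -> None: ...

def run_cua_loop(model: CuaModel, env: Desktop, goal: str, max_steps: int = 25) -> None:
    for _ in range(max_steps):
        frame = env.screenshot()                 # observe: raw pixels, no DOM
        action = model.next_action(goal, frame)  # think: one vision inference
        if action is None:                       # model judges the goal complete
            return
        env.execute(action)                      # act: dispatch at coordinates
```

Everything downstream of the screenshot is opaque to the environment: the model never asks for selectors, only for the next frame.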

The canonical examples today are Anthropic's Claude with computer use, OpenAI's Operator (the Computer-Using Agent), and a small number of open-weight models targeting the same shape (UI-TARS, CogAgent variants, others). All of them share the same input/output contract; they differ in which puzzles they happen to be best at.

Where CUA wins

CUA's pitch is generality. Three places where that translates into real wins:

  • Native desktop applications. A Mac app, an Excel spreadsheet, an enterprise tool that has no DOM at all. Browser agents are by definition useless here; CUA isn't.
  • Canvas-based, video, or rendered-only UIs. Sites where critical interactions happen on a <canvas> (drawing tools, custom design apps, some graph editors) leave the DOM nearly empty. CUA reads what the user reads.
  • Cross-environment portability. A workflow that has to span a browser, a terminal, and a desktop app collapses cleanly under one model with one action language.

Outside those, CUA's advantage thins out fast.

Where DOM-based browser agents still beat CUA

Inside the browser, CUA pays a real cost:

|  | DOM-based browser agent | Computer use (CUA) |
| --- | --- | --- |
| Observation | Structured DOM + accessibility tree (often + screenshot) | Screenshot only |
| Action target | Element ID / selector | Pixel coordinates |
| Latency per step | Lower (cheap to read DOM) | Higher (vision inference per step) |
| Cost per step | Lower | Higher (vision tokens) |
| Accuracy on text-heavy UIs | Higher | Lower (OCR + grounding errors) |
| Robustness to UI scaling / theming | High (text doesn't move) | Moderate (pixel locations shift) |
| Reach beyond the browser | Browser only | Anything rendered |

The summary: for browser-only workflows on text-heavy SaaS — the 80% case for production agents — DOM-based agents are faster, cheaper, and more accurate. CUA earns its keep when the task can't fit in a DOM-shaped box.

Hybrid patterns

The interesting frontier isn't "DOM vs. CUA" — it's combining them. Production stacks increasingly run a DOM-first agent with vision as a fallback: try to ground actions against the DOM; if the target element can't be resolved (canvas, image-text, dynamic shadow content), drop down to a CUA-style screenshot pass for that step.
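A minimal sketch of that per-step fallback, assuming a Playwright-style `page` object; `resolve_dom_target` and `cua_locate` are hypothetical helpers standing in for a DOM grounder and a CUA-style vision model, not any real library's API:

```python
def act(page, instruction: str) -> None:
    # DOM-first: try to resolve the instruction to a concrete element.
    # resolve_dom_target is a hypothetical helper that parses the DOM /
    # accessibility tree and returns an element handle, or None.
    element = resolve_dom_target(page, instruction)
    if element is not None:
        element.click()  # fast, cheap, selector-grounded action
        return

    # Fallback: the DOM carried no signal (canvas, image text, shadow
    # content), so take one screenshot pass through a CUA-style model.
    screenshot = page.screenshot()
    x, y = cua_locate(screenshot, instruction)  # hypothetical vision grounder
    page.mouse.click(x, y)                      # pixel-coordinate click
```

The key design choice is that vision is paid for per step that needs it, not per step of the whole workflow.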

Notte's perception layer leans in this direction: the agent prefers DOM grounding for speed and accuracy, and falls back to vision when the DOM doesn't carry enough signal. See "How does a browser agent perceive a page (vision vs. DOM)" for the deeper trade-off.

Common pitfalls

  • Treating CUA as the default for browser work. It's not. Browser-native agents win on cost and accuracy; reach for CUA when DOM grounding fails.
  • Benchmarking on screenshot-only tasks. CUA-only benchmarks (some of OSWorld, some computer-use evals) understate DOM-based agents because the DOM input isn't allowed. Real-world workflows usually have a DOM.
  • Underestimating the latency tax. A vision call per step takes seconds; a DOM extract takes milliseconds, and multi-step flows compound the difference (see the back-of-envelope sketch after this list).
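Here is that back-of-envelope calculation; the per-step timings are illustrative assumptions, not measured benchmarks:

```python
steps = 20              # a typical multi-page workflow
dom_extract_s = 0.05    # assumed cost of one DOM / accessibility-tree read
vision_call_s = 3.0     # assumed cost of one vision inference per step

print(f"DOM-grounded observation overhead: ~{steps * dom_extract_s:.1f}s")
print(f"CUA-style observation overhead:    ~{steps * vision_call_s:.1f}s")
# DOM-grounded observation overhead: ~1.0s
# CUA-style observation overhead:    ~60.0s
```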

Key takeaways

  • Computer use (CUA) is a class of AI models that operate a computer through screenshots and pixel-space actions, the way a human would.
  • It generalizes beyond the browser — native apps, canvas UIs, cross-environment workflows — at the cost of speed, accuracy, and per-step price.
  • Inside the browser on text-heavy UIs, DOM-based agents beat CUA on every dimension that matters in production.
  • Hybrid agents (DOM-first, vision-as-fallback) are the emerging answer; see "vision vs. DOM" for how the layers compose.

Build your AI agent on the open web with Notte

Cloud browsers, agent identities, and the Anything API — everything you need to ship reliable browser agents in production.