What is computer use (CUA)?

By Lucas Giordano · Co-founder, Notte
TL;DR

Computer use models, often abbreviated CUA (for computer-using agent), are AI systems trained to look at a screenshot, decide where to click or what to type, and emit pixel-level actions the way a human would. Anthropic's Claude with computer use, OpenAI's Operator, and similar offerings are the canonical examples. CUA wins on framework-agnostic flexibility: it works on anything that renders pixels. It loses on speed and cost everywhere, and on accuracy on text-heavy UIs, where DOM-aware browser agents still dominate.

What is computer use (CUA)?

Most browser agents in production read the page through the DOM. They know what's a button, what's an input, what the labels say, because the browser hands that information over directly. Computer use is the bet that you don't actually need any of that — you can give a model the same screenshot a human would see, ask it to click somewhere, and get the right answer. It's the same observe-think-act loop as a DOM-based browser agent, but the observation is pixels and the actions are coordinates instead of element targets.

What CUA models actually do

A CUA model is a vision-language model fine-tuned (or, in some cases, trained from scratch) to:

  • Take a screenshot as input.
  • Reason about it relative to the user's goal.
  • Emit structured actions in pixel-space: click(x=420, y=275), type("hello"), key("Tab"), scroll(direction="down", amount=300), drag(from=…, to=…).

The model doesn't see the DOM. It doesn't read the accessibility tree. It looks at an image the same way a human does, and clicks where it thinks the right thing is. That's the whole idea, and the reason the approach generalizes across surfaces.
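To make that contract concrete, here is a minimal sketch of the observe-think-act loop in Python. The `CuaModel` and `Desktop` interfaces and the action types are illustrative assumptions for this article, not any vendor's actual SDK (`drag` is omitted for brevity):

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Union

# Pixel-space actions: every target is a coordinate, never an element ID.
@dataclass
class Click:
    x: int
    y: int

@dataclass
class Type:
    text: str

@dataclass
class Key:
    key: str  # e.g. "Tab", "Enter"

@dataclass
class Scroll:
    direction: str  # "up" or "down"
    amount: int     # pixels

Action = Union[Click, Type, Key, Scroll]

class CuaModel(Protocol):  # hypothetical model interface
    def next_action(self, goal: str, screenshot: bytes) -> Optional[Action]: ...

class Desktop(Protocol):   # hypothetical environment interface
    def screenshot(self) -> bytes: ...
    def execute(self, action: Action) -> None: ...

def run_cua_loop(model: CuaModel, env: Desktop, goal: str, max_steps: int = 25) -> None:
    for _ in range(max_steps):
        frame = env.screenshot()                 # observe: raw pixels, no DOM
        action = model.next_action(goal, frame)  # think: one vision inference
        if action is None:                       # model judges the goal complete
            return
        env.execute(action)                      # act: dispatch at coordinates
```

Everything downstream of the screenshot is opaque to the environment: the model never asks for selectors, only for the next frame.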

The canonical examples today are Anthropic's Claude with computer use, OpenAI's Operator (the Computer-Using Agent), and a small number of open-weight models targeting the same shape (UI-TARS, CogAgent variants, others). All of them share the same input/output contract; they differ in which puzzles they happen to be best at.

Where CUA wins

CUA's pitch is generality. Three places where that translates into real wins:

  • Native desktop applications. A Mac app, an Excel spreadsheet, an enterprise tool that has no DOM at all. Browser agents are by definition useless here; CUA isn't.
  • Canvas-based, video, or rendered-only UIs. Sites where critical interactions happen on a <canvas> (drawing tools, custom design apps, some graph editors) leave the DOM nearly empty. CUA reads what the user reads.
  • Cross-environment portability. A workflow that has to span a browser, a terminal, and a desktop app collapses cleanly under one model with one action language.

Outside those, CUA's advantage thins out fast.

Where DOM-based browser agents still beat CUA

Inside the browser, CUA pays a real cost:

|  | DOM-based browser agent | Computer use (CUA) |
| --- | --- | --- |
| Observation | Structured DOM + accessibility tree (often + screenshot) | Screenshot only |
| Action target | Element ID / selector | Pixel coordinates |
| Latency per step | Lower (cheap to read DOM) | Higher (vision inference per step) |
| Cost per step | Lower | Higher (vision tokens) |
| Accuracy on text-heavy UIs | Higher | Lower (OCR + grounding errors) |
| Robustness to UI scaling / theming | High (text doesn't move) | Moderate (pixel locations shift) |
| Reach beyond the browser | Browser only | Anything rendered |

The summary: for browser-only workflows on text-heavy SaaS — the 80% case for production agents — DOM-based agents are faster, cheaper, and more accurate. CUA earns its keep when the task can't fit in a DOM-shaped box.

Hybrid patterns

The interesting frontier isn't "DOM vs. CUA" — it's combining them. Production stacks increasingly run a DOM-first agent with vision as a fallback: try to ground actions against the DOM; if the target element can't be resolved (canvas, image-text, dynamic shadow content), drop down to a CUA-style screenshot pass for that step.
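A minimal sketch of that per-step fallback, assuming a Playwright-style `page` object; `resolve_dom_target` and `cua_locate` are hypothetical helpers standing in for a DOM grounder and a CUA-style vision model, not any real library's API:

```python
def act(page, instruction: str) -> None:
    # DOM-first: try to resolve the instruction to a concrete element.
    # resolve_dom_target is a hypothetical helper that parses the DOM /
    # accessibility tree and returns an element handle, or None.
    element = resolve_dom_target(page, instruction)
    if element is not None:
        element.click()  # fast, cheap, selector-grounded action
        return

    # Fallback: the DOM carried no signal (canvas, image text, shadow
    # content), so take one screenshot pass through a CUA-style model.
    screenshot = page.screenshot()
    x, y = cua_locate(screenshot, instruction)  # hypothetical vision grounder
    page.mouse.click(x, y)                      # pixel-coordinate click
```

The key design choice is that vision is paid for per step that needs it, not per step of the whole workflow.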

Notte's perception layer leans in this direction: the agent prefers DOM grounding for speed and accuracy, and falls back to vision when the DOM doesn't carry enough signal. See "How does a browser agent perceive a page (vision vs. DOM)" for the deeper trade-off.

Common pitfalls

  • Treating CUA as the default for browser work. It's not. Browser-native agents win on cost and accuracy; reach for CUA when DOM grounding fails.
  • Benchmarking on screenshot-only tasks. CUA-only benchmarks (some of OSWorld, some computer-use evals) understate DOM-based agents because the DOM input isn't allowed. Real-world workflows usually have a DOM.
  • Underestimating the latency tax. A vision call per step takes seconds; a DOM extract takes milliseconds, and multi-step flows compound the difference (see the back-of-envelope sketch after this list).
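Here is that back-of-envelope calculation; the per-step timings are illustrative assumptions, not measured benchmarks:

```python
steps = 20              # a typical multi-page workflow
dom_extract_s = 0.05    # assumed cost of one DOM / accessibility-tree read
vision_call_s = 3.0     # assumed cost of one vision inference per step

print(f"DOM-grounded observation overhead: ~{steps * dom_extract_s:.1f}s")
print(f"CUA-style observation overhead:    ~{steps * vision_call_s:.1f}s")
# DOM-grounded observation overhead: ~1.0s
# CUA-style observation overhead:    ~60.0s
```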

Key takeaways

  • Computer use (CUA) is a class of AI models that operate a computer through screenshots and pixel-space actions, the way a human would.
  • It generalizes beyond the browser — native apps, canvas UIs, cross-environment workflows — at the cost of speed, accuracy, and per-step price.
  • Inside the browser on text-heavy UIs, DOM-based agents beat CUA on every dimension that matters in production.
  • Hybrid agents (DOM-first, vision-as-fallback) are the emerging answer; see "vision vs. DOM" for how the layers compose.

Build your AI agent on the open web with Notte

Cloud browsers, agent identities, and the Anything API — everything you need to ship reliable browser agents in production.