
What is browser agent observability?

By Lucas Giordano · Co-founder, Notte
TL;DR

Browser agent observability is the telemetry stack that makes agent runs debuggable after the fact: per-step screenshots, DOM snapshots, the LLM prompts and responses, the actions that were attempted and whether they succeeded, and a full session replay. Without it, every failure looks the same and a five-minute fix turns into a three-hour debugging session. With it, you can scrub through the run, see exactly where the agent went wrong, and fix the prompt or the verifier.

What is browser agent observability?

A failed agent run is the worst possible debugging surface. The model is stateless, the page state has moved on, the user's task description is fine, and yet the run produced something wrong. Without observability you have a single output that says "completed" or "failed" and no way to figure out why. Observability is the standing investment that turns "this agent broke, no idea where" into "step 7 of run abc123 misclicked the second tab; the LLM had stale page context." It's not optional for production stacks. The agent itself is barely debuggable; the telemetry is what makes it tractable.

What needs to be captured

A useful observability stack records six classes of artifact, ideally at every step:

  • Screenshots of the page before and after each action. The single highest-leverage debugging artifact — most failures are obvious within five seconds of looking at the screenshot pair.
  • DOM snapshots (or accessibility tree extracts) — the structured input the agent actually saw, separate from what a human would see.
  • The action emitted by the LLM — the structured tool call (target, operation, arguments) plus the model's reasoning if the model exposed it.
  • The action result — succeeded / failed, the resulting URL, any error returned by the browser layer.
  • The LLM prompts and responses themselves — what got sent to the model and what it actually returned.
  • A session recording — the live video / DOM replay of what the agent did, end to end, scrubbable.

Notte exposes most of this through the standard agent run object plus the session recording / replay tooling.
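
To make "one record per step" concrete, here's a minimal sketch of what a per-step record could look like. It's illustrative only: the class and field names (StepRecord, dom_snapshot, and so on) are made up for this example and are not Notte's actual run schema.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Hypothetical per-step telemetry record. Field names are illustrative,
# not Notte's actual run object schema.
@dataclass
class StepRecord:
    step_index: int
    screenshot_before: bytes            # page screenshot captured before the action
    screenshot_after: bytes             # page screenshot captured after the action
    dom_snapshot: str                   # DOM / accessibility-tree extract the agent saw
    llm_prompt: str                     # what was sent to the model
    llm_response: str                   # what the model returned (reasoning included, if exposed)
    action: dict[str, Any]              # structured tool call: target, operation, arguments
    succeeded: bool                     # action result
    resulting_url: Optional[str] = None
    error: Optional[str] = None         # error returned by the browser layer, if any

@dataclass
class RunTrace:
    run_id: str
    steps: list[StepRecord] = field(default_factory=list)
    recording_url: Optional[str] = None  # link to the scrubbable session replay
```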

Where each artifact actually helps

Different failures live at different layers, and the right artifact disambiguates them:

| Failure mode | Best signal in |
| --- | --- |
| The agent clicked the wrong thing | Screenshot pair (before/after the click) |
| The LLM picked the wrong action | Prompt + response log (was the page context misleading?) |
| The page changed unexpectedly | DOM snapshot diff between consecutive steps |
| The action layer failed silently | Action result log (succeeded=false plus reason) |
| The whole run "succeeded" but didn't | Session recording compared to the goal |
| The agent looped on the same step | Action history sequence |

The pattern: no single artifact debugs every failure. A useful stack captures all six at every step and makes them easy to scrub through.
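
As one example of putting the action history to work, here's a small sketch that flags the looping failure from the table above by scanning the per-step action sequence for the same action repeating several times in a row. It assumes a list of action dicts, such as the hypothetical per-step records sketched earlier.

```python
from typing import Any, Optional

def detect_loop(actions: list[dict[str, Any]], window: int = 3) -> Optional[int]:
    """Return the step index where the same action starts repeating `window`
    times in a row, or None if the history shows no such loop."""
    for i in range(len(actions) - window + 1):
        if all(a == actions[i] for a in actions[i + 1 : i + window]):
            return i
    return None

# Usage with the hypothetical RunTrace sketched earlier:
#   loop_start = detect_loop([step.action for step in run.steps])
```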

Storing it without violating ZDR

Observability and zero data retention are in tension by design. Screenshots and DOM snapshots are exactly the kind of session content ZDR is supposed to drop. Production stacks resolve this in two ways:

  • Retain by default for development; drop by default for production. The same agent can be deployed with full observability in dev and ZDR in production. Notte exposes this as a session-level configuration, not a global toggle.
  • Mask aggressively when retention is on. PII in screenshots gets redacted; secrets in DOM extracts get scrubbed. The retained artifact is debuggable but not a compliance liability. See PII handling.

Skipping observability entirely "for compliance" is the wrong move — you lose the ability to debug, and the alternative isn't more compliant, it's blind. The right move is per-environment scoping plus masking.
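
Here's a rough sketch of what per-environment scoping plus masking can look like in practice. The profile and flag names (retain_artifacts, mask_pii) are hypothetical, not Notte's actual session options; the point is that retention is a per-environment decision and anything retained goes through a masking pass first.

```python
import os
import re
from typing import Optional

# Hypothetical per-environment observability profiles; flag names are
# illustrative, not Notte's session configuration.
OBSERVABILITY_PROFILES = {
    "dev":  {"retain_artifacts": True,  "mask_pii": True, "retention_days": 14},
    "prod": {"retain_artifacts": False, "mask_pii": True, "retention_days": 0},  # ZDR
}

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def scrub_dom_extract(dom: str) -> str:
    """Very rough masking pass: redact email-shaped strings before retention.
    Real masking would cover more PII classes (names, card numbers, tokens)."""
    return EMAIL_RE.sub("[redacted-email]", dom)

def should_retain(env: Optional[str] = None) -> bool:
    """Decide retention per environment, defaulting to the strictest profile."""
    env = env or os.getenv("APP_ENV", "dev")
    profile = OBSERVABILITY_PROFILES.get(env, OBSERVABILITY_PROFILES["prod"])
    return profile["retain_artifacts"]
```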

What replay actually unlocks

The single highest-value observability artifact is the replay: a scrubbable timeline showing what the agent saw at each step, what it tried, and what happened next. Three things become tractable that were impossible without it:

  • Diffing two runs. "Yesterday this worked, today it doesn't." Side-by-side replay surfaces the page-level difference fast.
  • Sharing failures. A replay link is a 30-second handoff to whoever's going to fix the bug. Nothing else compresses an agent failure that efficiently.
  • Verifying claimed successes. "The agent says it completed; let me watch the run." Catches the run-level failures a binary success flag misses.
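
For the "worked yesterday, broken today" case, you can approximate the replay diff programmatically before even opening the replays: diff the two runs' action sequences and the divergence point usually jumps out. A minimal sketch, assuming the hypothetical RunTrace records from earlier:

```python
import difflib
import json

def diff_runs(yesterday: "RunTrace", today: "RunTrace") -> str:
    """Line-level diff of two runs' action sequences; localizes where a
    'worked yesterday, broken today' pair diverges before watching replays."""
    def actions(run: "RunTrace") -> list[str]:
        return [json.dumps(step.action, sort_keys=True) for step in run.steps]

    return "\n".join(difflib.unified_diff(
        actions(yesterday), actions(today),
        fromfile=yesterday.run_id, tofile=today.run_id, lineterm=""))
```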

Common pitfalls

  • Capturing only the final state. The bug is almost never on the last step; it's three steps earlier. Capture every step.
  • Skipping the prompt / response log. Without it, you can see the agent did the wrong thing but not why. The model's reasoning lives in the prompt-response pair.
  • Long retention with no masking. Screenshots of dashboards stored for 30 days are a PII liability. Mask or shorten retention to fit the use case.
  • No replay UX. Raw logs and screenshots are 10x less useful than a scrubbable timeline. Build the UI or use one that exists.

Key takeaways

  • Observability is the standing telemetry investment that makes agent failures debuggable: screenshots, DOM snapshots, LLM prompts/responses, action logs, session replay.
  • Different failure classes need different artifacts; a useful stack captures all of them at every step.
  • Observability and zero data retention are in tension; resolve with per-environment scoping plus PII masking, not by skipping observability.
  • The replay is the highest-leverage artifact — a scrubbable per-step timeline that compresses agent failures into a 30-second handoff.
