MLOps · #Observability #Agents #Reliability #Runbook

AI Agent Observability Runbook: What to Measure Before It Burns

A practical runbook from debugging an agent stack where HTTP was green, dashboards were calm, and the agent was quietly doing interpretive dance with tool calls.

Misha Lubich

May 28, 2026 · 4 min read


The scariest agent incident I saw this quarter did not throw a 500. It returned 200 OK and gave a polished, wrong answer with the confidence of a VC partner explaining databases.

That is the problem with agent observability: uptime is not enough. The service can be alive while the reasoning path is on fire, the tool schema is drifting, and the fallback policy is quietly converting user trust into confetti.

This runbook came from instrumenting an internal agent workflow that routed research and support-style tasks through retrieval, tool calls, and structured output. The first dashboards told us nothing useful. The second version started catching failures before users did.

The first dashboard was basically a nightlight

We initially tracked:

request count
HTTP errors
latency
token cost

Useful, but incomplete. It told us whether the API was breathing. It did not tell us whether the agent was thinking, looping, guessing, or laundering stale retrieval through a nice paragraph.

The first red flag came from cost: one workflow showed a 2.3x cost spike with no matching traffic spike. That cannot happen unless the agent is doing extra work per request. Spoiler: it was. It was retrying one failing retrieval path and then asking a judge model to bless the chaos.

Layer 1: execution health

Capture for every run:

total steps
tool calls by tool type
retries per step
terminal status: success, fail, timeout, canceled
wall-clock duration per run, with p95/p99 across runs
cost per completed workflow

This layer answers: "Is the engine running, and is it running like a normal engine instead of a Roomba trapped under a couch?"
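A minimal sketch of what one per-run record and its rollup could look like. The names here (`RunRecord`, `summarize`, and every field name) are hypothetical, not the schema from our stack, and "cost per completed workflow" is assumed to mean total spend divided by the number of successful runs.

```python
from collections import Counter
from dataclasses import dataclass, field
from statistics import quantiles


@dataclass
class RunRecord:
    run_id: str
    status: str          # "success", "fail", "timeout", or "canceled"
    duration_s: float    # wall-clock duration of the whole run
    cost_usd: float
    total_steps: int = 0
    retries: int = 0
    tool_calls: Counter = field(default_factory=Counter)  # calls keyed by tool type


def summarize(runs):
    """Roll per-run records up into Layer 1 numbers (needs at least two runs)."""
    durations = [r.duration_s for r in runs]
    # quantiles(n=100) returns 99 cut points: index 94 is p95, index 98 is p99.
    cuts = quantiles(durations, n=100, method="inclusive")
    completed = sum(1 for r in runs if r.status == "success")
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "runs": len(runs),
        "status_counts": dict(Counter(r.status for r in runs)),
        "p95_duration_s": cuts[94],
        "p99_duration_s": cuts[98],
        # Failed runs still cost money, so they stay in the numerator.
        "cost_per_completed": total_cost / completed if completed else None,
    }
```

Keeping the run ID on every record is what lets the later layers attach to the same run.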

Layer 2: decision quality

Execution health will not catch confident nonsense. Add behavior signals:

unsupported-claim rate from sampled outputs
policy compliance score
action reversals and retry loops
judge score for task completion
evidence freshness attached to final answer

This layer answers: "Is the engine making good choices?"
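Two of these signals reduce to simple arithmetic once the review data exists. The sketch below assumes each sampled output has already been reviewed and tagged with a count of claims and unsupported claims, and that each final answer carries timestamps for its sources. The function names and dict keys are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def unsupported_claim_rate(samples):
    """Share of reviewed claims that had no supporting evidence.

    `samples` is a list of dicts with hypothetical keys `claims` and
    `unsupported`, filled in by whoever (or whatever) reviews the sample.
    """
    total = sum(s["claims"] for s in samples)
    unsupported = sum(s["unsupported"] for s in samples)
    return unsupported / total if total else 0.0


def evidence_age(source_timestamps, answered_at):
    """Age of the oldest source attached to a final answer."""
    return answered_at - min(source_timestamps)


# Illustrative values only.
answered = datetime(2026, 5, 28, 12, 0, tzinfo=timezone.utc)
sources = [answered - timedelta(hours=2), answered - timedelta(days=3)]
reviewed = [{"claims": 8, "unsupported": 1}, {"claims": 12, "unsupported": 3}]
print(unsupported_claim_rate(reviewed))
print(evidence_age(sources, answered))
```

Using the oldest source rather than the average is a deliberate choice here: one stale document is enough to poison a nice paragraph.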

[Figure: Agent Incident Triage Decision Tree]

Layer 3: business impact

This is where teams get shy because the numbers are political. Measure them anyway.

For this stack, the useful metrics were:

manual escalation rate
time-to-resolution on agent-assisted tasks
user correction rate
repeated question rate within 24 hours
cost per successful task, not cost per request

Cost per request is a vanity metric. Cost per successful task is the adult table.
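A quick worked example of why the two metrics diverge. The numbers are made up for illustration: two weeks with identical spend and traffic, but a different success count.

```python
def cost_per_request(total_cost, requests):
    return total_cost / requests


def cost_per_successful_task(total_cost, successful_tasks):
    return total_cost / successful_tasks


# Hypothetical weeks: same spend, same traffic, fewer successes in week B.
week_a = {"total_cost": 100.0, "requests": 1000, "successes": 800}
week_b = {"total_cost": 100.0, "requests": 1000, "successes": 500}

for name, week in (("A", week_a), ("B", week_b)):
    per_request = cost_per_request(week["total_cost"], week["requests"])
    per_success = cost_per_successful_task(week["total_cost"], week["successes"])
    print(f"week {name}: per request {per_request:.3f}, per success {per_success:.3f}")
```

Cost per request is identical in both weeks, while cost per success rises from 0.125 to 0.200. Only the second number notices that the agent got worse.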

Failure signatures I now alert on

median step count jumps more than 30% week-over-week
one tool suddenly dominates call volume
quality score falls while latency stays flat
latency rises while quality stays flat
evidence age exceeds the source freshness SLO
judge calls grow faster than primary model calls

That last one is sneaky. If judge usage grows faster than primary model calls, you might be using evals as a mop instead of fixing the leak.
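Three of these signatures can be sketched as plain functions over weekly aggregates. The 30% threshold comes from the list above. The 60% share used for "dominates" is an assumption to tune per stack, and the function names and dict keys are hypothetical.

```python
from collections import Counter
from statistics import median


def step_count_jump(last_week_steps, this_week_steps, threshold=0.30):
    """True if the median step count rose more than `threshold` week-over-week."""
    before, after = median(last_week_steps), median(this_week_steps)
    return before > 0 and (after - before) / before > threshold


def dominant_tool(tool_calls, share=0.6):
    """Return the tool whose share of calls exceeds `share` (assumed cutoff), else None."""
    counts = Counter(tool_calls)
    if not counts:
        return None
    tool, n = counts.most_common(1)[0]
    return tool if n / sum(counts.values()) > share else None


def judge_outpacing(prev, curr):
    """True if judge calls grew faster than primary model calls between two periods.

    Cross-multiplying compares the two growth ratios without dividing,
    so a zero count in the previous period does not raise.
    """
    return curr["judge_calls"] * prev["primary_calls"] > curr["primary_calls"] * prev["judge_calls"]


# Illustrative values only.
print(step_count_jump([4, 5, 6], [6, 7, 8]))
print(dominant_tool(["search"] * 7 + ["sql"] * 3))
print(judge_outpacing(
    {"judge_calls": 100, "primary_calls": 1000},
    {"judge_calls": 200, "primary_calls": 1100},
))
```

A real "suddenly dominates" check would also compare against the previous period; this version only looks at the current share.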

The three dashboards I actually keep open

Runtime panel: throughput, queue depth, timeout rates, tool errors
Behavior panel: loops, judge scores, policy violations, source freshness
Outcome panel: escalation, resolution time, correction rate, cost per success

The important part is joining them by run ID. Without that, you get three dashboards and one meeting where everyone performs confidence.
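Joining by run ID can be as small as this. The sketch assumes each panel can export a mapping of run ID to metrics. The function name, the panel names, and the metric keys are all hypothetical.

```python
def join_by_run_id(**panels):
    """Merge {run_id: metrics} dicts from several panels into one row per run."""
    run_ids = set().union(*(rows.keys() for rows in panels.values()))
    joined = {}
    for run_id in sorted(run_ids):
        row = {}
        for name, rows in panels.items():
            # Prefix each metric with its panel name so keys cannot collide.
            for key, value in rows.get(run_id, {}).items():
                row[f"{name}.{key}"] = value
        # Record which panels have no data for this run instead of dropping it.
        row["missing"] = [name for name, rows in panels.items() if run_id not in rows]
        joined[run_id] = row
    return joined


# Illustrative values only.
rows = join_by_run_id(
    runtime={"r1": {"timeout": False}, "r2": {"timeout": True}},
    behavior={"r1": {"judge_score": 0.9}},
    outcome={"r1": {"escalated": False}, "r2": {"escalated": True}},
)
print(rows["r2"])
```

Runs that are missing from a panel are kept and flagged, because a run with runtime data but no behavior data is itself worth looking at.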

Runbook response levels

SEV-3: local quality drop in non-critical workflow; gate rollout
SEV-2: sustained drift or frequent escalation in core workflow; partial rollback
SEV-1: safety/policy failures or major outcome regression; deterministic fallback

The boring rule: rollback should not require a new deployment. If you need to ship code to stop the bleeding, your feature flag strategy is not a strategy. It is a wish with YAML.
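One way to read the boring rule in code: the response level flips a runtime flag, and the request path checks that flag on every call. This is a sketch under assumptions. `FlagStore` is an in-memory stand-in for whatever flag service you run, and the mode names are invented labels for the three responses above.

```python
RESPONSES = {
    "SEV-3": "gate_rollout",
    "SEV-2": "partial_rollback",
    "SEV-1": "deterministic_fallback",
}


class FlagStore:
    """In-memory stand-in for a runtime flag service (hypothetical)."""

    def __init__(self):
        self._flags = {}

    def set(self, workflow, mode):
        self._flags[workflow] = mode

    def get(self, workflow):
        return self._flags.get(workflow, "normal")


def respond(flags, workflow, severity):
    """Apply the response for a severity level by flipping a flag, not by deploying."""
    mode = RESPONSES[severity]
    flags.set(workflow, mode)
    return mode


def handle_request(flags, workflow, agent, fallback):
    # The mode is read per request, so a flag flip takes effect without a redeploy.
    if flags.get(workflow) == "deterministic_fallback":
        return fallback()
    return agent()


# Illustrative usage.
flags = FlagStore()
before = handle_request(flags, "support", lambda: "agent answer", lambda: "canned answer")
respond(flags, "support", "SEV-1")
after = handle_request(flags, "support", lambda: "agent answer", lambda: "canned answer")
print(before, "->", after)
```

Only the SEV-1 path is wired into `handle_request` here. Gating a rollout and partial rollback would need their own checks.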

Takeaway

Monitor agent systems like products, not like endpoints. Uptime tells you the server is alive. Observability tells you whether the agent is doing the job without quietly converting GPU budget into performance art.

