AI Architecture · #Agents #Architecture #Reliability #Postmortem

What Broke Our Agent Stack in Q2 (and How We Fixed It)

A field report from a quarter where the demos looked great, the dashboards looked calm, and the agent stack quietly set small piles of money on fire.

Misha Lubich

June 15, 2026 · 7 min read

"Why did the agent spend $42 to fail the same way nine times?" is not the question you want to hear before coffee.

That was the mood of one Q2 review. From the outside, the agent work looked solid: faster prototypes, cleaner demos, fewer manual handoffs. Internally, we had the classic Silicon Valley disease where everything is "basically working" right up until the spreadsheet starts doing spoken-word poetry.

The project was a production-style agent stack for knowledge work: retrieval, tool calls, structured outputs, eval gates, and a small orchestration layer that routed tasks through specialist agents. Think support triage, internal research, and workflow automation—not a toy chatbot wearing a blazer.

The useful part: most failures were not mystical model failures. They were boring system failures. Boring is good. Boring means you can fix it without summoning a new foundation model and pretending the invoice is strategy.

The scorecard before we got serious

Before the fixes, the stack looked like this on our internal runs:

median task completion: 71% across 420 sampled agent runs
p95 latency: 18.4s on multi-tool workflows
repeated-failure loops: 11% of runs
manual escalation: 29% on policy-heavy tasks
most embarrassing metric: one agent called the same retrieval tool seven times with semantically identical queries and then concluded "insufficient evidence"

That last one is the AI equivalent of opening the fridge seven times and blaming the refrigerator for not becoming dinner.

[Figure: Q2 Agent Failure Mix Before Fixes]

The uncomfortable lesson: only about 10% of the failures were meaningfully "the model is dumb." The rest were architecture, data, contracts, and incentives.

Failure mode 1: stale retrieval made good prompts look bad

The first bug looked like a prompt problem. Users said answers were "mostly right" but wrong on recently changed policy details. The reflex was to tweak the system prompt. We did, because apparently every AI engineer must pay the prompt-tax before admitting the database is stale.

The real issue: one ingestion path refreshed within 45 minutes. Another took 18-26 hours. Both fed the same retriever. The model got a confident, outdated paragraph and did exactly what we asked: synthesize the evidence. Beautifully wrong. Michelin-star hallucination, locally sourced.

What failed first

We tried adding "prefer recent evidence" to the prompt.
We tried increasing top-k from 8 to 16.
We tried a bigger reranker.

All three were expensive ways to avoid saying: "our index freshness is not a vibe; it needs an SLO."

What fixed it

freshness SLO per source class
ingestion lag alerts at 2x expected refresh window
test cases seeded with documents changed in the last 24 hours
response metadata that exposed source timestamp in review traces

After this, recent-document misses dropped from 19% to 4% in the sampled suite without changing the model.
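A freshness SLO is easier to argue about once it is executable. The following is a minimal sketch of the lag check, assuming each source class declares an expected refresh window and the ingestion job records a last-ingested timestamp. The source class names, the `SourceStatus` shape, and the 24-hour window are illustrative assumptions rather than the actual config; only the 2x alert multiplier comes from the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical source classes and refresh windows. The 45-minute figure echoes
# the fast ingestion path described earlier; the 24-hour one is an assumed target.
EXPECTED_REFRESH = {
    "policy_docs": timedelta(minutes=45),
    "wiki_export": timedelta(hours=24),
}
ALERT_MULTIPLIER = 2  # alert at 2x the expected refresh window


@dataclass
class SourceStatus:
    source_class: str
    last_ingested_at: datetime


def stale_sources(statuses, now):
    """Return (source_class, lag) pairs whose ingestion lag breaches the alert threshold."""
    breaches = []
    for status in statuses:
        lag = now - status.last_ingested_at
        threshold = EXPECTED_REFRESH[status.source_class] * ALERT_MULTIPLIER
        if lag > threshold:
            breaches.append((status.source_class, lag))
    return breaches


# Example: check the current lag for two sources.
now = datetime.now(timezone.utc)
statuses = [
    SourceStatus("policy_docs", now - timedelta(minutes=30)),
    SourceStatus("wiki_export", now - timedelta(hours=60)),
]
for source_class, lag in stale_sources(statuses, now):
    print(f"ALERT {source_class}: ingestion lag {lag}")
```

The point of keying thresholds by source class is that a slow path and a fast path feeding the same retriever get judged against their own windows, so the slow one cannot hide behind the fast one's average.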

Failure mode 2: schema drift at tool boundaries

A tool response changed from owner_name to owner.display_name. Perfectly reasonable refactor. Also a tiny landmine.

The agent did not crash cleanly. It degraded. It started routing tasks to a fallback owner and then asked the LLM to "infer" responsibility from surrounding text. That is how you get a very confident assistant assigning infra tickets to the person who wrote the README in 2022.

What fixed it

contract tests for every high-traffic tool payload
runtime validators at the tool boundary, not three layers later
typed error envelopes that agents can route on
canary deployment for tool schema changes

The conservative take: an agent should not be allowed to improvise around broken plumbing. If the contract breaks, fail loudly and hand the operator a useful error. Jazz is for clubs, not production incident routing.
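To make "fail loudly at the boundary" concrete, here is a rough sketch of a runtime validator that returns a typed error envelope instead of letting a drifted payload through. The `ToolError` and `ToolResult` shapes, the `SCHEMA_MISMATCH` code, and the `ticket_lookup` tool name are hypothetical; only the `owner.display_name` field comes from the incident above.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolError:
    """Typed error envelope an orchestrator can route on (hypothetical shape)."""
    code: str        # machine-readable, e.g. "SCHEMA_MISMATCH"
    tool: str
    detail: str      # human-readable hint for the operator
    retryable: bool


@dataclass
class ToolResult:
    ok: bool
    data: Optional[dict] = None
    error: Optional[ToolError] = None


# Assumed contract for a hypothetical "ticket_lookup" tool: dotted path -> type.
TICKET_CONTRACT = {"ticket_id": str, "owner.display_name": str}


def get_path(payload: Any, dotted: str) -> Any:
    """Follow a dotted path; raise KeyError if any hop is missing."""
    current = payload
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            raise KeyError(dotted)
        current = current[key]
    return current


def validate_at_boundary(tool: str, payload: dict, contract: dict) -> ToolResult:
    """Check the payload against the contract before the agent ever sees it."""
    for path, expected_type in contract.items():
        try:
            value = get_path(payload, path)
        except KeyError:
            detail = f"missing field: {path}"
            return ToolResult(ok=False, error=ToolError("SCHEMA_MISMATCH", tool, detail, False))
        if not isinstance(value, expected_type):
            detail = f"wrong type for {path}: {type(value).__name__}"
            return ToolResult(ok=False, error=ToolError("SCHEMA_MISMATCH", tool, detail, False))
    return ToolResult(ok=True, data=payload)
```

With this in place, a payload that still carries the old `owner_name` field comes back as `ok=False` with a non-retryable `SCHEMA_MISMATCH`, which the orchestrator can send to an operator rather than asking the model to guess an owner.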

Failure mode 3: retry loops inflated cost and latency

The ugliest failure was a retry loop. The agent called the same tool with slightly different wording, got the same failure, and tried again because the prompt told it to "be persistent."

This is where product language can sabotage infrastructure. "Be persistent" sounds cute in a demo. In production it means the agent becomes a golden retriever with an AmEx.

[Figure: Retry Control Logic After Q2 Incidents]

What fixed it

maximum retry budget per run
required strategy change between retries
no retry on validation errors unless a repair function exists
deterministic fallback when budget is exhausted

The result: p95 latency moved from 18.4s to 9.7s on the same scenario set, and median cost per successful run dropped roughly 34%.
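The four rules above fit in one small control loop. This is a simplified sketch, not the production orchestrator: the exception names, the `run_with_budget` signature, and the default budget of three attempts are assumptions, and the whitespace-and-case `normalize` step is only a stand-in for whatever stronger check catches semantically identical queries.

```python
class ValidationError(Exception):
    """Hypothetical: the tool rejected the request itself as malformed."""


class ToolFailure(Exception):
    """Hypothetical: the tool failed in a way a different strategy might avoid."""


def normalize(request: str) -> str:
    """Crude identity check: collapse case and whitespace."""
    return " ".join(request.lower().split())


def run_with_budget(strategies, call_tool, fallback, max_attempts=3, repair=None):
    """Try distinct strategies within a fixed attempt budget, then fall back."""
    pending = list(strategies)
    seen = set()
    attempts = 0
    while pending and attempts < max_attempts:
        request = pending.pop(0)
        key = normalize(request)
        if key in seen:
            continue  # same request again: a strategy change is required
        seen.add(key)
        attempts += 1
        try:
            return call_tool(request)
        except ValidationError:
            if repair is None:
                break  # no repair function exists, so retrying cannot help
            pending.insert(0, repair(request))  # try the repaired request next
        except ToolFailure:
            continue  # move on to the next, different strategy
    return fallback()  # deterministic fallback once the budget is exhausted
```

Repaired requests count against the same budget, and a repair that returns an unchanged request is skipped as a duplicate, so a broken repair function cannot reopen the loop.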

Failure mode 4: eval coverage lagged feature velocity

We shipped features faster than we updated evals. That is not an eval problem. That is a management problem wearing a YAML hoodie.

The release board said green because the old suite was green. The new workflow introduced longer tool chains, new failure modes, and more ways to be subtly wrong. The eval suite did not know that yet.

[Figure: New Release Quality Gate]

What fixed it

Every new agent behavior now needs at least one of three artifacts before launch:

a scenario eval
a contract test
a runtime metric with an alert threshold

No artifact, no release. Very boring. Very effective. The best production systems often feel like a stern accountant quietly saving your weekend.
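The gate itself can be a few lines in CI. A minimal sketch, assuming each new behavior is declared somewhere machine-readable along with the artifacts that cover it; the `Behavior` record and the artifact kind strings are hypothetical names for illustration.

```python
from dataclasses import dataclass, field

# The three accepted artifact kinds, as hypothetical identifiers.
ARTIFACT_KINDS = {"scenario_eval", "contract_test", "runtime_metric"}


@dataclass
class Behavior:
    name: str
    artifacts: set = field(default_factory=set)


def release_blockers(behaviors):
    """Return the names of behaviors that have no recognized artifact."""
    return [b.name for b in behaviors if not (b.artifacts & ARTIFACT_KINDS)]


# Example: one covered behavior, one with only an unrecognized artifact.
behaviors = [
    Behavior("multi_step_refund_lookup", {"scenario_eval"}),
    Behavior("auto_close_stale_tickets", {"design_doc"}),
]
blockers = release_blockers(behaviors)
if blockers:
    print("release blocked, missing artifacts for:", ", ".join(blockers))
```

A real pipeline would exit non-zero when the list is non-empty; the sketch only prints so it stays safe to run anywhere.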

What changed culturally

The biggest shift was not technical. We stopped asking, "Did the model answer?" and started asking, "Can we explain the run?"

That changed design reviews. Instead of arguing about model brands, we reviewed:

what evidence entered context
which tool contracts were trusted
why the agent chose a retry versus fallback
what metric would catch this failure next time

This is less glamorous than swapping model providers. It also works.
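One way to make "Can we explain the run?" checkable is to give each run a trace record with a slot for each of the four review questions. This is an illustrative sketch only; the `RunTrace` fields and the `unexplained` helper are hypothetical, not a description of the tracing we actually run.

```python
from dataclasses import dataclass, field


@dataclass
class RunTrace:
    """One record per agent run, mirroring the four review questions."""
    run_id: str
    evidence: list = field(default_factory=list)           # (source_id, source_timestamp) pairs
    trusted_contracts: list = field(default_factory=list)  # tool contracts validated in this run
    decisions: list = field(default_factory=list)          # e.g. ("retry", "timeout on search")
    guard_metric: str = ""                                  # metric that would catch a repeat

    def unexplained(self):
        """List which review questions this trace cannot answer."""
        gaps = []
        if not self.evidence:
            gaps.append("evidence")
        if not self.trusted_contracts:
            gaps.append("trusted_contracts")
        if not self.decisions:
            gaps.append("decisions")
        if not self.guard_metric:
            gaps.append("guard_metric")
        return gaps
```

A design review can then start from `unexplained()`: any run that returns a non-empty list is one nobody can fully account for yet.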

Metrics after enforcement

Six weeks later:

task completion: 71% → 86% on sampled workflows
p95 latency: 18.4s → 9.7s
repeated-failure loops: 11% → 2.8%
manual escalation: 29% → 18%
incident review action items with tests/guards: under half → 100%

Treat those numbers as directional, not universal truth. Different workflows move differently. But the pattern was consistent: fewer spooky failures, fewer expensive loops, more boring success.

What we still have not solved

nuanced domain scoring still needs humans in the loop
long-horizon memory across sessions is still easy to fake and hard to audit
multi-tool degradation attribution remains messy when two dependencies fail at once

Anyone claiming those are solved is probably selling a platform deck with a suspicious number of gradients.

Final takeaway

The most useful agent work in Q2 was not "make the model smarter." It was making the system less unserious: fresh retrieval, strict contracts, bounded retries, and evals tied to shipped behavior.

That is the conservative AI take I keep coming back to: move fast, but put rails on the parts that can quietly cost you money. Demos reward magic. Production rewards receipts.

#Agents #Architecture #Reliability #Postmortem #LLMOps
