When the happy path lies
Tool-calling stacks often assume success because the HTTP status was 200 and the JSON parsed. The database can still reject the write, the idempotency key can collide, or the side effect can land in the wrong tenant. The LLM layers a plausible narrative on top either way.
That is not "hallucination" in the poetic sense. It is a missing contract between the model and the outside world.
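A minimal sketch of what that contract can look like at the call site. The field names (`outcome`, `"committed"`) are hypothetical, not a real tool protocol; the point is that transport success and a parsed body are never treated as business success:

```python
import json


def accept_tool_result(raw_response: str) -> dict:
    """Refuse to treat '200 + valid JSON' as proof the write happened.

    The tool is contractually required to report an application-level
    outcome; anything other than an explicit commit is a failure.
    """
    payload = json.loads(raw_response)  # JSON parsed: says nothing about the write
    if payload.get("outcome") != "committed":
        raise RuntimeError(f"tool reported non-success: {payload.get('outcome')!r}")
    return payload


# A perfectly well-formed 200 body that still represents a failed write:
# accept_tool_result('{"outcome": "rejected", "reason": "unique_violation"}')
# raises RuntimeError instead of letting the model narrate a success.
```

The failure surfaces as an exception the orchestration layer must handle, rather than as text the model can paper over.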
What production teams do differently
They make every tool return a structured outcome: intent, entity ids, mutation receipts, and explicit failure modes the model is allowed to reason about. Retries are timed and bounded. Idempotency tokens propagate across steps. Traces link a user-visible answer to raw tool payloads, not just to the assistant's polished summary.
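One way to sketch that structured outcome and the bounded, idempotency-preserving retry. The shape below is an illustration, not a standard: the field names and the `with_bounded_retry` helper are assumptions made for this example.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ToolOutcome:
    """A structured result every tool returns, success or not."""
    intent: str                        # what the call was trying to do
    entity_ids: list                   # rows/objects it actually touched
    receipt: Optional[str] = None      # mutation receipt from the backend, if any
    failure: Optional[str] = None      # explicit failure mode the model may reason about
    idempotency_key: str = field(default_factory=lambda: str(uuid.uuid4()))

    @property
    def ok(self) -> bool:
        return self.failure is None


def with_bounded_retry(call: Callable[[str], ToolOutcome],
                       key: str,
                       max_attempts: int = 3) -> ToolOutcome:
    """Retry a tool call a bounded number of times, propagating the SAME
    idempotency key so repeats cannot double-apply the mutation."""
    outcome = call(key)
    for _ in range(max_attempts - 1):
        if outcome.ok:
            break
        outcome = call(key)
    return outcome
```

Because the key is generated once and threaded through every attempt, a retry after an ambiguous failure resolves to the original mutation instead of a duplicate one.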
Instrumentation you cannot skip
- Per-tool latency and error rate SLOs
- Schema validation on responses before they enter the model again
- Canary prompts that assert dangerous operations are impossible without a human gate
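The second bullet, validating tool responses before they re-enter the model, can be as simple as a type check against a minimal contract. The schema below is a hypothetical example, not a prescribed format:

```python
def validate_tool_response(payload: dict) -> dict:
    """Reject malformed tool output before it is fed back to the model.

    A response that fails validation becomes an orchestration error,
    not raw text the model gets to interpret creatively.
    """
    required = {"outcome": str, "entity_ids": list}  # assumed minimal contract
    for key, expected_type in required.items():
        if not isinstance(payload.get(key), expected_type):
            raise ValueError(f"tool response missing or invalid field: {key!r}")
    return payload
```

In practice a JSON Schema validator does the same job with richer constraints; the essential move is that validation happens on the boundary, every time, before the payload reaches the prompt.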
Takeaway
If you would not ship a microservice without tracing, do not ship an agent without it. Silent tool failures are how AI features quietly lose the trust they earned.
