AI Products · #RAG #Latency #Performance #Architecture

Production RAG Needs Latency Budgets, Not Hope

A RAG answer can be correct and still lose the user. I learned that the boring way: by watching a good retrieval pipeline feel slow enough to be broken.

Misha Lubich

May 15, 2026 · 3 min read

The answer was correct. The user still hated it.

That is the part RAG benchmark decks skip. Offline quality looked solid. Retrieval hit rate was up. The response cited the right documents. But the interaction felt like waiting for an expensive intern to find Wi-Fi.

The project was a support/research RAG workflow with hybrid retrieval, reranking, generated summaries, and citations. The first version optimized answer quality. The second version optimized the thing users actually experience: time to useful confidence.

The failure

We had stages that each looked fine alone:

query normalization: about 90ms
hybrid retrieval: 300-500ms
reranking: 250-400ms
prompt assembly: 120ms
model generation: 1.4-2.8s
citation formatting and post-processing: 200-350ms

Individually, not tragic. Together, p95 drifted past 4.6s on messy queries. That is not "AI magic." That is a progress bar with a trust problem.
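
To make the arithmetic concrete, here is a minimal sketch that sums the stage figures listed above. It assumes the stages run strictly one after another, which the numbers suggest but the post does not state outright; the stage keys are illustrative names, not identifiers from the real pipeline.

```python
# Stage latencies in milliseconds as (low, high), taken from the list above.
# Stages quoted as a single number use it for both ends.
STAGES_MS = {
    "query_normalization": (90, 90),
    "hybrid_retrieval": (300, 500),
    "reranking": (250, 400),
    "prompt_assembly": (120, 120),
    "model_generation": (1400, 2800),
    "citation_post_processing": (200, 350),
}


def total_range_ms(stages):
    """Sum the low ends and the high ends of a sequential pipeline."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


low, high = total_range_ms(STAGES_MS)
print(f"sequential total: {low}ms to {high}ms")
```

Run as written, this prints `sequential total: 2360ms to 4260ms`. Generation alone is about two thirds of the slow end, but the other five stages still add up to more than a second.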

Budget the full path

Start with user tolerance, not model preference. If the product feels broken beyond 2.5 seconds, make that the p95 target, split it into per-stage budgets, and make every stage fight for its milliseconds.

[Figure: RAG Latency Budget Example (p95 target: 2500ms)]
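
The original budget figure is not reproduced here, so the following is a hedged sketch of what a per-stage budget could look like. The 2500ms target comes from the caption above; every per-stage allocation is a hypothetical number chosen only so the parts sum to the target, not the split the project actually used.

```python
P95_TARGET_MS = 2500  # target from the caption above

# Hypothetical per-stage allocations (illustrative only).
BUDGET_MS = {
    "query_normalization": 50,
    "hybrid_retrieval": 350,
    "reranking": 250,
    "prompt_assembly": 50,
    "model_generation": 1600,
    "citation_post_processing": 200,
}

# A budget that does not add up to the target is not a budget.
assert sum(BUDGET_MS.values()) == P95_TARGET_MS


def over_budget(measured_p95_ms, budget_ms=BUDGET_MS):
    """Return {stage: overage_ms} for stages whose measured p95 exceeds budget."""
    return {
        stage: measured_p95_ms[stage] - limit
        for stage, limit in budget_ms.items()
        if measured_p95_ms.get(stage, 0) > limit
    }


# Measured values here are the slow ends of the stage figures listed earlier.
measured = {
    "query_normalization": 90,
    "hybrid_retrieval": 500,
    "reranking": 400,
    "prompt_assembly": 120,
    "model_generation": 2800,
    "citation_post_processing": 350,
}

for stage, overage in over_budget(measured).items():
    print(f"{stage}: {overage}ms over budget")
```

Under these assumed allocations every stage is over, with generation the worst offender at 1200ms. The point of writing it down is that each stage owner gets a number to defend instead of a shared shrug at the total.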

What we tried that did not fix it

bigger model: better prose, same sluggish feel
more retrieved chunks: slightly better recall, worse latency
heavier reranker: improved edge cases, punished normal queries

Very Silicon Valley arc: we bought more intelligence before admitting the pipeline needed a stopwatch.

What did fix it

candidate set capped by query class
citation streaming after first useful answer span
cached embeddings for repeated/account-level queries
fallback to lightweight rerank for low-risk questions
alerting on p95 budget by stage, not just total latency
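
Three of the fixes above (the candidate cap, the lightweight rerank fallback, and cached embeddings) come down to routing by query class and memoizing repeated work. A minimal sketch follows, assuming made-up query classes, limits, and reranker labels; `cached_embedding` wraps a stand-in function, not a real embedding API.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class RetrievalPolicy:
    max_candidates: int  # cap on chunks handed to the reranker
    reranker: str        # "light" or "heavy" (hypothetical labels)


# Hypothetical query classes and limits; tune against real traffic.
POLICIES = {
    "lookup": RetrievalPolicy(max_candidates=10, reranker="light"),
    "troubleshooting": RetrievalPolicy(max_candidates=25, reranker="heavy"),
    "research": RetrievalPolicy(max_candidates=40, reranker="heavy"),
}
DEFAULT_POLICY = POLICIES["troubleshooting"]


def policy_for(query_class: str) -> RetrievalPolicy:
    """Look up the policy for a query class, falling back to the default."""
    return POLICIES.get(query_class, DEFAULT_POLICY)


def cap_candidates(candidates: list, query_class: str) -> list:
    """Keep the top-N candidates allowed for this query class.

    Assumes the retriever already returned candidates sorted best-first.
    """
    return candidates[: policy_for(query_class).max_candidates]


@lru_cache(maxsize=10_000)
def cached_embedding(normalized_query: str) -> tuple:
    """Memoize embeddings for repeated queries.

    The body is a stand-in; a real system would call its embedding model here.
    """
    return tuple(float(ord(ch)) for ch in normalized_query[:8])


chunks = [f"chunk-{i}" for i in range(50)]
print(len(cap_candidates(chunks, "lookup")), policy_for("lookup").reranker)

cached_embedding("reset password")
cached_embedding("reset password")  # served from the cache
print(cached_embedding.cache_info().hits)
```

The in-process `lru_cache` is only the simplest possible illustration. Account-level caching as described above would more plausibly live in a shared store keyed by account and normalized query.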

After the changes, the same scenario suite moved from 4.6s p95 to roughly 2.7s p95, with no meaningful drop in answer acceptance.
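
Per-stage p95 numbers like these only exist if each stage is timed separately. Here is one possible sketch of a stage timer plus a budget check, assuming a nearest-rank p95 and hypothetical budgets; the sample data is synthetic, not the project's measurements.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Collects wall-clock samples per pipeline stage, in milliseconds."""

    def __init__(self):
        self.samples_ms = defaultdict(list)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.samples_ms[name].append(elapsed_ms)


def p95(samples):
    """Nearest-rank 95th percentile, using integer math for the rank."""
    ordered = sorted(samples)
    rank = max(1, (95 * len(ordered) + 99) // 100)  # ceil(0.95 * n)
    return ordered[rank - 1]


def budget_alerts(samples_ms, budget_ms):
    """Return one message per stage whose p95 exceeds its budget."""
    alerts = []
    for stage, limit in budget_ms.items():
        samples = samples_ms.get(stage)
        if samples and p95(samples) > limit:
            alerts.append(f"{stage}: p95 {p95(samples):.0f}ms > budget {limit}ms")
    return alerts


# Timing a stage: wrap the real call in the context manager.
timer = StageTimer()
with timer.stage("prompt_assembly"):
    sum(range(1000))  # stand-in for real work

# Synthetic samples: reranking is usually fast but has a slow tail.
synthetic = {
    "reranking": [200] * 18 + [400] * 2,
    "hybrid_retrieval": [300] * 20,
}
hypothetical_budget = {"reranking": 250, "hybrid_retrieval": 350}
print(budget_alerts(synthetic, hypothetical_budget))
```

In this synthetic run only reranking trips the alert, even though its typical request is well inside budget. That is the case a total-latency alert tends to hide: the tail of one stage is eating the margin for everything downstream.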

Takeaway

RAG quality is not just correctness. It is correctness delivered before the user emotionally exits the tab. Budget the path, enforce the budget, and make latency visible before it turns a good system into a slow one.
