Production RAG Needs Latency Budgets, Not Hope

RAG quality collapses when retrieval, reranking, and generation each add 'just a little' latency. A hard latency budget forces better architecture decisions.

Many RAG outages are not correctness failures. They are patience failures.

Each stage looks acceptable in isolation: retrieval adds 120ms, reranking adds 180ms, generation adds 900ms. That is already 1.2 seconds on the happy path. Network overhead and retries then turn an "okay" path into a multi-second experience that users abandon.
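To make the compounding concrete, here is a small arithmetic sketch. The three stage latencies are the ones quoted above; the per-stage network overhead and the single retried generation call are assumptions chosen only for illustration.

```python
# Stage latencies quoted in the text, in milliseconds.
STAGE_MS = {"retrieval": 120, "reranking": 180, "generation": 900}

# Assumption for illustration: each stage pays a fixed network overhead.
NETWORK_OVERHEAD_MS = 50


def end_to_end_ms(stages, overhead_ms=0, retried_stage=None):
    """Sum stage latencies, per-stage overhead, and one optional retry."""
    total = sum(stages.values()) + overhead_ms * len(stages)
    if retried_stage is not None:
        # A retry repeats the stage and pays its network overhead again.
        total += stages[retried_stage] + overhead_ms
    return total


happy_path = end_to_end_ms(STAGE_MS)
with_overhead = end_to_end_ms(STAGE_MS, NETWORK_OVERHEAD_MS)
with_retry = end_to_end_ms(STAGE_MS, NETWORK_OVERHEAD_MS, "generation")

print(happy_path, with_overhead, with_retry)  # 1200 1350 2300
```

Under these assumed numbers, one retried generation call nearly doubles the happy-path total, even though no single stage changed.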

Budget the full path

Start with user tolerance, not model preference. If your product interaction feels broken beyond 2.5 seconds, design backward from that limit and assign explicit stage budgets.

RAG Latency Budget Example (p95 target: 2500ms)
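One way to write such a budget down is as data that code can check. The sketch below is an assumption, not the original figure's values: the stage names and the split are hypothetical, and only the 2500ms p95 target comes from the text.

```python
P95_TARGET_MS = 2500  # end-to-end p95 target from the text

# Hypothetical split across stages; tune to your own measurements.
STAGE_BUDGETS_MS = {
    "network_and_auth": 200,
    "retrieval": 300,
    "reranking": 300,
    "generation_first_token": 900,
    "generation_remaining": 500,
    "headroom": 300,
}


def validate_budgets(budgets, target_ms):
    """Return unallocated milliseconds; raise if stages exceed the target."""
    allocated = sum(budgets.values())
    if allocated > target_ms:
        raise ValueError(
            f"stage budgets total {allocated}ms, over the {target_ms}ms target"
        )
    return target_ms - allocated


def over_budget(measured_p95_ms, budgets):
    """Map each stage that exceeds its budget to its overshoot in ms."""
    return {
        stage: observed - budgets[stage]
        for stage, observed in measured_p95_ms.items()
        if observed > budgets[stage]
    }


UNALLOCATED_MS = validate_budgets(STAGE_BUDGETS_MS, P95_TARGET_MS)
```

Keeping an explicit headroom entry makes the trade-off visible: any stage that wants more time has to take it from a named line item.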

Design consequences

Once you have hard budgets, architecture decisions become obvious:

  • shrink retrieval candidate set when reranking dominates
  • cache embeddings for repeated queries
  • reduce context payload for low-value sections
  • stream partial responses early for perceived performance
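Two of these consequences can be sketched in a few lines. The code below assumes a reranker whose cost is roughly a fixed overhead plus a per-candidate cost, and uses a stand-in embedding function. All names and numbers are hypothetical.

```python
from functools import lru_cache


def candidates_within_budget(rerank_budget_ms, fixed_cost_ms,
                             per_candidate_ms, max_candidates,
                             min_candidates=1):
    """Pick the largest candidate set the reranking budget can afford.

    Assumes reranking cost ~= fixed_cost_ms + n * per_candidate_ms.
    """
    affordable = int((rerank_budget_ms - fixed_cost_ms) // per_candidate_ms)
    return max(min_candidates, min(max_candidates, affordable))


def normalize(query):
    """Collapse case and whitespace so near-identical queries share a key."""
    return " ".join(query.lower().split())


def _embed_uncached(text):
    """Stand-in for a real embedding call (hypothetical placeholder)."""
    return tuple(float(ord(ch)) for ch in text[:8])


@lru_cache(maxsize=10_000)
def embed_query(normalized_query):
    """Cache embeddings so repeated queries skip the embedding call."""
    return _embed_uncached(normalized_query)
```

For example, with a 300ms reranking budget, 20ms of fixed overhead, and 4ms per candidate, `candidates_within_budget(300, 20, 4, max_candidates=100)` returns 70. An in-process `lru_cache` is only the simplest option; a shared cache would be needed across multiple workers.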

Teams that skip budgets usually over-optimize for offline quality and lose users in real interaction loops.

Takeaway

If latency is unbounded, quality work is invisible. Budget the path and enforce it with CI checks and runtime alerts.
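A CI gate for this can be small. The sketch below assumes you already collect per-stage latency samples from a load test. It computes a nearest-rank p95 per stage and reports any stage over its budget. The function names and data shapes are hypothetical.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples_ms:
        raise ValueError("no latency samples")
    ordered = sorted(samples_ms)
    # Integer form of ceil(0.95 * n), avoiding float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def check_latency_budget(samples_by_stage, budgets_ms):
    """Return {stage: (observed_p95, budget)} for every violated budget."""
    violations = {}
    for stage, budget in budgets_ms.items():
        observed = p95(samples_by_stage[stage])
        if observed > budget:
            violations[stage] = (observed, budget)
    return violations
```

In a CI job, a non-empty result from `check_latency_budget` would fail the build, for instance by calling `sys.exit(1)` after printing the violations. The same comparison can back a runtime alert when it runs against production metrics instead of load-test samples.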
