Every engineering team ships a RAG v1 that demos well and degrades quietly. Five failure modes make that happen: token-count chunking, cosine-only ranking, context-free indexes, absent evals, and stale ingestion. Our earlier post, /blog/rag-mistakes, names them at altitude. This one is the 500-foot answer. Retrivon is a production enterprise knowledge platform we built and shipped, where 100% of answers carry source citations and no PR merges without a retrieval regression check. Below are the exact decisions we made, the numbers that forced each call, and the blueprint you can apply to your own internal knowledge system.
The constraints that set every architectural decision
Retrivon is not a general web-search system. It ingests manuals and PDFs, SOPs and policies, support documentation, and internal wikis, all from a single enterprise tenant. Correctness per answer matters more than raw throughput. Before a line of code was written, the engagement had four non-negotiables: answers must cite their source document, wrong answers must be detectable, the index must reflect the live source of truth at all times, and the system must deploy privately within the customer's infrastructure boundary.
Everything below was forced by one of those four. The constraints came from procurement and legal as much as from engineering. That is the context that makes the decisions legible. Source transparency is what drives enterprise adoption: it is not a feature anyone requested; it is the condition under which a contract gets signed.
Chunk strategy: designed, not defaulted
The failure mode in every audit is answers that cite the right document but the wrong sentence. A 512-token splitter cuts policy clauses mid-thought, and the model quotes half a list with full confidence. The fix is to make the chunk boundary a decision, not a library default.
To make that concrete: the naive splitter on Retrivon's SOP corpus produced a chunk that read "must be completed before the next stage. Failure to do so results in escalation to the compliance team." It contained no heading, no document name, no procedure number. A retrieval query for "what triggers a compliance escalation" ranked it third -- the cosine score was decent -- but the generated answer cited a procedure fragment with no actionable anchor. The structure-aware chunker on the same corpus produced a chunk beginning with "From Vendor Onboarding SOP v3.1, section 4.2 (Stage Gate Approval): the stage gate approval form must be completed before the next stage. Failure to do so results in escalation to the compliance team." The same query now ranked it first, and the generated answer cited section 4.2 by name. Same underlying text. Different boundary decision.
For Retrivon we use document-structure-aware chunking: split on headings, numbered-list items, and paragraph boundaries first. The token budget is a fallback for unstructured bodies, not the primary rule. This preserves the logical unit the author wrote. A policy clause stays whole. A numbered procedure step does not bleed into the next step.
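As a rough illustration of that ordering (structure first, token budget only as a fallback), here is a minimal sketch. It is not Retrivon's chunker: the markdown-style heading pattern, the "1." / "1)" list pattern, whitespace token counting, and the name `chunk_document` are all assumptions made for the example.

```python
import re
from typing import Dict, List

HEADING_RE = re.compile(r"^#{1,6}\s+")      # assumption: markdown-style headings
LIST_ITEM_RE = re.compile(r"^\d+[.)]\s+")   # assumption: "1." or "1)" numbered steps


def chunk_document(text: str, max_tokens: int = 200) -> List[Dict[str, str]]:
    """Split on headings, numbered-list items, and blank lines; window by tokens only as a fallback."""
    chunks: List[Dict[str, str]] = []
    section = ""            # most recent heading, carried on every chunk
    buf: List[str] = []     # lines of the logical unit being assembled

    def flush() -> None:
        if not buf:
            return
        words = " ".join(buf).split()   # whitespace tokens stand in for a real tokenizer
        buf.clear()
        # Fallback: only a unit that exceeds the budget is cut by token count.
        for i in range(0, len(words), max_tokens):
            chunks.append({"section": section, "text": " ".join(words[i:i + max_tokens])})

    for raw in text.splitlines():
        line = raw.strip()
        if not line:                        # paragraph boundary
            flush()
        elif HEADING_RE.match(line):        # heading closes the unit and names the next section
            flush()
            section = HEADING_RE.sub("", line)
        elif LIST_ITEM_RE.match(line):      # each numbered step is its own unit
            flush()
            buf.append(line)
        else:
            buf.append(line)                # continuation of the current unit
    flush()
    return chunks
```

Carrying the section heading on each chunk is a design choice for the sketch: it keeps the information that context prepending needs available at ingest time.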
Context prepending at index time is the second half of the fix. Every chunk gets a one-to-two sentence preamble (source file name, section heading, document type, effective date) written by a small model at ingest time and embedded with the chunk. The raw chunk that reads "revenue grew 3% over the previous quarter" becomes "From Acme Corp Q2 2023 earnings brief, section Revenue Summary: revenue grew 3% over the previous quarter." Simon Willison's writeup of Anthropic's contextual retrieval work names this exact failure mode and puts a number on the solution. Using prompt caching, the one-time cost to generate contextualised chunks is $1.02 per million document tokens. That is an ingest expense, not a per-query cost. It ends the "too expensive" objection in the room.
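To make the shape of the preamble and the cost arithmetic concrete, here is a small sketch. The preamble in Retrivon is written by a small model; the string template below is only a stand-in for that step, and the `Chunk` fields and function names are hypothetical.

```python
from dataclasses import dataclass

COST_PER_MILLION_TOKENS_USD = 1.02  # figure quoted above, with prompt caching


@dataclass
class Chunk:
    source_name: str
    section: str
    text: str


def with_preamble(chunk: Chunk) -> str:
    """Template stand-in for the model-written preamble; the returned string is what gets embedded."""
    return f"From {chunk.source_name}, section {chunk.section}: {chunk.text}"


def ingest_cost_usd(document_tokens: int) -> float:
    """One-time contextualisation cost at the quoted per-million-token rate."""
    return document_tokens / 1_000_000 * COST_PER_MILLION_TOKENS_USD
```

At that rate, a 25-million-token corpus would cost roughly $25.50 to contextualise once, which is the sense in which this is an ingest expense and not a per-query one.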
The measurement gate before shipping: recall@k on 50 labelled queries across a structure-aware namespace and a token-split namespace, same corpus. The number decides. The vibe does not.
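A minimal sketch of that comparison, assuming each labelled query maps to a set of relevant chunk IDs and each namespace returns a ranked list of chunk IDs per query. The function names and data shapes are illustrative, not part of Retrivon.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the labelled relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs: Dict[str, List[str]], labels: Dict[str, Set[str]], k: int) -> float:
    """Average recall@k over every labelled query; a query missing from `runs` scores zero."""
    scores = [recall_at_k(runs.get(query, []), relevant, k) for query, relevant in labels.items()]
    return sum(scores) / len(scores)
```

Run it once per namespace with the same labels and the same k, and compare the two means.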
Hybrid retrieval and the two-stage reranker
Dense vector search alone is a precision problem waiting to happen. Enterprise knowledge bases are dense with exact terminology: regulation IDs, product codes, version strings, internal codes. A bi-encoder compresses these into a fixed-dimension vector and loses the exact-match signal. BM25 does not. The first retrieval stage in Retrivon is a hybrid index: dense embeddings for semantic similarity, BM25 for exact-term recovery. Neither alone is sufficient.
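The text does not say how the dense and BM25 result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below purely as an assumption about how such a merge could work; the constant `k = 60` is the conventional default for RRF, not a Retrivon setting.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because RRF works on ranks and ignores raw scores, it sidesteps the problem that BM25 scores and cosine similarities are not on a comparable scale.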
The second stage is a cross-encoder reranker on the top-50 candidates from the first pass. The bi-encoder scores query and document independently; the cross-encoder scores them together in a single transformer pass, which is why Pinecone's reranker guide describes cross-encoders as receiving "the raw information directly into the large transformer computation, meaning less information loss." More signal per scoring step, fewer documents passing through.
Why top-50 as the rerank window, not top-5 from the first stage? Two reasons. First, bi-encoders are lossy. The correct document is often in positions 6-20 of the first-stage results, not the top 5. A rerank window cut too close to the top drops those candidates before the cross-encoder ever sees them, and 50 leaves headroom beyond position 20.
Second, raising k without reranking is the wrong fix: as the same Pinecone source notes, LLM recall degrades as the context window fills. Sending fifty unranked chunks to the model trades a precision problem for a precision-and-cost problem. Reranking to top-5 before the prompt is assembled is what keeps both quality and latency in bounds.
The latency constraint that shaped this call: Retrivon had a sub-two-second end-to-end target. Two-stage retrieval with a top-50 rerank window lands inside it. The alternative (reranking the full index) is not a viable option at any production scale. Pinecone quantifies this directly: running BERT on 40 million records on a V100 GPU takes "more than 50 hours" versus "less than 100ms with encoder models and vector search." The two-stage shape is not an optimisation; it is the only shape that works.
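The two-stage shape can be sketched with the pair scorer left pluggable. In the sketch below, `overlap_score` is a toy stand-in for a cross-encoder so the example runs without a model; the function names and the `(doc_id, text)` candidate shape are assumptions.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, str]  # (doc_id, text), in first-stage rank order


def rerank(query: str, candidates: List[Candidate],
           score_pair: Callable[[str, str], float],
           window: int = 50, top_n: int = 5) -> List[str]:
    """Score only the first `window` candidates jointly with the query, then keep `top_n`."""
    scored = [(score_pair(query, text), doc_id) for doc_id, text in candidates[:window]]
    scored.sort(key=lambda pair: -pair[0])   # stable sort: ties keep first-stage order
    return [doc_id for _, doc_id in scored[:top_n]]


def overlap_score(query: str, text: str) -> float:
    """Toy scorer (share of query tokens found in the text); a real system would call a cross-encoder."""
    query_tokens = set(query.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & set(text.lower().split())) / len(query_tokens)
```

The expensive scorer runs at most `window` times per query regardless of index size, which is what keeps the second stage inside a fixed latency budget.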
Source-grounded citations: a contract, not a feature
The 100% source-grounded claim in the Retrivon case study is not a marketing metric. It is the architectural contract that allowed the system to be sold to enterprise buyers at all. Procurement and legal do not evaluate retrieval precision. They evaluate the audit trail.
Every chunk in the Retrivon index carries structured metadata: source_id, doc_type, section, and date. The generation prompt is templated around those fields. Each factual clause in the system prompt is tagged with the chunk ID it must draw from. The model is instructed to emit a citation marker inline with each claim, keyed to the chunk ID. If a claim does not map to a retrieved chunk, the prompt structure surfaces that gap before the response is assembled.
That gap triggers suppression, not a disclaimer. The claim is not returned with a hedge like "this may be inaccurate." It is not returned at all. The answer either cites a retrieved chunk or the claim is dropped from the response. In practice, the suppression rate on well-indexed corpora is low -- well under 5% of generated claims -- but the rate is monitored and logged. A spike in suppression signals either an index freshness problem or a query distribution shift, both of which are actionable.
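A minimal sketch of the suppression rule, assuming the generation step has already been parsed into claims that each carry the chunk ID from their citation marker (or none). The `Claim` type and `suppress_uncited` are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Claim:
    text: str
    chunk_id: Optional[str]  # ID from the inline citation marker, None if absent


def suppress_uncited(claims: List[Claim], retrieved_ids: Set[str]) -> Tuple[List[Claim], float]:
    """Drop every claim whose citation does not map to a retrieved chunk; report the suppression rate."""
    kept = [claim for claim in claims if claim.chunk_id in retrieved_ids]
    rate = (len(claims) - len(kept)) / len(claims) if claims else 0.0
    return kept, rate
```

Returning the rate alongside the surviving claims makes it straightforward to log per response and alert on a spike.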
The operational safeguards layer (all queries logged, source retrieval tracked, access patterns monitored) is what makes the citation guarantee defensible under audit. The same retrieval logs that prove provenance to a customer also let the engineering team compute which source documents are being cited most and least, and flag when a high-confidence answer is grounded in a document that was last updated eighteen months ago.
The eval harness: the measurement ships before the feature
The eval harness for Retrivon follows the same operating discipline described in /blog/production-ai-checklist: the measurement contract is in place before any architectural change ships. The harness is not a QA step at the end; it is a CI gate on every PR.
The eval set contains 150 labelled queries, generated from the corpus using the HuggingFace RAG evaluation cookbook approach. Factoid QA pairs are generated from the chunks, then filtered through three critique agents for groundedness, relevance, and standalone-ness, keeping only score-4-or-better on all three. The set is tagged by query type: fresh, ambiguous, multi-hop, and policy. The policy slice is the one that catches the most regressions.
Two metrics run on every PR: nDCG@5 on retrieval, and an LLM-judge score (1-5 scale) on groundedness and relevance for generation. A drop in nDCG@5 of more than 3 points on any tag slice blocks merge. The eval set is the contract. "It works" with no number behind the verb is not a merge condition.
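The retrieval half of that gate can be sketched as follows. Two assumptions are baked in: nDCG is computed on a 0-to-1 scale, so "3 points" is read as 0.03, and per-slice scores are already averaged over the queries in each tag. The function names are illustrative.

```python
import math
from typing import Dict, List


def dcg(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_rels: List[float], judged_rels: List[float], k: int = 5) -> float:
    """ranked_rels: labels of the returned results in order; judged_rels: all labels for the query."""
    ideal = dcg(sorted(judged_rels, reverse=True), k)
    return dcg(ranked_rels, k) / ideal if ideal > 0 else 0.0


def blocked_slices(baseline: Dict[str, float], candidate: Dict[str, float],
                   max_drop_points: float = 3.0) -> List[str]:
    """Tag slices whose nDCG@5 fell by more than the allowed number of points (1 point = 0.01)."""
    return sorted(tag for tag, base in baseline.items()
                  if (base - candidate.get(tag, 0.0)) * 100 > max_drop_points)
```

A CI job would fail the check whenever `blocked_slices` returns a non-empty list, and print the offending tags.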
The practical value shows up in the issues the harness caught before customers saw them. One concrete example: a re-indexing run introduced a bug in the content-hash comparison logic. Documents whose content had not changed were being re-chunked with fresh chunk IDs, silently invalidating the existing citation-to-chunk mapping. Any answer generated after the re-index would surface a chunk ID that no longer existed in the live index. A customer querying a recently re-indexed policy area would have received a grounded-looking response that cited a ghost chunk. The policy slice of the eval harness caught the regression on the PR that introduced the re-index change. The nDCG@5 drop on policy queries was 6 points, well above the 3-point gate. The CI check failed, the PR did not merge, and the bug was patched before the re-index ran in production.
Index freshness: the failure mode that arrives months after launch
One-shot ingestion works once. It decays from the moment it runs. Deleted documents remain retrievable. New policies are invisible to the index. Customers discover this by citing the deprecated policy back to your support team. That is the moment the staleness problem becomes a customer-trust problem.
Retrivon uses webhook-triggered incremental ingestion off the source system of record. Every chunk carries a content_hash and a source_id. A delete event from the source triggers chunk removal by source_id, not a full re-index. New documents flow through the same chunking and context-prepending pipeline before being added to the index. The index reflects the source of truth because the source system notifies the index on every change, not because a nightly job re-ingests everything and hopes the diff is clean.
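Here is a sketch of the event handling against an in-memory index. The event shape (`type`, `source_id`, `body`), the `ChunkIndex` class, the chunk ID format, and the use of a document-level SHA-256 as the `content_hash` are all assumptions; the real webhook payload and vector store will differ.

```python
import hashlib
from typing import Callable, Dict, List


class ChunkIndex:
    def __init__(self) -> None:
        self.chunks: Dict[str, dict] = {}     # chunk_id -> chunk record
        self.doc_hash: Dict[str, str] = {}    # source_id -> content_hash of the indexed version

    def delete_source(self, source_id: str) -> int:
        """Remove every chunk for one source document; no full re-index."""
        doomed = [cid for cid, rec in self.chunks.items() if rec["source_id"] == source_id]
        for cid in doomed:
            del self.chunks[cid]
        self.doc_hash.pop(source_id, None)
        return len(doomed)

    def upsert_source(self, source_id: str, body: str,
                      chunker: Callable[[str], List[str]]) -> bool:
        """Re-chunk only when the content hash changed, so unchanged documents keep their chunk IDs."""
        content_hash = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if self.doc_hash.get(source_id) == content_hash:
            return False
        self.delete_source(source_id)
        for n, text in enumerate(chunker(body)):
            chunk_id = f"{source_id}:{content_hash[:8]}:{n}"
            self.chunks[chunk_id] = {"source_id": source_id,
                                     "content_hash": content_hash, "text": text}
        self.doc_hash[source_id] = content_hash
        return True


def handle_event(index: ChunkIndex, event: dict, chunker: Callable[[str], List[str]]) -> None:
    """Route one webhook event: deletes remove by source_id, everything else is an upsert."""
    if event["type"] == "deleted":
        index.delete_source(event["source_id"])
    else:
        index.upsert_source(event["source_id"], event["body"], chunker)
```

The early return on an unchanged hash guards against the class of bug where identical content is re-chunked under fresh chunk IDs.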
Two dashboard metrics prove freshness is live. Index-age p95: for chunks retrieved in the last 24 hours, the 95th percentile of how old the source verification is. Deleted-source-still-retrievable rate: of recently deleted source documents, what fraction still return chunks in retrieval. Both should be near zero.
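Both metrics reduce to a few lines once the retrieval logs are in hand. The sketch below assumes the logs yield, per retrieved chunk, an age (time since the source was last verified) and a `source_id`; the nearest-rank percentile method and the function names are choices made for the example.

```python
import math
from typing import Iterable, List, Set


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; returns 0.0 for an empty sample."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def index_age_p95(verification_ages: Iterable[float]) -> float:
    """95th percentile of source-verification age across chunks retrieved in the window."""
    return percentile(list(verification_ages), 95)


def deleted_still_retrievable_rate(deleted_source_ids: Set[str],
                                   retrieved_source_ids: Iterable[str]) -> float:
    """Share of recently deleted sources that still had a chunk returned by retrieval."""
    if not deleted_source_ids:
        return 0.0
    leaked = deleted_source_ids & set(retrieved_source_ids)
    return len(leaked) / len(deleted_source_ids)
```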
The retrieval logs that prove source provenance to a customer audit are the same logs that compute index-age p95. One artefact. Two jobs.
The blueprint is stealable
The five failure modes in /blog/rag-mistakes are not theoretical. Each one was an explicit architectural decision in Retrivon. The blueprint:
- Structure-aware chunking with context prepending. Chunk on document structure. Prepend context at index time for ~$1 per million tokens. Measure with recall@k on labelled queries before shipping.
- BM25 + dense hybrid with cross-encoder reranking. First stage captures exact-match and semantic signal. Cross-encoder reranks top-50 to top-5 before the LLM prompt is built. Two-stage is the only shape that hits a sub-two-second latency target at scale.
- Citation-validated generation. Every chunk carries source metadata. Every factual claim is grounded to a chunk ID before surfacing. Uncited claims do not surface.
- A labelled eval set in CI. 150 queries, tagged, with nDCG@5 and LLM-judge scores on every PR. A drop of more than 3 points on any slice blocks merge.
- Webhook-triggered incremental ingestion. Deletes are first-class events. Index-age p95 and deleted-source-still-retrievable rate are dashboard metrics, not aspirational ones.
Pick the constraint your current system violates most visibly. Stand up the measurement first. Fix it against that number. The full outcomes picture (retrieval speed, citation coverage, enterprise deployment) is in the Retrivon case study.
If you want to pressure-test your current RAG stack against these criteria, talk to us.