Five RAG failure modes we still find in 2026 audits

May 13, 2026 · 10 min read

RAG is a mature pattern. The systems we audit in 2026 are not. The same five errors keep showing up at companies that already shipped a v1: chunking by token count, retrieval scored on cosine alone, chunks indexed without metadata, no eval set behind the words "it works", and an index that has not seen the source of truth in weeks. The cost is not wrong answers; it is that the team cannot tell when answers are wrong, cannot prove a change helped, and cannot defend "better" to a CFO. Each failure mode below comes with the same triplet: the symptom we look for, the fix we ship, and the cheapest measurement that proves the fix worked.

Chunking by token count is a defaulted decision, not a designed one

The symptom is loud once you look for it. Answers cite the right document but the wrong sentence. Long policy PDFs come back as fragments that read like ransom notes. A 512-token splitter cuts a list in two, leaving the first half at the end of one chunk and the second half at the start of the next, and the model confidently quotes half a list. Pinecone's chunking guide names the bar: "If the chunk of text makes sense without the surrounding context to a human, it will make sense to the language model as well." Most 512-token splits fail that test on their own corpus, and nobody on the team has read enough chunks to know.

We see this because chunk size is the first decision most v1 teams make, the first decision they outsource to a defaulting library, and the last decision they revisit. The same Pinecone guide is blunt: "There's no one-size-fits-all solution to chunking, so what works for one use case may not work for another." Databricks corroborates from a different angle: "Finding the proper chunking method is both iterative and context-dependent."

  • Symptom: answers cite the right document but the wrong sentence; lists arrive in halves; 512-token defaults split policy clauses mid-thought.
  • Fix: if you can re-index, switch to recursive or document-structure-aware chunking (split on headings, lists, paragraph boundaries before falling back to a token budget). If re-indexing is expensive, retrieve neighbours — Pinecone calls this chunk expansion, "an easy way to post-process chunked data from a database, by retrieving neighboring chunks." Zero index-time cost, one extra round-trip at query time.
  • Measurement: recall@k on 50 labelled queries across two chunkers, in two namespaces, on the same corpus. The HuggingFace evaluation cookbook puts it plainly: "tuning the chunk size is both easy and very impactful." One afternoon of labelling. Two namespaces. One number per chunker. The decision stops being a vibe.
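The structure-aware fix in the list above can be sketched in a few lines. This is a minimal illustration, not a library's API: `chunkByStructure`, `splitOversized`, and `countTokens` are hypothetical names, blank lines are assumed to separate headings, paragraphs, and lists, and tokens are approximated by whitespace-separated words.

```typescript
// Rough token estimate: whitespace-separated words. An assumption for this
// sketch; use your embedding model's tokenizer for real budgets.
const countTokens = (s: string): number =>
  s.split(/\s+/).filter(Boolean).length;

// Fallback for a block that exceeds the budget: split on sentence ends and
// line breaks first, then hard-split by words as a last resort.
function splitOversized(block: string, maxTokens: number): string[] {
  const sentences = block
    .replace(/([.!?])\s+/g, "$1\n")
    .split("\n")
    .map((s) => s.trim())
    .filter(Boolean);
  const pieces: string[] = [];
  for (const sentence of sentences) {
    if (countTokens(sentence) <= maxTokens) {
      pieces.push(sentence);
      continue;
    }
    const words = sentence.split(/\s+/).filter(Boolean);
    for (let i = 0; i < words.length; i += maxTokens) {
      pieces.push(words.slice(i, i + maxTokens).join(" "));
    }
  }
  return pieces;
}

function chunkByStructure(text: string, maxTokens: number): string[] {
  // Blank lines separate structural blocks, so a list with one item per
  // line stays together as a single block.
  const blocks = text
    .split(/\n\s*\n/)
    .map((b) => b.trim())
    .filter(Boolean);
  const units = blocks.flatMap((b) =>
    countTokens(b) <= maxTokens ? [b] : splitOversized(b, maxTokens),
  );

  // Greedily pack whole units into chunks without exceeding the budget.
  const chunks: string[] = [];
  let current: string[] = [];
  let used = 0;
  for (const unit of units) {
    const size = countTokens(unit);
    if (current.length > 0 && used + size > maxTokens) {
      chunks.push(current.join("\n\n"));
      current = [];
      used = 0;
    }
    current.push(unit);
    used += size;
  }
  if (current.length > 0) chunks.push(current.join("\n\n"));
  return chunks;
}
```

The token budget only decides how many whole blocks fit in a chunk. A list is split only when it cannot fit in a chunk by itself, which is the opposite of what a fixed 512-token window does.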

Cosine similarity is a candidate generator, not a ranker

We see this on every audit where retrieval k > 10. Top-1 looks plausible. Top-5 contains the right answer. Top-1 is rarely the best of the five. Teams compensate by raising k, putting twenty or fifty chunks in the context window, and hoping the model picks. It does not. Pinecone is direct about why: "LLM recall degrades as we put more tokens in the context window." Top-50 without a reranker makes generation worse, not better. You traded a precision problem for a precision-and-cost problem.

The mechanism is simple. A bi-encoder compresses every document and every query into a single vector. That is what makes vector search fast. It is also what makes vector search lossy. A cross-encoder takes the query and the document together, in one transformer pass, and scores the pair. The same Pinecone explainer: cross-encoders "receive the raw information directly into the large transformer computation, meaning less information loss." More signal, more compute, fewer documents.

  • Symptom: raising k to 20 or 50 chunks; top-1 plausible but rarely best; latency creeping up while answer quality plateaus.
  • Fix: two-stage retrieval. A bi-encoder pulls 50 candidates, then a cross-encoder reranks them and keeps the top N = 5. Latency stays bounded because the expensive model only sees 50 pairs, not your whole index. Pinecone quantifies the alternative: reranking 40 million records with BERT on a V100 would take "more than 50 hours" against "under 100ms with encoder models and vector search." Never rerank the full corpus. Always rerank the shortlist.
  • Measurement: nDCG@5 with and without rerank on the same labelled queries you built for chunking. Pinecone's RAG evaluation guide prescribes nDCG as the headline retrieval metric. One dashboard line. Two builds. The reranker either earned its place or it did not.
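The two-stage shape from the fix above, as a minimal sketch. `VectorSearch` and `PairScorer` are hypothetical interfaces standing in for your vector store and cross-encoder, not any vendor's API, and both are kept synchronous for brevity where real clients would return promises.

```typescript
type Candidate = { id: string; text: string; score: number };

// Hypothetical interfaces: stage one is the vector store (bi-encoder),
// stage two is the cross-encoder that scores a query-passage pair.
type VectorSearch = (query: string, k: number) => Candidate[];
type PairScorer = (query: string, passage: string) => number;

function retrieveTwoStage(
  query: string,
  search: VectorSearch,
  scorePair: PairScorer,
  candidateK = 50,
  topN = 5,
): Candidate[] {
  // Stage 1: cheap, lossy candidate generation over the whole index.
  const candidates = search(query, candidateK);
  // Stage 2: expensive, precise scoring, but only over the shortlist.
  return candidates
    .map((c) => ({ ...c, score: scorePair(query, c.text) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topN);
}
```

The cross-encoder is called at most `candidateK` times per query, which is what keeps latency bounded regardless of index size.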

No metadata means freshness and provenance are a coin flip

The symptom looks innocuous in a chunk viewer and catastrophic in production. A chunk reads "revenue grew by 3% over the previous quarter." No company. No quarter. No filing. Cosine treats it as equal to every other 3% sentence in the corpus, and the model picks one. Simon Willison's writeup of Anthropic's contextual retrieval pattern uses that exact example: the naked chunk "doesn't specify which company it's referring to or the relevant time period." That post is from September 2024. The technique has only become cheaper since, and the failure mode has not moved.

  • Symptom: chunks with bare numbers, undated quotes, ambiguous entity references; same-shape facts collapsing into one in retrieval; the model picking between two near-identical sentences from different years.
  • Fix: two halves, and most teams do one. First, structured metadata on every chunk: source, date, doc_type, author, and any tenancy key your access model needs. Retrieval uses these as filters, not as decoration. Databricks calls this "essential for optimal performance" and groups it with the indexing step, not the optional-enrichment step. Second, contextual prepending: an LLM writes a one-line context per chunk at index time, embedded with the chunk, so the vector itself knows that the 3% is Acme's Q3 2025 revenue. Cost is bounded by prompt caching — Willison's writeup quotes Anthropic at $1.02 per million document tokens as a one-time indexing cost. That is the line that ends the "too expensive" objection in client conversations.
  • Measurement: recall@k on a slice of your eval set that requires disambiguation: same metric, different quarter; same entity, different year; same policy clause, different jurisdiction. Below 0.8 on that slice and metadata is your bottleneck. Above 0.95 and you can move on. Twenty queries that share a metric or entity name are enough to embarrass a metadata-blind index.
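The slice measurement above is a few lines of arithmetic. A minimal sketch, assuming each labelled query carries the chunk ids that should come back and a list of tags. `LabelledQuery`, `recallAtK`, and `meanRecallForTag` are illustrative names, and `runs` is assumed to map a query id to the retrieved chunk ids in rank order.

```typescript
type LabelledQuery = {
  id: string;
  relevantChunkIds: string[];
  tags: string[];
};

// Fraction of the relevant chunks that appear in the top k results.
function recallAtK(retrieved: string[], relevant: string[], k: number): number {
  if (relevant.length === 0) return 0;
  const topK = new Set(retrieved.slice(0, k));
  const hits = relevant.filter((id) => topK.has(id)).length;
  return hits / relevant.length;
}

// Mean recall@k over the queries carrying a given tag.
// Returns NaN when no query has the tag, so an empty slice is not read as 0.
function meanRecallForTag(
  queries: LabelledQuery[],
  runs: Record<string, string[]>,
  tag: string,
  k: number,
): number {
  const slice = queries.filter((q) => q.tags.includes(tag));
  if (slice.length === 0) return NaN;
  const total = slice.reduce(
    (sum, q) => sum + recallAtK(runs[q.id] ?? [], q.relevantChunkIds, k),
    0,
  );
  return total / slice.length;
}
```

Run it once on the metadata-blind index and once with filters or contextual prepending enabled; the difference on the disambiguation slice is the number to report.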

Logging matters here too. If your retrieval calls do not record which chunks came back and which metadata fields filtered them, you cannot debug a bad answer after the fact. We make the case for retrieval-level logs as a governance artefact in /blog/audit-logs-ai; the short version is that the index is the surface the regulator will ask about, not the model.

"It works" is a vibe; an eval set is a contract

The symptom is the demo culture every v1 team falls into. Every PR ships with a screenshot of five favourite queries. Pass rate on favourites is 100%. Pass rate on anything else is unknown. Someone asks "did the last change make it better?" and the only honest answer is "the five queries still pass." Pinecone's evaluation guide names the gap: "your RAG pipeline is only as performant as your retrieval phase is accurate," and accuracy is a number, not a feeling.

  • Symptom: PR screenshots of five favourite queries; no answer to "did the last change help?"; "it works" with no number behind the verb.
  • Fix: put a contract in the repo. 100-200 labelled queries with relevance judgments, retrieval and generation graded separately, the same eval set run on every build. Retrieval grades on nDCG@5. Generation grades on an LLM-judge rubric with a defined 1-5 scale. The HuggingFace cookbook says it plainly: "If instead you give the judge LLM a vague scale to work with, the outputs will not be consistent enough between different examples."
  • Measurement: the delta in nDCG@5 on the same eval set across two builds is the answer to "did that change help?" If you cannot compute it, you do not know.

The eval set itself is a small data structure. Keep it boring.

type EvalQuery = {
  id: string;
  query: string;
  // Doc-and-chunk ids that should appear in the retrieved set.
  relevantChunkIds: string[];
  // Optional: graded relevance for nDCG@5. 0-3 is plenty.
  grades?: Record<string, 0 | 1 | 2 | 3>;
  // Optional: the answer a good system would give. Used by the judge.
  referenceAnswer?: string;
  // Tags for reporting retrieval scores per slice.
  tags: string[]; // e.g. "fresh", "ambiguous", "multi-hop", "policy"
};

Store it as JSON or YAML next to the retriever. The CI job loads it, runs retrieval, computes nDCG@5 per tag, runs generation, asks the judge for a 1-5 score on groundedness and relevance, and writes both numbers to a dashboard. If the numbers move down, the PR does not merge. The eval set is the contract.
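The retrieval half of that CI job might look like the sketch below. It assumes graded relevance on the 0-3 scale and the common exponential gain, 2^grade - 1, discounted by log2(rank + 1). The function names and the `runs` shape (query id to ranked, de-duplicated chunk ids) are illustrative, and the judge step is left out.

```typescript
type Grades = Record<string, number>; // chunk id -> graded relevance, 0-3
type GradedQuery = { id: string; grades: Grades; tags: string[] };

// Discounted cumulative gain over grades listed in rank order.
function dcg(gradesInRankOrder: number[]): number {
  return gradesInRankOrder.reduce(
    (sum, g, i) => sum + (2 ** g - 1) / Math.log2(i + 2),
    0,
  );
}

// nDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking.
function ndcgAtK(retrieved: string[], grades: Grades, k = 5): number {
  const actual = retrieved.slice(0, k).map((id) => grades[id] ?? 0);
  const ideal = Object.values(grades)
    .sort((a, b) => b - a)
    .slice(0, k);
  const idealDcg = dcg(ideal);
  return idealDcg === 0 ? 0 : dcg(actual) / idealDcg;
}

// Mean nDCG@k per tag, so a regression on one slice cannot hide in the average.
function meanNdcgByTag(
  queries: GradedQuery[],
  runs: Record<string, string[]>,
  k = 5,
): Record<string, number> {
  const sums: Record<string, { total: number; n: number }> = {};
  for (const q of queries) {
    const score = ndcgAtK(runs[q.id] ?? [], q.grades, k);
    for (const tag of q.tags) {
      const acc = sums[tag] ?? { total: 0, n: 0 };
      sums[tag] = { total: acc.total + score, n: acc.n + 1 };
    }
  }
  const means: Record<string, number> = {};
  for (const [tag, { total, n }] of Object.entries(sums)) {
    means[tag] = total / n;
  }
  return means;
}

// Tags whose score dropped by more than `tolerance` against the baseline build.
function regressedTags(
  baseline: Record<string, number>,
  current: Record<string, number>,
  tolerance = 0,
): string[] {
  return Object.entries(baseline)
    .filter(([tag, base]) => (current[tag] ?? 0) < base - tolerance)
    .map(([tag]) => tag);
}
```

A CI step would fail the build when `regressedTags` returns a non-empty list. A small tolerance helps if your retriever is not fully deterministic.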

The common objection from engineering leaders is "we have no labelled data." The HuggingFace cookbook is the cheapest answer: generate factoid QA pairs from your own chunks with a strong model, then filter with three critique agents (groundedness, relevance, standalone) on a 1-5 scale, and keep only score-4-or-better on all three. Plan for about half the generated set to fail the filter. The cookbook recommends generating ">200 samples" because critique filtering removes "around half of these." It is an afternoon of work, not a quarter.
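The filtering step is the easy part to get wrong by averaging. A minimal sketch, assuming the generation and critique calls have already run and each synthetic pair carries three 1-5 scores; `SyntheticPair` and `keepHighQuality` are illustrative names, not taken from the cookbook.

```typescript
type Critique = {
  groundedness: number; // 1-5: can the question be answered from the chunk?
  relevance: number; // 1-5: would a real user plausibly ask this?
  standalone: number; // 1-5: does the question make sense without the chunk?
};

type SyntheticPair = {
  question: string;
  answer: string;
  sourceChunkId: string;
  critique: Critique;
};

// Keep a pair only if every critique score clears the bar. Taking the
// minimum means one weak dimension is enough to reject the pair.
function keepHighQuality(pairs: SyntheticPair[], minScore = 4): SyntheticPair[] {
  return pairs.filter((p) => {
    const { groundedness, relevance, standalone } = p.critique;
    return Math.min(groundedness, relevance, standalone) >= minScore;
  });
}
```

Using the minimum rather than the mean matters: a question scoring 5, 5, 2 averages 4 but is not standalone, and it would pollute the eval set.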

This is the same operating discipline we lay out in /blog/production-ai-checklist: the measurement ships before the system.

A static index is a screenshot of the company in March

The symptom is the one customers find for you. Answers cite a policy that was edited last week, a product page that was deprecated last month, or a wiki page someone deleted in February. Deletes never propagate. New docs land in SharePoint and the index has never seen them. The team ran a one-shot ingestion script during the v1 push, the script worked, and nobody rebuilt the wiring for ongoing change. Databricks names the default explicitly: "Databricks recommends you ingest data in a scalable and incremental manner." Incremental ingestion is the default in their docs, not the upgrade.

  • Symptom: customers cite deprecated policies back to support; deleted docs still retrievable; new wiki pages invisible to the index for weeks.
  • Fix: point the index at the source of truth and let change flow through. Triggered or continuous ingestion off the system of record. Deletes treated as first-class events, not skipped because "upsert is easier." Every chunk carries a content_hash and a source_id so a downstream change can find it and remove it. If the source is a database table or a CMS, use a sync mechanism (Delta Sync, CDC streams, webhooks) rather than rerunning a full job nightly. If the source is a file store, watch for moves and deletes, not just creates.
  • Measurement: two numbers on a dashboard. Index-age p95: for the chunks retrieved over the last 24 hours, what is the 95th percentile of "how long ago was this chunk's source last verified?" Deleted-source-still-retrievable rate: of a small sample of recently deleted source documents, how many still return chunks in retrieval? Both should be near zero. If you cannot compute either today, you do not have an ingestion pipeline; you have a one-shot load with a fresh date on it.
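Where the source cannot emit change events, the fix above reduces to a reconciliation pass that compares the source of truth with the index. A minimal sketch, assuming one record per source document on each side, keyed by `sourceId` with a `contentHash`. `planSync` and the record shapes are hypothetical; a CDC stream or webhook would deliver the same upserts and deletes directly.

```typescript
type DocRecord = { sourceId: string; contentHash: string };
type SyncPlan = { upsert: string[]; remove: string[] };

// Compare the system of record with what the index currently holds.
function planSync(source: DocRecord[], indexed: DocRecord[]): SyncPlan {
  const indexedHash = new Map<string, string>();
  for (const doc of indexed) indexedHash.set(doc.sourceId, doc.contentHash);
  const live = new Set(source.map((doc) => doc.sourceId));

  // New documents have no indexed hash; edited documents have a different one.
  const upsert = source
    .filter((doc) => indexedHash.get(doc.sourceId) !== doc.contentHash)
    .map((doc) => doc.sourceId);

  // Deletes are first-class: anything indexed but gone from the source goes.
  const remove = indexed
    .filter((doc) => !live.has(doc.sourceId))
    .map((doc) => doc.sourceId);

  return { upsert, remove };
}
```

For an upserted document, delete every chunk carrying its `sourceId` before writing the new chunks. Otherwise a shorter revision leaves orphaned chunks from the old version in the index.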

Retrieval logs make both metrics cheap. The same logs that prove provenance to a regulator are the logs that compute index age. One artefact, two jobs.
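Both numbers fall out of the log with a filter and a sort. A minimal sketch, assuming each log entry records when a chunk was retrieved and when its source was last verified, both as epoch milliseconds. The field names are illustrative and the percentile uses the nearest-rank method.

```typescript
type RetrievalLogEntry = {
  sourceId: string;
  retrievedAt: number; // epoch ms
  sourceVerifiedAt: number; // epoch ms, last time the source was checked
};
type Deletion = { sourceId: string; deletedAt: number }; // epoch ms

const HOUR_MS = 60 * 60 * 1000;

// Nearest-rank percentile; NaN for an empty input.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1] ?? NaN;
}

// p95 age in hours of the chunks retrieved in the last `windowHours`.
function indexAgeP95Hours(
  log: RetrievalLogEntry[],
  now: number,
  windowHours = 24,
): number {
  const ages = log
    .filter((e) => e.retrievedAt >= now - windowHours * HOUR_MS)
    .map((e) => (e.retrievedAt - e.sourceVerifiedAt) / HOUR_MS);
  return percentile(ages, 95);
}

// Share of sampled deleted documents that were still retrieved after deletion.
function deletedStillRetrievableRate(
  deletions: Deletion[],
  log: RetrievalLogEntry[],
): number {
  if (deletions.length === 0) return 0;
  const leaked = deletions.filter((d) =>
    log.some((e) => e.sourceId === d.sourceId && e.retrievedAt > d.deletedAt),
  );
  return leaked.length / deletions.length;
}
```

The deleted-source rate only sees documents that real traffic happened to retrieve. Replaying a few probe queries for each sampled deletion gives a stricter reading.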

Ship the measurement before the fix

Pick the failure mode your team would be most embarrassed to find in next week's audit. Stand up the measurement first: the eval set from the HuggingFace cookbook recipe, the nDCG@5 dashboard from Pinecone's evaluation guide, or the index-age p95 metric from the Databricks pipeline doc. Then ship the fix against that number. It is the same operating discipline as /blog/2026-05-12-your-agent-passes-the-benchmark-it-will-fail-in-production: the measurement ships before the change.

Skip the five mistakes entirely

Every fix above is real work: re-chunking, standing up a reranker, backfilling metadata, building an eval set, wiring incremental ingestion. We built Retrivon so you do not have to do it five times. It is our managed RAG service — structure-aware chunking, two-stage retrieval with reranking, metadata filtering, and incremental ingestion are the defaults, and recall@k, nDCG@5, and index-age p95 ship on a dashboard from day one.

References