
RAG in practice: what actually matters

A RAG demo is easy. A RAG system that survives contact with a real corpus is not.

I’ve been building one for Wondercall AI for a while now, against a document set that looks nothing like the clean Wikipedia articles most tutorials use. Here’s what I’ve learned, ranked roughly by how much it mattered.

Ingestion dominates

Most people optimize the model. Most of the wins are in the pipeline before the model ever sees the text.

For a corpus that’s half scanned PDFs, ingestion is the product. OCR quality, table handling, footnote recovery, deduplication of almost-identical documents — each of these moves retrieval quality more than any prompt-level change I’ve tried.

Concretely: switching from naive page-level chunks to layout-aware chunks that respect sections, tables, and lists roughly doubled my downstream precision on narrow queries. No model change. Same vectors, same top-k, same prompt.
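A minimal sketch of what layout-aware chunking means here, assuming section headings can be spotted with a regex. In practice the boundaries should come from the PDF parser's layout output, not pattern matching — the function name and heuristics are mine, not a library API:

```python
import re

def layout_chunks(text: str, max_chars: int = 1200) -> list[str]:
    """Split on section headings first, then paragraphs, so a chunk
    never straddles a section boundary. Heading detection is a toy
    heuristic; a real pipeline gets structure from the parser."""
    # Treat lines like "3.1 Scope" or "SECTION FOUR" as section starts.
    heading = re.compile(r"^(?:\d+(?:\.\d+)*\s+\S|[A-Z][A-Z ]{3,})", re.M)
    starts = [m.start() for m in heading.finditer(text)] or [0]
    if starts[0] != 0:
        starts.insert(0, 0)
    sections = [text[a:b].strip() for a, b in zip(starts, starts[1:] + [len(text)])]
    chunks = []
    for sec in sections:
        if len(sec) <= max_chars:
            chunks.append(sec)
        else:
            # Fall back to paragraph packing inside an oversized section.
            buf = ""
            for para in sec.split("\n\n"):
                if buf and len(buf) + len(para) > max_chars:
                    chunks.append(buf.strip())
                    buf = ""
                buf += para + "\n\n"
            if buf.strip():
                chunks.append(buf.strip())
    return [c for c in chunks if c]
```

The point of the sketch is the ordering: section boundaries win over chunk-size targets, not the other way around.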

Hybrid retrieval, not pure vector

Pure vector search is a tarpit for specific queries. “What’s the penalty in section 4.3” loses to embedding similarity that thinks “4.3” is basically the same as any other number.

The fix isn’t exotic:

  1. Run a BM25 / keyword pass alongside the vector pass.
  2. Merge results.
  3. Rerank the combined set with a cross-encoder.

The last step is where most of the precision lives. Without reranking, the merged list is noisier than either source alone.
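The three steps above can be sketched end to end. The BM25 here is a toy implementation, and `vector_scores` and `rerank` are stand-in callables — in a real system those would be your embedding similarity and a cross-encoder model:

```python
from collections import Counter
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Minimal BM25 over whitespace tokens -- enough to catch exact
    terms like '4.3' that embeddings blur together."""
    toks = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in toks) / len(toks)
    df = Counter()
    for t in toks:
        df.update(set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for q in query.lower().split():
            if q not in tf:
                continue
            idf = math.log(1 + (len(docs) - df[q] + 0.5) / (df[q] + 0.5))
            s += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

def hybrid_retrieve(query, docs, vector_scores, rerank, top_k=3):
    """Union the keyword and vector candidate sets, then let the
    pluggable cross-encoder rerank(query, doc) order the merged list."""
    kw = bm25_scores(query, docs)
    n = min(top_k * 2, len(docs))
    cand = set(sorted(range(len(docs)), key=lambda i: -kw[i])[:n])
    cand |= set(sorted(range(len(docs)), key=lambda i: -vector_scores[i])[:n])
    ranked = sorted(cand, key=lambda i: -rerank(query, docs[i]))
    return [docs[i] for i in ranked[:top_k]]
```

Note that the final ordering comes entirely from the reranker; the two first-pass scores only decide who gets into the candidate set.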

Refusals are the feature

An AI search tool that answers “I don’t know” well is more useful than one that answers everything confidently.

This requires the system to know when retrieval came back thin, and to refuse instead of guessing.

Guessing is what makes users stop trusting the tool. One confident-and-wrong answer burns more trust than five “I don’t have enough to say on that.”
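A refusal gate can be as small as a score threshold on the retrieved set. The function and both thresholds below are assumptions to tune on your own corpus, not the post's actual implementation:

```python
def answer_or_refuse(retrieved, min_score=0.45, min_hits=2):
    """Refuse when retrieval is thin: too few chunks clear the score bar.
    retrieved is a list of (score, chunk) pairs; thresholds are
    illustrative defaults, not recommendations."""
    strong = [(s, c) for s, c in retrieved if s >= min_score]
    if len(strong) < min_hits:
        return None  # caller renders "I don't have enough to say on that."
    return [c for _, c in strong]
```

Returning `None` rather than a weak context set is the point: the generation step never sees evidence the gate didn't trust.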

Precise citations

“Source: contract-v4.pdf” is not a citation. It’s a stub.

Useful citations link to the specific page and, ideally, the specific paragraph. Users click through. When the citation is precise, they verify and trust; when it’s just a document name, they don’t bother clicking, and the system is indistinguishable from an LLM hallucinating.

This is an infrastructure problem, not a model problem. It means keeping document offsets through the chunking pipeline, storing them with the vectors, and rendering them in the UI — all things you don’t bother with in a demo and must have in production.
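One way to keep that provenance attached, sketched as a dataclass the chunker emits and the index stores alongside the vector. The `#page=…&chars=…` fragment format is an assumption about what your viewer can resolve (PDF viewers widely support `#page=`; the character-range parameter is hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    """A chunk that never loses track of where it came from. These
    fields travel with the vector into the index and out to the UI."""
    text: str
    doc: str
    page: int
    start: int  # character offset within the page
    end: int

def citation(chunk: Chunk) -> str:
    # A deep link the UI can resolve to a page plus highlighted span,
    # instead of a bare document name.
    return f"{chunk.doc}#page={chunk.page}&chars={chunk.start}-{chunk.end}"
```

The design choice is that offsets are part of the chunk's identity from ingestion onward — they can't be reconstructed later if the chunker throws them away.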

Evaluation is retrieval quality

The end-to-end eval everyone reaches for — “did the final answer look right?” — is too noisy to steer on.

What I actually use is a retrieval-level eval: a fixed set of real queries, each labeled with the chunks that should come back, scored on whether those chunks land in the top k.

If retrieval is strong, the model will mostly do fine. If retrieval is weak, no amount of prompt engineering saves you.
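A retrieval-level metric is small enough to sketch in full; `recall_at_k` below (the name and input shapes are mine, not from the post) scores retrieval alone, with no model in the loop:

```python
def recall_at_k(results, labels, k=5):
    """results: query -> ranked list of chunk ids from the retriever.
    labels:  query -> set of hand-labeled relevant chunk ids.
    Returns the fraction of queries where at least one relevant
    chunk appears in the top k."""
    hits = 0
    for query, relevant in labels.items():
        top = set(results.get(query, [])[:k])
        if top & relevant:
            hits += 1
    return hits / len(labels)
```

Because the labels are on chunks rather than answers, a score drop points directly at the retrieval change that caused it.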

Boring things, in order

If you’re building RAG over a real corpus, the order of attention that’s worked for me:

  1. Ingestion quality. OCR, chunking, dedup.
  2. Hybrid retrieval with reranking.
  3. Precise citations.
  4. Explicit, generous refusals.
  5. A retrieval-level eval set you can actually regress against.

Model choice is maybe sixth. It matters less than any of the above.
