April 2026 · 6 min read

Building a RAG Pipeline: Patterns That Worked

After experimenting with different RAG architectures, I wanted to write down the patterns that held up. Not the tutorials where everything works on the first try, but the messy parts: cleaning data nobody warned you about, retrieval strategies that need more thought than "just do top-k", and the failure modes that only show up once you start testing seriously.

Tags: LangGraph, Qdrant, RAG, OpenAI, FastAPI, Python

This post is a personal exploration of general RAG patterns using publicly available tools and techniques. It does not describe any proprietary system or production architecture.

Context and stack

These patterns apply well to content-heavy corpora: lots of documents, structured data, metadata that matters for filtering. The kind of scenario where naive top-k similarity search falls apart fast.

A stack that works well for this kind of problem: LangGraph for conversation orchestration, Qdrant (or any vector store) for indexing, OpenAI embeddings for vectorization, and FastAPI as the serving layer. The patterns here transfer to other stacks; none of them depends on a specific vendor.

[Figure: End-to-end pipeline: indexing (top) and serving (bottom)]

Ingestion is where most of the work goes

One thing worth knowing: cleaning source documents properly eats a large chunk of total effort. Real-world content comes with inconsistent formatting, embedded metadata, and structural noise. Feed that raw into a text splitter and you get garbage chunks.

What helps is detecting the content format first and running format-specific cleaning before chunking. This sounds like boilerplate work, but bad chunks make everything downstream worse. Retrieval gets noisy, the LLM gets confused, and answers degrade in ways that are hard to trace back to the source.
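Here is a minimal sketch of the detect-then-clean idea, using only the standard library. The three formats, the detection heuristics, and the function names are my own assumptions for illustration; a real corpus will need its own rules.

```python
import html
import re


def detect_format(text: str) -> str:
    """Guess the content format with cheap heuristics (illustrative, not exhaustive)."""
    if re.search(r"<(p|div|span|br|h[1-6]|ul|li|table)\b[^>]*>", text, re.IGNORECASE):
        return "html"
    if re.search(r"^#{1,6} \S", text, re.MULTILINE):
        return "markdown"
    return "plain"


def clean_html(text: str) -> str:
    # Drop script/style bodies first, then remaining tags, then decode entities.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", text)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return html.unescape(text)


def clean_markdown(text: str) -> str:
    # Remove YAML front matter: it is metadata, not content to embed.
    return re.sub(r"\A---\n.*?\n---\n", "", text, flags=re.DOTALL)


CLEANERS = {"html": clean_html, "markdown": clean_markdown, "plain": lambda t: t}


def clean_document(text: str) -> str:
    """Run the format-specific cleaner, then a shared whitespace pass."""
    cleaned = CLEANERS[detect_format(text)](text)
    cleaned = re.sub(r"[ \t]+", " ", cleaned)       # collapse runs of spaces
    cleaned = re.sub(r"\n\s*\n+", "\n\n", cleaned)  # at most one blank line
    return cleaned.strip()
```

The dispatch table keeps each cleaner small, so adding a new format means adding one function and one dictionary entry rather than growing a single do-everything cleaner.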

Chunking that actually helps retrieval

A two-stage approach works well. First, split on document headers to keep structure intact. Then run each section through a recursive character splitter with ordered separators: section breaks, paragraph breaks, lines, sentences, words.
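The two stages can be sketched in plain Python as follows. This is a simplified stand-in written for illustration, not a particular library's splitter: it assumes Markdown-style `#` headers, and the separator list and function names are my own choices.

```python
import re

# Ordered from coarsest to finest: section breaks, paragraphs, lines, sentences, words.
SEPARATORS = ["\n\n\n", "\n\n", "\n", ". ", " "]


def split_on_headers(text: str) -> list[str]:
    """Stage 1: start a new section at every Markdown-style header line."""
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [p.strip() for p in parts if p.strip()]


def recursive_split(text: str, max_chars: int, separators=SEPARATORS) -> list[str]:
    """Stage 2: use the coarsest separator available, recursing on oversized pieces."""
    if len(text) <= max_chars:
        return [text]
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            chunks.extend(recursive_split(piece, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks


def chunk_document(text: str, max_chars: int = 3000) -> list[str]:
    chunks = []
    for section in split_on_headers(text):
        chunks.extend(recursive_split(section, max_chars))
    return [c.strip() for c in chunks if c.strip()]
```

Because stage 1 runs first, no chunk ever spans two sections, which is the property that keeps document structure intact.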

The thing that made the biggest difference was prepending contextual metadata to every chunk. A few lines at the top with enough context for the embedding to distinguish this chunk from similar ones. Without this, generic-sounding chunks all land in the same region of embedding space and retrieval precision drops.
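A minimal sketch of that prepending step. The field names (`title`, `section`, `content_type`) are hypothetical; use whatever metadata actually distinguishes documents in your corpus.

```python
def with_context_header(chunk: str, metadata: dict[str, str]) -> str:
    """Prefix a chunk with a short context block before it is embedded."""
    # Hypothetical field names, listed in the order they should appear.
    fields = ("title", "section", "content_type")
    lines = [f"{name}: {metadata[name]}" for name in fields if metadata.get(name)]
    if not lines:
        return chunk
    return "\n".join(lines) + "\n\n" + chunk
```

For example, `with_context_header("Body", {"title": "Pricing", "section": "Refunds"})` yields a chunk that starts with `title: Pricing` and `section: Refunds`, followed by a blank line and the original text.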

The sweet spot tends to be somewhere in the 2000-4000 character range with about 10% overlap. Smaller chunks improve precision, but the LLM gets less context and starts hallucinating details. Bigger chunks have the opposite problem: the LLM gets plenty of context, but each embedding covers several topics at once and retrieval precision drops. You have to experiment with your own corpus to find the right balance.
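One simple way to get that overlap, sketched under the assumption that chunks arrive in document order: prefix each chunk with the tail of the previous one. The function name and the character-based tail are my own simplifications; a real splitter would usually cut the tail at a word or sentence boundary.

```python
def add_overlap(chunks: list[str], overlap_ratio: float = 0.10) -> list[str]:
    """Prefix each chunk with the trailing part of the previous chunk."""
    out = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            out.append(chunk)
            continue
        prev = chunks[i - 1]
        tail_len = int(len(prev) * overlap_ratio)
        # Slice from the end; guard against tail_len == 0, where prev[-0:] would be the whole string.
        tail = prev[len(prev) - tail_len:] if tail_len else ""
        out.append(tail + chunk)
    return out
```

Keeping `overlap_ratio` as a parameter makes it easy to sweep a few values against a fixed set of test queries instead of guessing.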

Retrieval: the part that actually matters

This is where most of the interesting engineering lives. Simple top-k similarity search works fine for generic questions, but it breaks down once queries reference specific entities or categories in your corpus. That kind of query needs metadata-aware retrieval, not just vector similarity.

Query rewriting

Before retrieval, the user message gets rewritten by a small LLM call into a better search query. Conversational messages like "yeah what about that?" get expanded using chat history into something self-contained. Obvious idea, but it tends to be the single most impactful change for retrieval quality.
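A rough sketch of the rewriting step. The LLM is passed in as a plain `complete(prompt) -> str` callable so the example does not depend on any provider's client; the prompt wording, the `max_turns` cutoff, and the function name are assumptions.

```python
from typing import Callable

REWRITE_PROMPT = (
    "Rewrite the user's last message as a self-contained search query.\n"
    "Resolve pronouns and references using the chat history.\n"
    "Return only the query.\n\n"
    "Chat history:\n{history}\n\nLast message: {message}\n\nSearch query:"
)


def rewrite_query(
    message: str,
    history: list[tuple[str, str]],   # (role, text) pairs, oldest first
    complete: Callable[[str], str],   # assumed wrapper around a small LLM call
    max_turns: int = 6,
) -> str:
    recent = history[-max_turns:]     # only the latest turns are needed for context
    history_text = "\n".join(f"{role}: {text}" for role, text in recent) or "(none)"
    prompt = REWRITE_PROMPT.format(history=history_text, message=message)
    rewritten = complete(prompt).strip().strip('"')
    # Fall back to the original message if the model returns nothing usable.
    return rewritten or message
```

The fallback matters more than it looks: an empty or malformed rewrite should degrade to plain retrieval on the original message, not to an empty search.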

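One quirk worth knowing: if your corpus is narrowly scoped, the same entity name (say, the one product or organization the whole corpus is about) shows up in nearly every rewritten query and nearly every chunk, so it adds noise to the embedding without helping to tell chunks apart. A post-processing step that normalizes or removes the dominant entity can improve retrieval relevance; measure it on your own queries.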
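That post-processing step can be as small as a regex pass. A sketch, assuming you maintain a short list of dominant entity names by hand ("Acme" below is a made-up example):

```python
import re


def strip_dominant_entity(query: str, entity_names: list[str]) -> str:
    """Remove corpus-wide entity names (and a trailing 's) from a search query."""
    cleaned = query
    for name in entity_names:
        # Word boundaries avoid clipping longer words that contain the name.
        cleaned = re.sub(rf"\b{re.escape(name)}\b('s)?", " ", cleaned, flags=re.IGNORECASE)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # If the query was nothing but the entity, keep the original.
    return cleaned or query
```

For example, `strip_dominant_entity("Acme refund policy for Acme annual plans", ["Acme"])` returns `"refund policy for annual plans"`. Note that `\b` only works for names that start and end with a word character, so a name like "C++" would need different handling.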

Entity-aware hybrid retrieval

When a query mentions a specific entity that maps to your metadata (a topic, a category, a keyword), a parallel LLM call can extract it. If it finds one, you can run two searches:

Filtered search (larger k): vector search with a metadata filter scoping to the relevant content type and extracted entity
Unfiltered search (smaller k): regular similarity search across all content

Results are merged and deduplicated. The filtered results come first (higher priority), then general results fill in. This matters because if you only do filtered search, you miss the broader contextual pages. But if you only do unfiltered search, the most relevant domain-specific documents get buried.
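The merge logic is independent of the vector store, so here it is with the search call injected. The `search(query, k, metadata_filter)` signature, the hit shape (a dict with an `id`), the filter keys, and the default `k` values are all assumptions for illustration; map them onto your store's client.

```python
from typing import Callable, Optional

# Assumed search signature: (query, k, metadata_filter) -> list of hits,
# where each hit is a dict with at least an "id" key.
SearchFn = Callable[[str, int, Optional[dict]], list[dict]]


def hybrid_retrieve(
    query: str,
    entity: Optional[dict],   # e.g. {"type": "product", "value": "x"}, or None
    search: SearchFn,
    filtered_k: int = 8,
    unfiltered_k: int = 4,
    limit: int = 10,
) -> list[dict]:
    if entity is None:
        # No entity extracted: plain similarity search.
        return search(query, limit, None)

    metadata_filter = {"content_type": entity["type"], "entity": entity["value"]}
    filtered = search(query, filtered_k, metadata_filter)
    general = search(query, unfiltered_k, None)

    # Filtered hits first (higher priority), then general hits; drop duplicates by id.
    merged, seen = [], set()
    for hit in filtered + general:
        if hit["id"] not in seen:
            seen.add(hit["id"])
            merged.append(hit)
    return merged[:limit]
```

Deduplicating in list order is what enforces the priority: a document found by both searches keeps its filtered-search position.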

[Figure: Hybrid retrieval: filtered results get priority, general context fills in]

Parallel classification: trading money for lower latency

Most RAG pipelines need to answer several questions about a user message before retrieval: what's the intent? Does this need RAG at all? What entities are present? Is the query ambiguous?

These are all separate LLM calls. Running them sequentially adds up fast: several calls at a few hundred milliseconds each. Running them in parallel with a ThreadPoolExecutor changes the math:

Intent classification (does this need RAG?)
Entity extraction
Query rewriting for search
Ambiguity or follow-up detection

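Total latency becomes the time of the slowest call, not the sum. The tradeoff is cost: N calls means N× the tokens per turn, and in parallel you pay for all N even when one result (say, "this doesn't need RAG") would have made the others unnecessary. For most use cases that's worth it. For a side project, run them sequentially and stop as soon as you have enough to decide.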
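A sketch of the fan-out with `concurrent.futures`. Each classifier is assumed to be a function that takes the user message and makes one LLM call; the names and the choice to record failures as `None` are mine.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def classify_in_parallel(
    message: str,
    classifiers: dict[str, Callable[[str], object]],
    timeout_s: float = 10.0,
) -> dict[str, object]:
    """Run every classifier on its own thread and collect results by name."""
    results: dict[str, object] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(classifiers))) as pool:
        futures = {name: pool.submit(fn, message) for name, fn in classifiers.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=timeout_s)
            except Exception:
                # One failed or slow classifier should not sink the whole turn.
                # Note: leaving the `with` block still waits for running calls to finish.
                results[name] = None
    return results
```

Threads are enough here because the work is I/O-bound: each thread spends its time waiting on a network response, not holding the interpreter.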

A gotcha with LangGraph specifically: deepcopy the state before passing it to parallel threads. LangGraph state is usually a mutable dict (a TypedDict is still a plain dict at runtime), and concurrent writes to the same dict produce non-reproducible bugs that only appear under load.
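The fix is small. In this sketch the state is a plain dict, each worker gets its own deep copy, and results are merged back on the calling thread only; `run_isolated` and the worker signature are hypothetical names, not LangGraph API.

```python
import copy
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_isolated(state: dict, workers: dict[str, Callable[[dict], object]]) -> dict:
    """Run workers in parallel, each on a private deep copy of the state."""
    with ThreadPoolExecutor(max_workers=max(1, len(workers))) as pool:
        # The copies are made here, on the calling thread, before any worker starts.
        futures = {
            name: pool.submit(fn, copy.deepcopy(state))
            for name, fn in workers.items()
        }
        updates = {name: future.result() for name, future in futures.items()}
    # Single-threaded merge: the shared state is never written concurrently.
    return {**state, **updates}
```

A shallow copy is not enough if the state holds nested lists such as the message history, because the threads would still share those inner objects.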

Hallucination prevention: the empty results problem

The worst failure mode in RAG isn't a wrong answer. It's a confidently wrong answer about something the corpus doesn't cover. If the retriever returns nothing, the LLM will fill in the gap from its training data. That's hallucination, and in production it's worse than saying "I don't know."

The pattern that works: when retrieval returns zero relevant documents, skip the LLM entirely and return an honest templated response. No generation, no hallucination risk. The user gets a straightforward "I don't have information on that" and a path to continue.
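In code, the guard sits between retrieval and generation. A minimal sketch: the `score` field on each document, the `min_score` threshold, and the wording of the fallback are assumptions, and `generate` stands in for whatever function makes the LLM call.

```python
from typing import Callable

FALLBACK_MESSAGE = (
    "I don't have information on that. "
    "Could you rephrase, or ask about a related topic?"
)


def answer_or_fallback(
    question: str,
    docs: list[dict],                      # each doc: {"text": str, "score": float}
    generate: Callable[[str, str], str],   # (question, context) -> answer
    min_score: float = 0.3,                # assumed relevance cutoff; tune per corpus
) -> str:
    relevant = [d for d in docs if d["score"] >= min_score]
    if not relevant:
        # Nothing relevant retrieved: return the template and never call the LLM.
        return FALLBACK_MESSAGE
    context = "\n\n".join(d["text"] for d in relevant)
    return generate(question, context)
```

Filtering by score before the emptiness check covers the subtler case where the retriever returns something, but nothing close enough to count as relevant.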

The retriever is your ground truth. If it finds nothing, don't ask the LLM to improvise.

Graph-based conversation flow

Multi-turn RAG conversations need routing logic after every response. Should the next step be a follow-up question? A handoff? A redirect? LangGraph's StateGraph is good for this because it makes the flow an explicit state machine with conditional edges.

One non-obvious insight: answer the question before routing. If someone asks something, give them the RAG response first, then decide on the next action. The instinct is to route early, but users who get their question answered first are more willing to go along with whatever you suggest next.
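The ordering is easier to see without the framework. This sketch uses plain functions rather than LangGraph itself: `answer_node` plays the role of the RAG node, and `route_after_answer` plays the role of the function you would hand to a conditional edge. The state keys and the three routes are hypothetical.

```python
from typing import Callable


def route_after_answer(state: dict) -> str:
    """Pick the next step. Runs only once the answer is already in the state."""
    if state.get("user_asked_for_human"):
        return "handoff"
    if not state.get("retrieved_docs"):
        return "redirect"
    return "follow_up"


NEXT_STEPS: dict[str, Callable[[dict], dict]] = {
    "handoff": lambda s: {**s, "next_action": "Offer to connect the user with a person."},
    "redirect": lambda s: {**s, "next_action": "Suggest topics the corpus does cover."},
    "follow_up": lambda s: {**s, "next_action": "Ask one clarifying follow-up question."},
}


def run_turn(state: dict, answer_node: Callable[[dict], dict]) -> dict:
    state = answer_node(state)          # 1. answer the question first
    step = route_after_answer(state)    # 2. only then decide where to go next
    return NEXT_STEPS[step](state)
```

Because routing reads the post-answer state, it can also use what the answer step learned, such as whether retrieval came back empty.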

Worth trying next

Smaller, more focused chunks with a parent-child retrieval strategy. Large chunks work, but they're a compromise. I'd try storing small child chunks for retrieval precision and fetching the larger parent chunk for LLM context. Several frameworks support this now out of the box.
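A bare-bones sketch of the parent-child idea, without any framework: index small children that remember their parent, search over the children, then hand the parents to the LLM. Fixed-size character windows and the function names are simplifications of mine.

```python
def build_child_chunks(parents: dict[str, str], child_size: int = 400) -> list[dict]:
    """Cut each parent chunk into small children that keep a pointer to the parent."""
    children = []
    for parent_id, text in parents.items():
        for start in range(0, len(text), child_size):
            children.append({"parent_id": parent_id, "text": text[start:start + child_size]})
    return children


def expand_to_parents(child_hits: list[dict], parents: dict[str, str], limit: int = 3) -> list[str]:
    """Map ranked child hits back to unique parent texts, preserving rank order."""
    ordered_ids: list[str] = []
    for hit in child_hits:
        if hit["parent_id"] not in ordered_ids:
            ordered_ids.append(hit["parent_id"])
    return [parents[pid] for pid in ordered_ids[:limit]]
```

Only the children get embedded; the parents live in a plain key-value store, so the extra storage cost is modest.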

Evaluation from day one. Setting up an evaluation pipeline early saves a lot of pain. Every time you change a prompt, a chunk size, or a retrieval parameter, you need to know what broke. A judge LLM scoring conversations against a rubric works well for this, but it's the kind of thing that's easy to put off until you've already introduced regressions you can't trace.
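Even a tiny harness is enough to catch regressions. In this sketch `pipeline` is the RAG system under test and `judge` wraps the judge LLM, returning a score between 0 and 1; both are injected, and the case format, the 0.05 tolerance, and the function names are assumptions.

```python
from typing import Callable


def run_eval(
    cases: list[dict],                         # each: {"id", "question", "rubric"}
    pipeline: Callable[[str], str],            # question -> answer
    judge: Callable[[str, str, str], float],   # (question, answer, rubric) -> score in [0, 1]
) -> dict[str, float]:
    """Score every test case with the judge and return scores keyed by case id."""
    scores = {}
    for case in cases:
        answer = pipeline(case["question"])
        scores[case["id"]] = judge(case["question"], answer, case["rubric"])
    return scores


def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,
) -> list[str]:
    """Return ids of cases whose score dropped by more than the tolerance."""
    return sorted(
        case_id
        for case_id, score in current.items()
        if case_id in baseline and score < baseline[case_id] - tolerance
    )
```

Store the baseline scores alongside the code, and run `find_regressions` after every prompt, chunk-size, or retrieval change; the tolerance absorbs some of the judge's own run-to-run noise.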

Reranking helps more than you'd think. Adding a cross-encoder reranker on top of initial retrieval results noticeably improves relevance, especially for longer queries where the best semantic match isn't the most useful document for the actual question. If you haven't tried reranking, it's a high-impact addition for relatively low effort.
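The reranking step itself is only a few lines once you have a scorer. Here the scorer is injected as `score_pairs`, assumed to take a list of (query, text) pairs and return one relevance score per pair, which is the shape a cross-encoder's batch prediction typically has. The candidate format and `top_n` default are illustrative.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    candidates: list[dict],   # each: {"text": str, ...}, from first-stage retrieval
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_n: int = 5,
) -> list[dict]:
    """Re-order first-stage candidates by cross-encoder score and keep the best few."""
    if not candidates:
        return []
    pairs = [(query, c["text"]) for c in candidates]
    scores = score_pairs(pairs)   # one batched call instead of one call per candidate
    # Sort indices by score, highest first; ties keep their first-stage order.
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_n]]
```

A common setup is to retrieve generously in the first stage (say, a few dozen candidates) and let the reranker cut that down to what actually goes into the prompt.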

Cache the classification results. The parallel classification calls are a major cost driver because they run on every turn. Many users ask similar categories of questions. A semantic cache (embedding similarity on the query, return cached classification if it's close enough) could cut costs without hurting quality much.
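A minimal in-memory sketch of such a cache. The `embed` function is injected, the 0.92 threshold is a guess that would need tuning, and the linear scan is only reasonable for small caches; at scale the lookup would go through the vector store instead.

```python
import math
from typing import Callable, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a cached value when a new query is close enough to a stored one."""

    def __init__(self, embed: Callable[[str], Sequence[float]], threshold: float = 0.92):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Sequence[float], object]] = []

    def get(self, query: str) -> Optional[object]:
        vector = self.embed(query)
        best_score, best_value = 0.0, None
        for stored_vector, value in self.entries:   # linear scan over all entries
            score = cosine(vector, stored_vector)
            if score > best_score:
                best_score, best_value = score, value
        return best_value if best_score >= self.threshold else None

    def put(self, query: str, value: object) -> None:
        self.entries.append((self.embed(query), value))
```

Usage is a get-then-put around the classification call: check `cache.get(query)` first, and on a miss run the classifiers and `cache.put(query, result)`. A threshold that is too low returns the wrong classification for a merely similar query, so it is worth checking cache hits in the evaluation loop as well.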

Wrapping up

Most of the work in a production RAG system isn't the retrieval or the generation. It's cleaning the source data, making metadata useful, handling the cases when retrieval returns nothing, and having an evaluation loop so you notice when things get worse. The LLM call at the end is the easy part.

If you're working through similar problems, feel free to reach out. Always happy to compare notes.