The RAG pipeline at a glance
A RAG system works in two phases. The offline phase prepares your documents for search. The online phase answers questions in real time. Both phases share the same embedding model; that shared model is what lets a question land near the chunks that answer it.
Let's walk through each step.
Step 1: Document ingestion
Before the system can answer questions about your data, it needs to read your data. Ingestion handles the conversion from whatever format your documents are in (PDF, Word, HTML, email, spreadsheet) into clean, searchable text.
This step is more complex than it sounds. PDFs are notoriously difficult — they're a visual format, not a text format. Tables, headers, multi-column layouts, and scanned documents all need different handling.
A typical ingestion pipeline:
- File type detection and routing
- Text extraction (with OCR for scanned documents)
- Table and structure preservation
- Metadata extraction (title, author, date, department)
- Cleaning — removing headers, footers, page numbers, watermarks
Common pitfall: Skipping proper ingestion and feeding raw PDF text into the system. The garbage-in, garbage-out principle applies with full force here.
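As one concrete example of the cleaning step, repeated headers and footers can be detected by counting lines that recur across pages. This is a minimal sketch, not a production extractor; the `min_pages` threshold is an assumption for illustration:

```python
from collections import Counter

def strip_repeated_lines(pages, min_pages=3):
    """Remove lines that appear on at least `min_pages` pages:
    a cheap heuristic for headers, footers, and page furniture."""
    counts = Counter(
        line for page in pages for line in set(page.splitlines())
    )
    repeated = {line for line, n in counts.items() if n >= min_pages}
    return [
        "\n".join(l for l in page.splitlines() if l not in repeated)
        for page in pages
    ]
```

Real pipelines layer several such heuristics, but the principle is the same: anything that repeats verbatim across pages is probably layout, not content.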
Step 2: Chunking
Once you have clean text, you split it into chunks — smaller pieces that can be individually indexed and retrieved. The chunk is the unit of retrieval. When someone asks a question, the system finds the most relevant chunks and passes them to the language model.
Why chunk? Language models have context limits, and not every part of a 50-page document is relevant to every question. Chunking lets the system find just the relevant passages.
Chunking strategies:
- Fixed-size: Split every N tokens with M tokens of overlap. Simple and predictable.
- Sentence-based: Split on sentence boundaries. Preserves natural meaning units.
- Semantic: Use the embedding model to detect topic shifts and split accordingly. More expensive but better quality.
- Hierarchical: Create chunks at multiple levels (paragraph, section, document) and search across levels.
For most business use cases, sentence-based chunking with 500–800-token windows and a 100-token overlap is a solid starting point.
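The sentence-based strategy can be sketched in a few lines. Here the sentence splitter is a naive regex and word count stands in for a real tokenizer; both are simplifying assumptions:

```python
import re

def chunk_by_sentences(text, max_tokens=600, overlap_sentences=2):
    """Greedy sentence packing: fill a chunk up to max_tokens,
    then start the next chunk with the last few sentences as overlap."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # word count as a rough token proxy
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:]  # carry overlap forward
            count = sum(len(s.split()) for s in current)
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The overlap means a fact that straddles a chunk boundary still appears whole in at least one chunk.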
Step 3: Embedding
Each chunk is converted into a vector — a list of numbers that represents its meaning. The embedding model maps text into a high-dimensional space where similar concepts are close together.
"How do I apply for leave?" and "What's the annual leave policy?" are different strings but similar in meaning. A good embedding model places them near each other in vector space, so when the system searches for one, it finds content relevant to both.
Embeddings are generated once during ingestion and stored in the vector database alongside the original text and metadata. At query time, the user's question is also embedded using the same model.
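"Close together" is usually measured with cosine similarity between vectors. A minimal implementation, shown with toy hand-made 3-dimensional vectors rather than output from a real embedding model:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same
    direction (similar meaning), 0.0 means orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings"; real models produce hundreds of dimensions.
apply_for_leave = [0.9, 0.1, 0.0]
leave_policy    = [0.8, 0.2, 0.1]
lunch_menu      = [0.0, 0.1, 0.9]
```

With real embeddings, the two leave-related vectors would score far higher against each other than either does against the lunch menu, which is exactly the property retrieval relies on.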
Step 4: Retrieval
This is where the magic happens. The system takes the query embedding and searches the vector database for the most similar chunk embeddings. The top-k results (usually 3–10) are returned as context for the language model.
Retrieval quality directly determines answer quality. If the system retrieves the wrong chunks, even the best language model can't produce a good answer.
Ways to improve retrieval:
- Hybrid search: Combine vector similarity with keyword matching (BM25). Catches both conceptual and exact-term matches.
- Re-ranking: Score the top-k results with a cross-encoder model that considers the query and each chunk together. More accurate but slower.
- Query expansion: Rephrase the user's question in multiple ways and search with all variations.
- Metadata filtering: Narrow results by department, document type, date range, or access level before vector search.
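At its core, retrieval is a nearest-neighbour search. Here is a brute-force top-k sketch; real vector databases use approximate indexes (e.g. HNSW) to make this fast at scale:

```python
import math

def top_k(query_vec, index, k=5):
    """index: list of (chunk_text, vector) pairs.
    Returns the k chunks whose vectors are most similar to the query."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (
            math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        )

    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _vec in ranked[:k]]
```

Hybrid search and re-ranking both slot in around this core: hybrid search merges these scores with keyword scores, and a re-ranker re-scores the returned list.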
Step 5: Generation
The retrieved chunks are assembled into a prompt along with the user's question and a system instruction (e.g., "answer using only the provided context"). The language model generates a response grounded in that context.
The system prompt typically includes:
- Role and behaviour instructions
- The retrieved context passages
- The user's question
- Output format and citation requirements
- Instructions for when the answer isn't in the context ("say you don't know")
Post-generation, the system may add source citations, validate the response against the retrieved context, or apply additional guardrails before returning it to the user.
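Assembling the prompt itself is plain string work. A sketch of the structure described above; the exact wording and citation format are assumptions:

```python
def build_prompt(question, chunks):
    """Combine behaviour instructions, numbered context passages,
    and the user's question into one grounded prompt."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, 1))
    return (
        "Answer using only the provided context. "
        "Cite passages by their [number]. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Numbering the passages is what makes citation possible: the model can reference `[2]`, and the system can map that back to a source document.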
Putting it all together
The full flow looks like this:
- Your documents are processed, chunked, embedded, and stored (offline — happens once per document)
- A user asks a question
- The question is embedded and searched against the vector database
- The most relevant chunks are retrieved
- The chunks + question are sent to the LLM
- The LLM generates a sourced answer
- The answer is validated and returned to the user
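The online half of that flow fits in one function. `embed` and `llm` here are stand-in callables for your embedding model and language model, and the index is an in-memory list; all three are assumptions for the sketch:

```python
def answer(question, index, embed, llm, k=3):
    """Query-time flow: embed the question, retrieve the top-k chunks,
    build a grounded prompt, and generate."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)

    q_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    context = "\n\n".join(text for text, _vec in ranked[:k])
    prompt = (
        "Answer using only the provided context.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```

Everything else in a production system (hybrid search, re-ranking, validation, citations) is refinement layered onto this skeleton.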
Want the full architecture breakdown? Read RAG Systems Explained for component choices, trade-offs, and deployment considerations.
Key takeaways
- RAG has five core stages: ingest, chunk, embed, retrieve, generate.
- Each stage introduces quality trade-offs — the output is only as good as the weakest link.
- Chunking and retrieval are where most quality issues originate, not the language model.
- You can start with a simple pipeline and add sophistication (re-ranking, hybrid search) later.