The RAG pipeline at a glance
A RAG system works in two phases. The offline phase prepares your documents for search. The online phase answers questions in real time. Both phases share the same embedding model; that shared model is what lets a question land near the chunks that answer it.
Let's walk through each step.
Step 1: Document ingestion
Before the system can answer questions about your data, it needs to read your data. Ingestion handles the conversion from whatever format your documents are in (PDF, Word, HTML, email, spreadsheet) into clean, searchable text.
This step is more complex than it sounds. PDFs are notoriously difficult — they're a visual format, not a text format. Tables, headers, multi-column layouts, and scanned documents all need different handling.
A typical ingestion pipeline:
- File type detection and routing
- Text extraction (with OCR for scanned documents)
- Table and structure preservation
- Metadata extraction (title, author, date, department)
- Cleaning — removing headers, footers, page numbers, watermarks
Common pitfall: Skipping proper ingestion and feeding raw PDF text into the system. The garbage-in, garbage-out principle applies with full force here.
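As one concrete example of the cleaning step, repeated headers and footers can be detected by counting lines that recur across pages. This is a minimal sketch, not a production extractor; the `min_pages` threshold is an assumption for illustration:

```python
from collections import Counter

def strip_repeated_lines(pages, min_pages=3):
    """Remove lines that appear on at least `min_pages` pages:
    a cheap heuristic for headers, footers, and page furniture."""
    counts = Counter(
        line for page in pages for line in set(page.splitlines())
    )
    repeated = {line for line, n in counts.items() if n >= min_pages}
    return [
        "\n".join(l for l in page.splitlines() if l not in repeated)
        for page in pages
    ]
```

Real pipelines layer several such heuristics, but the principle is the same: anything that repeats verbatim across pages is probably layout, not content.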
Step 2: Chunking
Once you have clean text, you split it into chunks — smaller pieces that can be individually indexed and retrieved. The chunk is the unit of retrieval. When someone asks a question, the system finds the most relevant chunks and passes them to the language model.
Why chunk? Language models have context limits, and not every part of a 50-page document is relevant to every question. Chunking lets the system find just the relevant passages.
Chunking strategies:
- Fixed-size: Split every N tokens with M tokens of overlap. Simple and predictable.
- Sentence-based: Split on sentence boundaries. Preserves natural meaning units.
- Semantic: Use the embedding model to detect topic shifts and split accordingly. More expensive but better quality.
- Hierarchical: Create chunks at multiple levels (paragraph, section, document) and search across levels.
For most business use cases, sentence-based chunking with 500–800-token windows and a 100-token overlap is a solid starting point.
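The sentence-based strategy can be sketched in a few lines. Here the sentence splitter is a naive regex and word count stands in for a real tokenizer; both are simplifying assumptions:

```python
import re

def chunk_by_sentences(text, max_tokens=600, overlap_sentences=2):
    """Greedy sentence packing: fill a chunk up to max_tokens,
    then start the next chunk with the last few sentences as overlap."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # word count as a rough token proxy
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:]  # carry overlap forward
            count = sum(len(s.split()) for s in current)
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The overlap means a fact that straddles a chunk boundary still appears whole in at least one chunk.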
Step 3: Embedding
Each chunk is converted into a vector — a list of numbers that represents its meaning. The embedding model maps text into a high-dimensional space where similar concepts are close together.
"How do I apply for leave?" and "What's the annual leave policy?" are different strings but similar in meaning. A good embedding model places them near each other in vector space, so when the system searches for one, it finds content relevant to both.
Embeddings are generated once during ingestion and stored in the vector database alongside the original text and metadata. At query time, the user's question is also embedded using the same model.
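"Close together" is usually measured with cosine similarity between vectors. A minimal implementation, shown with toy hand-made 3-dimensional vectors rather than output from a real embedding model:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same
    direction (similar meaning), 0.0 means orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings"; real models produce hundreds of dimensions.
apply_for_leave = [0.9, 0.1, 0.0]
leave_policy    = [0.8, 0.2, 0.1]
lunch_menu      = [0.0, 0.1, 0.9]
```

With real embeddings, the two leave-related vectors would score far higher against each other than either does against the lunch menu, which is exactly the property retrieval relies on.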
Step 4: Retrieval
This is where the magic happens. The system takes the query embedding and searches the vector database for the most similar chunk embeddings. The top-k results (usually 3–10) are returned as context for the language model.
Retrieval quality directly determines answer quality. If the system retrieves the wrong chunks, even the best language model can't produce a good answer.
Ways to improve retrieval:
- Hybrid search: Combine vector similarity with keyword matching (BM25). Catches both conceptual and exact-term matches.
- Re-ranking: Score the top-k results with a cross-encoder model that considers the query and each chunk together. More accurate but slower.
- Query expansion: Rephrase the user's question in multiple ways and search with all variations.
- Metadata filtering: Narrow results by department, document type, date range, or access level before vector search.
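At its core, retrieval is a nearest-neighbour search. Here is a brute-force top-k sketch; real vector databases use approximate indexes (e.g. HNSW) to make this fast at scale:

```python
import math

def top_k(query_vec, index, k=5):
    """index: list of (chunk_text, vector) pairs.
    Returns the k chunks whose vectors are most similar to the query."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (
            math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        )

    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _vec in ranked[:k]]
```

Hybrid search and re-ranking both slot in around this core: hybrid search merges these scores with keyword scores, and a re-ranker re-scores the returned list.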
Step 5: Generation
The retrieved chunks are assembled into a prompt along with the user's question and a system instruction (e.g., "answer using only the provided context"). The language model generates a response grounded in that context.
The system prompt typically includes:
- Role and behaviour instructions
- The retrieved context passages
- The user's question
- Output format and citation requirements
- Instructions for when the answer isn't in the context ("say you don't know")
Post-generation, the system may add source citations, validate the response against the retrieved context, or apply additional guardrails before returning it to the user.
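Assembling the prompt itself is plain string work. A sketch of the structure described above; the exact wording and citation format are assumptions:

```python
def build_prompt(question, chunks):
    """Combine behaviour instructions, numbered context passages,
    and the user's question into one grounded prompt."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, 1))
    return (
        "Answer using only the provided context. "
        "Cite passages by their [number]. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Numbering the passages is what makes citation possible: the model can reference `[2]`, and the system can map that back to a source document.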
Putting it all together
The full flow looks like this:
- Your documents are processed, chunked, embedded, and stored (offline — happens once per document)
- A user asks a question
- The question is embedded and searched against the vector database
- The most relevant chunks are retrieved
- The chunks + question are sent to the LLM
- The LLM generates a sourced answer
- The answer is validated and returned to the user
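The online half of that flow fits in one function. `embed` and `llm` here are stand-in callables for your embedding model and language model, and the index is an in-memory list; all three are assumptions for the sketch:

```python
def answer(question, index, embed, llm, k=3):
    """Query-time flow: embed the question, retrieve the top-k chunks,
    build a grounded prompt, and generate."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)

    q_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    context = "\n\n".join(text for text, _vec in ranked[:k])
    prompt = (
        "Answer using only the provided context.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```

Everything else in a production system (hybrid search, re-ranking, validation, citations) is refinement layered onto this skeleton.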
Want the full architecture breakdown? Read RAG Systems Explained for component choices, trade-offs, and deployment considerations.
Key takeaways
- RAG has five core stages: ingest, chunk, embed, retrieve, generate.
- Each stage introduces quality trade-offs — the output is only as good as the weakest link.
- Chunking and retrieval are where most quality issues originate, not the language model.
- You can start with a simple pipeline and add sophistication (re-ranking, hybrid search) later.