What is a RAG system?
A RAG system (retrieval-augmented generation) connects a language model to your organisation's data so it can answer questions grounded in real documents rather than general training knowledge. It's one of the most practical AI architectures for business knowledge applications.
If you haven't read our intro piece, start with What Is RAG? for the basics. This article goes deeper into the architecture.
Core components
Every RAG system has these building blocks:
Document processor
Handles ingestion. Takes your raw documents (PDFs, Word files, HTML, databases) and converts them into text. This might involve OCR for scanned documents, table extraction, or stripping formatting.
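For HTML sources, the "convert to text" step can be as simple as stripping tags while skipping non-visible content. A minimal sketch using only Python's standard library (real pipelines typically use dedicated extraction tools, and the class name here is our own):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML fragment as one string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

Real documents need more care (encoding, tables, OCR output), but the shape is the same: raw bytes in, clean text out.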
Chunking engine
Splits processed text into smaller pieces (chunks) that the retrieval system can index and search. Chunk size, overlap, and boundary strategy all affect quality. Too small and you lose context. Too large and you dilute relevance.
Embedding model
Converts text chunks into numerical vectors (embeddings) that capture semantic meaning. Similar concepts end up as similar vectors. Popular choices include OpenAI's text-embedding-3-large, Cohere Embed, and open-source models like BGE.
Vector database
Stores embeddings and enables fast similarity search. When a query comes in, the vector DB finds the chunks most semantically similar to the question. Options include Pinecone, Weaviate, pgvector, and OpenSearch.
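Conceptually, a vector DB query is a top-k similarity search. A brute-force sketch (production systems like Pinecone or pgvector use approximate indexes to make this fast at scale; the data layout here is illustrative):

```python
import math
from heapq import nlargest


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, index, k=3):
    """index: list of (chunk_id, vector) pairs. Returns the k most similar chunk ids."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in index]
    return [chunk_id for score, chunk_id in nlargest(k, scored)]
```

Swapping the brute-force scan for an approximate nearest-neighbour index is what the dedicated databases are for; the interface stays the same.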
Retrieval engine
Orchestrates the search. Converts the user query to an embedding, queries the vector DB, and optionally applies re-ranking, filtering, or hybrid search (combining vector + keyword search).
Language model (LLM)
Takes the retrieved chunks plus the user's question and generates a natural-language answer. GPT-4, Claude, and Llama are common choices.
Response formatter
Structures the output: source citations, table formatting, stripping unsafe content, or converting to the format your application needs.
The full pipeline
Here's the end-to-end flow:
- Ingest: Documents → processor → chunking → embedding → stored in vector DB (plus metadata)
- Query: User question → embedding → vector DB search → top-k relevant chunks retrieved
- Augment: Retrieved chunks + system prompt + user question → assembled as context for the LLM
- Generate: LLM produces answer grounded in the retrieved context
- Post-process: Add source citations, validate output, apply guardrails, return to user
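The "Augment" step above is just string assembly: retrieved chunks plus the question become the prompt the LLM sees. A minimal sketch (the prompt wording is illustrative, not a recommended template):

```python
def build_prompt(question, chunks,
                 system_prompt="Answer using only the context below."):
    """Assemble retrieved chunks and the user question into an LLM prompt."""
    # Number each chunk so the model can cite sources as [1], [2], ...
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return f"{system_prompt}\n\nContext:\n{context}\n\nQuestion: {question}"
```

Numbering the chunks is what makes the post-processing step's source citations possible: the model can refer back to `[1]`, `[2]`, and so on.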
The bottleneck is almost always retrieval, not generation. If the system retrieves the wrong chunks, the LLM can't save it. Focus your optimisation effort on the ingest, query, and augment stages rather than on the model itself.
Architecture choices
Chunking strategy
Options range from simple (fixed-size with overlap) to sophisticated (semantic chunking based on topic boundaries). For most business documents, chunks of 500–1,000 tokens with a 100-token overlap are a solid starting point.
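The simple end of that range, fixed-size chunks with overlap, fits in a few lines. A sketch operating on a pre-tokenised list (real tokenisers vary; the defaults mirror the numbers above):

```python
def chunk_text(tokens, size=800, overlap=100):
    """Split a token list into fixed-size chunks, each overlapping the last."""
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last chunk reached the end of the document
    return chunks
```

The overlap means a sentence split across a chunk boundary still appears whole in at least one chunk, which is the main failure mode this strategy guards against.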
Retrieval method
- Dense retrieval: Vector similarity only. Works well for conceptual questions.
- Sparse retrieval: Keyword-based (BM25). Better for exact terms, codes, product names.
- Hybrid: Combines both. Usually the best choice for business applications.
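One common way to combine the two is score fusion: normalise each ranker's scores, then blend with a weight. A sketch assuming you already have dense and sparse scores per chunk (the `alpha` weighting scheme here is one of several fusion approaches; reciprocal rank fusion is another):

```python
def hybrid_scores(dense, sparse, alpha=0.5):
    """Blend dense and sparse retrieval scores.

    dense, sparse: dicts mapping chunk_id -> raw score from each retriever.
    alpha: weight on the dense score (1 - alpha goes to sparse).
    """
    def min_max(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero when all scores tie
        return {k: (v - lo) / span for k, v in scores.items()}

    nd, ns = min_max(dense), min_max(sparse)
    ids = set(nd) | set(ns)  # a chunk may appear in only one result list
    return {i: alpha * nd.get(i, 0.0) + (1 - alpha) * ns.get(i, 0.0) for i in ids}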
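One common way to combine the two is score fusion: normalise each ranker's scores, then blend with a weight. A sketch assuming you already have dense and sparse scores per chunk (the `alpha` weighting scheme here is one of several fusion approaches; reciprocal rank fusion is another):

```python
def hybrid_scores(dense, sparse, alpha=0.5):
    """Blend dense and sparse retrieval scores.

    dense, sparse: dicts mapping chunk_id -> raw score from each retriever.
    alpha: weight on the dense score (1 - alpha goes to sparse).
    """
    def min_max(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero when all scores tie
        return {k: (v - lo) / span for k, v in scores.items()}

    nd, ns = min_max(dense), min_max(sparse)
    ids = set(nd) | set(ns)  # a chunk may appear in only one result list
    return {i: alpha * nd.get(i, 0.0) + (1 - alpha) * ns.get(i, 0.0) for i in ids}
```

Normalising first matters because BM25 and cosine scores live on different scales; without it, one retriever silently dominates.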
Re-ranking
After initial retrieval, a re-ranker model scores the top results for relevance to the specific question. Adds latency but significantly improves answer quality. Cohere Rerank and cross-encoder models are popular choices.
Metadata filtering
Tagging chunks with metadata (document type, department, date, access level) lets you filter results before or during retrieval. Critical for multi-tenant systems and access control.
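Pre-filtering on metadata is straightforward once chunks carry a metadata dict. A sketch (the field names `dept` and `access` in the usage are illustrative):

```python
def filter_chunks(index, **required):
    """index: list of (chunk_id, vector, metadata) triples.

    Keep only chunks whose metadata matches every required key/value,
    so similarity search runs over an already-authorised subset.
    """
    return [
        (chunk_id, vec, meta)
        for chunk_id, vec, meta in index
        if all(meta.get(key) == value for key, value in required.items())
    ]
```

For access control the filter must run before (or inside) retrieval, not after generation: a chunk the user shouldn't see must never reach the LLM's context.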
Measuring quality
You can't improve what you don't measure. Key metrics for RAG systems:
- Retrieval precision: Are the retrieved chunks actually relevant?
- Retrieval recall: Are we finding all the relevant chunks?
- Answer faithfulness: Does the generated answer accurately reflect the retrieved content?
- Answer relevance: Does the answer actually address the user's question?
- Hallucination rate: How often does the model add information not in the sources?
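The first two metrics are standard precision and recall over chunk ids, computed per test question against a hand-labelled set of relevant chunks:

```python
def precision_recall(retrieved, relevant):
    """Retrieval precision and recall for one query.

    retrieved: chunk ids the system returned.
    relevant: chunk ids a human judged relevant (the ground truth).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Faithfulness and hallucination rate are harder to automate; they typically need an LLM-as-judge or human review, averaged over the evaluation set.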
Build an evaluation dataset early: real questions from real users, paired with expected answers. Run it after every change.
Implementation considerations
- Data residency: For Australian businesses, deploy on AWS Sydney (ap-southeast-2) to keep data in-country.
- Access control: Not all users should see all documents. Implement document-level permissions in metadata.
- Update frequency: How often does your data change? Design your ingestion pipeline accordingly.
- Cost modelling: Embedding generation (one-time per document) + vector DB hosting + LLM API calls (per query). Model your expected query volume.
- Start small: Prove value with one knowledge domain before scaling to the whole organisation.
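The cost model sketched in the list above separates one-time embedding spend from recurring query spend. A back-of-envelope calculator (every rate here is a placeholder, not a real vendor price; plug in your own quotes):

```python
def rag_costs(n_chunks, queries_per_month,
              embed_cost_per_chunk=0.00002,   # placeholder embedding rate
              db_hosting_per_month=70.0,      # placeholder flat hosting fee
              llm_cost_per_query=0.01):       # placeholder per-query LLM spend
    """Return (one-time embedding cost, recurring monthly cost)."""
    one_time = n_chunks * embed_cost_per_chunk
    recurring = db_hosting_per_month + queries_per_month * llm_cost_per_query
    return one_time, recurring
```

The useful insight the model makes visible: embedding is cheap and paid once per document, while LLM calls scale linearly with query volume, so query volume usually dominates at scale.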
Key takeaways
- A RAG system has three main phases: ingest, retrieve, and generate.
- The quality of your retrieval determines the quality of your answers. Garbage in, garbage out.
- Chunking strategy, embedding model, and retrieval method matter more than the LLM you choose.
- Start simple, measure relentlessly, and add complexity only when the metrics justify it.