Chunking Strategies for RAG
How to split documents for retrieval-augmented generation. Chunking methods, size trade-offs, overlap strategies, and metadata enrichment for better retrieval quality.
Chunking is the process of splitting documents into smaller pieces for storage and retrieval in a RAG system. When a user asks a question, the system searches these chunks to find the most relevant passages, then feeds them to the language model to generate an answer.
The raw input might be a 200-page PDF, a collection of Word documents, or thousands of knowledge base articles. Embedding an entire document as a single vector dilutes its meaning, and sending everything to the LLM is impractical because context windows are limited and every token costs money. Chunking is the bridge between your documents and effective retrieval.
Chunking is one of the highest-leverage decisions in a RAG system. Get it right and retrieval is accurate. Get it wrong and the system returns irrelevant passages, misses the right answer, or gives the LLM confusing context.
A chunk that's too large dilutes the signal. It might contain the answer, but also paragraphs of unrelated text that confuse the embedding and the model. A chunk that's too small loses context. The sentence with the answer doesn't make sense without the paragraph around it.
Fixed-size chunking splits text into chunks of a fixed number of tokens or characters. It is simple to implement and works reasonably well for homogeneous documents (all of similar structure and length).
The problem: it ignores document structure. A chunk might start in the middle of a sentence and end in the middle of another. Paragraph boundaries, section headings, and logical breaks are all lost.
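A minimal sketch of fixed-size splitting by character count makes the boundary problem easy to see. The function name `fixed_size_chunks` and the sample sentence are illustrative assumptions; a production system would more likely count tokens with the tokenizer of its embedding model.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    # Step through the text in strides of `size`; the last piece may be shorter.
    return [text[i:i + size] for i in range(0, len(text), size)]


# Illustrative input: note that the cuts ignore word and sentence boundaries.
text = "Lock out the conveyor before servicing. Verify zero energy state."
for chunk in fixed_size_chunks(text, 24):
    print(repr(chunk))
```

Printing the chunks with `repr` shows where each cut lands, which is often in the middle of a word.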
Recursive character splitting works through a hierarchy of separators: first double newlines (paragraphs), then single newlines, then sentences, then words. This preserves structure better than fixed-size chunking while still targeting a specific chunk size.
This is the recommended general-purpose splitter in frameworks like LangChain and works well as a baseline. It's fast, predictable, and good enough for many use cases.
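The separator hierarchy can be sketched in plain Python. This is a simplified illustration of the idea, not LangChain's implementation: the separator list, the function name `recursive_split`, and the character-based size limit are all assumptions, and separators at chunk boundaries are simply dropped.

```python
SEPARATORS = ["\n\n", "\n", ". ", " "]  # coarsest to finest


def recursive_split(text: str, max_len: int, separators=SEPARATORS) -> list[str]:
    """Split on the coarsest separator, recursing only into pieces still too long."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separator left to try: fall back to a hard character cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_len:
            current = candidate  # keep merging small pieces up to the target size
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            # This piece alone is too long: retry it with the finer separators.
            chunks.extend(recursive_split(piece, max_len, finer))
    if current.strip():
        chunks.append(current)
    return chunks


doc = (
    "Para one is short.\n\n"
    "Para two is a bit longer and has two sentences. Here is the second one.\n\n"
    "End."
)
chunks = recursive_split(doc, max_len=40)
```

Short paragraphs survive intact, and only the paragraph that exceeds the limit is broken down further.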
Semantic chunking uses the embedding model itself to detect where meaning shifts. Adjacent sentences with similar embeddings stay together. When the embedding distance between consecutive sentences exceeds a threshold, a new chunk starts.
More computationally expensive, but produces chunks that align with actual topics and ideas rather than arbitrary character counts. Best for documents with varied content density.
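The threshold rule can be sketched as follows. To keep the example self-contained, `toy_embed` is a bag-of-words stand-in for a real embedding model, and the threshold of 0.7 is an arbitrary assumption; with real embeddings the distances and a sensible threshold would differ.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_chunks(sentences: list[str], threshold: float) -> list[list[str]]:
    """Start a new chunk when consecutive sentences are further apart than `threshold`."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev = toy_embed(sentences[0])
    for sentence in sentences[1:]:
        cur = toy_embed(sentence)
        if cosine_distance(prev, cur) > threshold:
            chunks.append([])  # meaning shifted: open a new chunk
        chunks[-1].append(sentence)
        prev = cur
    return chunks


sentences = [
    "The conveyor belt must be locked out.",
    "Lock out the conveyor belt before cleaning.",
    "Invoices are paid within thirty days.",
    "Late invoices are paid with interest.",
]
groups = semantic_chunks(sentences, threshold=0.7)
```

With these sample sentences the two conveyor sentences stay together and the two invoice sentences form a second chunk, because the only large distance is at the topic change.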
Document-structure-aware chunking uses the document's own structure: headings, sections, tables, lists. Each section becomes a chunk (or is split further if too large). This preserves the author's intended organisation.
Works especially well for structured documents like SOPs, policies, technical manuals, and legal documents. Requires a parser that understands the document format (HTML, Markdown, PDF structure).
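For Markdown input, a structure-aware splitter can be as small as the sketch below. It assumes ATX-style `#` headings and does not handle `#` lines inside fenced code blocks; the function name `split_by_headings` and the sample document are illustrative.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> list[dict]:
    """Return one chunk per heading-delimited section, keeping the heading as a title."""
    sections, title, body = [], None, []

    def flush():
        # Emit the section collected so far, skipping an empty preamble.
        if title is not None or any(line.strip() for line in body):
            sections.append({"title": title, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return sections


sop = "# Scope\nApplies to all sites.\n\n## Lockout\nIsolate power.\nAttach tag.\n"
sections = split_by_headings(sop)
```

Each returned section keeps its heading as a title, which can later be stored as metadata or prepended to the chunk text. Sections that are still too large would be passed to a size-based splitter.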
Use an LLM to read each chunk and generate a contextual summary or label, then prepend that to the chunk before embedding. The chunk "The temperature must not exceed 45°C" becomes "Safety requirement for cold storage facility: The temperature must not exceed 45°C." This enriches the embedding with context that the raw text alone might not carry.
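The prepend step itself is simple; the hard part is the LLM call that produces the label. In the sketch below, `fake_describe` is a hypothetical stand-in that returns a fixed label, where a real system would prompt a model with the chunk and usually some of the surrounding document.

```python
from typing import Callable


def enrich_chunk(chunk: str, describe: Callable[[str], str]) -> str:
    """Prepend a short context label to the chunk before it is embedded."""
    label = describe(chunk).strip().rstrip(":")
    return f"{label}: {chunk}" if label else chunk


def fake_describe(chunk: str) -> str:
    # Hypothetical stand-in for an LLM call; returns a hard-coded label.
    return "Safety requirement for cold storage facility"


enriched = enrich_chunk("The temperature must not exceed 45°C.", fake_describe)
print(enriched)
```

Passing the labelling function as a parameter keeps the enrichment step testable without a model and makes it easy to swap providers.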
There's no universal best size. It depends on your embedding model, your documents, and the types of queries you expect.
| Smaller chunks | Larger chunks |
|---|---|
| More precise retrieval | More context per chunk |
| Higher recall (find the right passage) | Fewer chunks needed per query |
| Risk: lose surrounding context | Risk: dilute the signal with irrelevant text |
| More chunks to store and search | Fewer chunks but less granular |
Start with 500-800 tokens and test. Adjust based on retrieval quality for your specific queries.
Overlap means including some text from the end of one chunk at the beginning of the next. If your chunk size is 500 tokens and overlap is 50, the last 50 tokens of chunk N appear at the start of chunk N+1.
Why? Because important information often falls at chunk boundaries. Without overlap, a sentence split across two chunks might not be retrievable at all. Overlap ensures continuity.
Typical overlap: 10-20% of chunk size. Too much overlap wastes storage and slows search. Too little defeats the purpose.
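The 500/50 example can be written as a sliding window whose step is the chunk size minus the overlap. This sketch assumes the text is already tokenized into a list; the function name `chunk_with_overlap` and the numeric stand-in tokens are illustrative.

```python
def chunk_with_overlap(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Slide a window of `size` tokens, advancing by `size - overlap` each step."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Stand-in tokens "0".."1199" so the overlap is easy to inspect.
tokens = [str(i) for i in range(1200)]
windows = chunk_with_overlap(tokens, size=500, overlap=50)
```

Here the last 50 tokens of the first window are the first 50 tokens of the second, and the final window is shorter because the text runs out.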
A chunk alone is just text. Metadata such as document type and date makes it useful.
Metadata also enables filtered search (sometimes loosely called hybrid search): combine semantic similarity with metadata filters. "Find the safety procedure (document type = SOP) for conveyor belts (semantic match) from the last 12 months (date filter)."
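The conveyor-belt query can be sketched as a filter-then-rank step. The `Chunk` fields, the function name, and the hard-coded similarity scores are assumptions for illustration; in practice the scores come from a vector search and most vector databases apply such filters natively.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Chunk:
    text: str
    doc_type: str
    updated: date
    similarity: float  # assumed output of a vector search; hard-coded below


def filtered_search(chunks, doc_type, max_age_days, today, top_k=3):
    """Keep chunks that pass the metadata filters, then rank them by similarity."""
    cutoff = today - timedelta(days=max_age_days)
    eligible = [c for c in chunks if c.doc_type == doc_type and c.updated >= cutoff]
    return sorted(eligible, key=lambda c: c.similarity, reverse=True)[:top_k]


corpus = [
    Chunk("Conveyor lockout steps", "SOP", date(2024, 3, 1), 0.82),
    Chunk("Old conveyor lockout steps", "SOP", date(2021, 1, 15), 0.91),
    Chunk("Conveyor purchasing policy", "policy", date(2024, 5, 1), 0.88),
    Chunk("Belt tension checks", "SOP", date(2023, 11, 20), 0.64),
]
results = filtered_search(corpus, "SOP", max_age_days=365, today=date(2024, 6, 1))
```

The outdated SOP has the highest similarity score but is excluded by the date filter, which is exactly the kind of mistake that similarity alone cannot prevent.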
The only real measure of chunking quality is retrieval quality. Does the system return the right chunks for a given query?
This is unglamorous work, but it's the difference between a RAG system that works and one that frustrates users.
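One simple way to put a number on retrieval quality is hit rate at k: the share of test queries for which at least one known-relevant chunk appears in the top k results. The sketch below assumes you have a small hand-labelled set of queries and relevant chunk ids; the ids and the function name `hit_rate_at_k` are illustrative.

```python
def hit_rate_at_k(
    retrieved: dict[str, list[str]],
    relevant: dict[str, set[str]],
    k: int,
) -> float:
    """Fraction of queries with at least one relevant chunk id in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(
        1
        for query, expected in relevant.items()
        if expected & set(retrieved.get(query, [])[:k])
    )
    return hits / len(relevant)


# Hand-labelled ground truth and ranked retrieval output (illustrative ids).
relevant = {"q1": {"c3"}, "q2": {"c7", "c8"}, "q3": {"c1"}}
retrieved = {"q1": ["c3", "c9"], "q2": ["c2", "c8"], "q3": ["c4", "c5", "c1"]}

score_at_2 = hit_rate_at_k(retrieved, relevant, k=2)
score_at_3 = hit_rate_at_k(retrieved, relevant, k=3)
```

Re-running the same labelled queries after each change to chunk size, overlap, or method gives a like-for-like comparison instead of an impression.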
There is no single best chunking method. Document-structure-aware chunking performs best for well-structured documents (SOPs, policies, manuals). Semantic chunking works better for varied or loosely structured content. Recursive character splitting is a solid default when you're starting out.
Chunk size does affect cost. Larger chunks mean more tokens sent to the LLM per query, which means higher API costs. If you retrieve 5 chunks of 1000 tokens each, that's 5000 tokens of context per query. At scale, this adds up. Smaller chunks reduce per-query cost but may require retrieving more chunks for complex questions.
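The arithmetic is easy to script so you can compare configurations. The query volume and the price per million tokens below are placeholder assumptions, not real pricing; substitute your provider's current input-token rate.

```python
def context_tokens_per_query(chunks_retrieved: int, tokens_per_chunk: int) -> int:
    """Context tokens sent to the LLM for one query."""
    return chunks_retrieved * tokens_per_chunk


def monthly_context_cost(
    tokens_per_query: int,
    queries_per_month: int,
    price_per_million_tokens: float,
) -> float:
    """Monthly cost of retrieved context only (excludes prompt and output tokens)."""
    return tokens_per_query * queries_per_month * price_per_million_tokens / 1_000_000


per_query = context_tokens_per_query(chunks_retrieved=5, tokens_per_chunk=1000)
# Placeholder volume and price, chosen only to make the arithmetic visible.
cost = monthly_context_cost(per_query, queries_per_month=100_000, price_per_million_tokens=1.0)
```

Halving the chunk size halves this figure only if the number of retrieved chunks stays the same, which is the trade-off described above.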
Chunk sizes can also be mixed. Some systems index the same document at multiple granularities: sentence-level for precise factual retrieval and section-level for broader context. The retrieval system searches both and merges results. This "multi-scale" approach adds complexity but can improve retrieval quality for diverse query types.
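The merge step might look like the sketch below. It assumes each index returns `(chunk_id, score)` pairs and that scores from the two indexes are directly comparable, which often requires normalisation in practice; the ids and the function name are illustrative.

```python
def merge_multi_scale(sentence_hits, section_hits, top_k=5):
    """Merge (chunk_id, score) hits from two indexes, keeping the best score per id."""
    best = {}
    for chunk_id, score in list(sentence_hits) + list(section_hits):
        if score > best.get(chunk_id, float("-inf")):
            best[chunk_id] = score
    # Highest-scoring chunks first, regardless of which index they came from.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:top_k]


sentence_hits = [("sent-12", 0.91), ("sent-40", 0.62)]
section_hits = [("sec-3", 0.78), ("sec-9", 0.55)]
merged = merge_multi_scale(sentence_hits, section_hits, top_k=3)
```

A fuller implementation might also drop a sentence-level hit when its parent section is already in the result set, to avoid sending the same text to the LLM twice.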
Scanned PDFs need OCR (optical character recognition) before chunking. The quality of your OCR directly affects everything downstream. Use a good OCR engine, clean the output, and spot-check results. Poor OCR produces chunks full of garbled text that embeddings can't make sense of.