What is chunking?
Chunking is the process of splitting documents into smaller pieces for storage and retrieval in a RAG system. When a user asks a question, the system searches these chunks to find the most relevant passages, then feeds them to the language model to generate an answer.
The raw input might be a 200-page PDF, a collection of Word documents, or thousands of knowledge base articles. Embedding an entire document as a single vector dilutes its meaning, and sending everything to the LLM is impractical (context windows have limits and every token costs money). Chunking is the bridge between your documents and effective retrieval.
Why chunking matters
Chunking is one of the highest-leverage decisions in a RAG system. Get it right and retrieval is accurate. Get it wrong and the system returns irrelevant passages, misses the right answer, or gives the LLM confusing context.
A chunk that's too large dilutes the signal. It might contain the answer, but also paragraphs of unrelated text that confuse the embedding and the model. A chunk that's too small loses context. The sentence with the answer doesn't make sense without the paragraph around it.
Chunking methods
Fixed-size chunking
Split text into chunks of a fixed number of tokens or characters. Simple to implement. Works reasonably well for homogeneous documents (all similar structure and length).
The problem: it ignores document structure. A chunk might start in the middle of a sentence and end in the middle of another. Paragraph boundaries, section headings, and logical breaks are all lost.
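A minimal sketch of fixed-size chunking, assuming character counts rather than tokens (the function name `fixed_size_chunks` and the sample text are illustrative, not from any library):

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    # Step through the text in strides of `size`; the last chunk may be shorter.
    return [text[i:i + size] for i in range(0, len(text), size)]


sample = "Chunking splits documents. Retrieval searches the chunks."
for chunk in fixed_size_chunks(sample, 20):
    print(repr(chunk))
```

With a size of 20, the first chunk ends partway through the word "documents", which is exactly the boundary problem described above.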
Recursive character splitting
Split by a hierarchy of separators: first by double newlines (paragraphs), then by single newlines, then by sentences, then by words. This preserves structure better than fixed-size while still targeting a specific chunk size.
This is the splitter LangChain recommends for general-purpose text (RecursiveCharacterTextSplitter), and it works well as a baseline. It's fast, predictable, and good enough for many use cases.
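The separator hierarchy can be sketched in plain Python. This is a simplified illustration of the idea, not LangChain's implementation: the function name `recursive_split` is hypothetical, sizes are in characters, and separators at split points are dropped.

```python
def recursive_split(text: str, max_len: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first, falling back to finer ones."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut every max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_len:
            current = candidate  # the piece still fits, keep merging
            continue
        if current.strip():
            chunks.append(current)
        if len(piece) <= max_len:
            current = piece  # start a new chunk with this piece
        else:
            # The piece alone is too long: retry it with the finer separators.
            chunks.extend(recursive_split(piece, max_len, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


sample = ("Para one is short.\n\n"
          "Para two is a bit longer and has two sentences. Here is the second one.\n\n"
          "End.")
for chunk in recursive_split(sample, 40):
    print(repr(chunk))
```

The short first paragraph survives as its own chunk, while the long second paragraph is broken down at sentence and then word boundaries to stay within the limit.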
Semantic chunking
Use the embedding model itself to detect where meaning shifts. Adjacent sentences with similar embeddings stay together. When the embedding distance between consecutive sentences exceeds a threshold, a new chunk starts.
More computationally expensive, but produces chunks that align with actual topics and ideas rather than arbitrary character counts. Best for documents with varied content density.
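The threshold rule can be sketched as follows. To keep the example self-contained, `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the threshold of 0.8 is an arbitrary assumption chosen for this toy data.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def semantic_chunks(sentences: list[str], threshold: float = 0.8,
                    embed=toy_embed) -> list[str]:
    """Start a new chunk when consecutive sentences are further apart than threshold."""
    chunks: list[str] = []
    current: list[str] = []
    previous = None
    for sentence in sentences:
        vector = embed(sentence)
        if current and cosine_distance(previous, vector) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
        previous = vector
    if current:
        chunks.append(" ".join(current))
    return chunks


sentences = [
    "The conveyor belt must be stopped before cleaning.",
    "Cleaning the conveyor belt requires gloves.",
    "Invoices are paid within thirty days.",
    "Late invoices are escalated to finance.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

In a real system you would pass your embedding model's encode function as `embed` and tune the threshold on your own documents.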
Document-structure-aware chunking
Use the document's own structure: headings, sections, tables, lists. Each section becomes a chunk (or is split further if too large). This preserves the author's intended organisation.
Works especially well for structured documents like SOPs, policies, technical manuals, and legal documents. Requires a parser that understands the document format (HTML, Markdown, PDF structure).
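For Markdown input, section-based chunking might look like the sketch below. It assumes ATX-style `#` headings and does not handle headings inside code fences; `split_by_headings` is an illustrative name, not a library function.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def _append_section(sections: list[dict], heading, body: list[str]) -> None:
    text = "\n".join(body).strip()
    # Skip the empty preamble before the first heading, if any.
    if heading is not None or text:
        sections.append({"heading": heading, "text": text})


def split_by_headings(markdown: str) -> list[dict]:
    """Return one chunk per heading, keeping the heading as metadata."""
    sections: list[dict] = []
    heading = None
    body: list[str] = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            _append_section(sections, heading, body)
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    _append_section(sections, heading, body)
    return sections


doc = "# Safety\nWear gloves.\n\n## Lockout\nIsolate power first.\nTag the switch.\n"
for section in split_by_headings(doc):
    print(section)
```

Sections that come out larger than your target size can then be passed to a size-based splitter.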
Agentic/contextual chunking
Use an LLM to read each chunk and generate a contextual summary or label, then prepend that to the chunk before embedding. The chunk "The temperature must not exceed 45°C" becomes "Safety requirement for cold storage facility: The temperature must not exceed 45°C." This enriches the embedding with context that the raw text alone might not carry.
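The prepend step itself is small. In this sketch the LLM call is injected as a `describe` callable, and `fake_describe` is a hard-coded, hypothetical stand-in so the example runs without any API; a real system would call its own model there.

```python
from typing import Callable


def contextualize(chunk: str, document: str,
                  describe: Callable[[str, str], str]) -> str:
    """Prepend a short context label to the chunk before embedding it."""
    context = describe(document, chunk).strip()
    return f"{context}: {chunk}" if context else chunk


def fake_describe(document: str, chunk: str) -> str:
    # Hypothetical stand-in for an LLM call that summarises where the chunk sits.
    return "Safety requirement for cold storage facility"


enriched = contextualize("The temperature must not exceed 45°C.",
                         "(full document text)", fake_describe)
print(enriched)
```

Only the enriched text needs to be embedded; you may still want to store the original chunk for display.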
Choosing chunk size
There's no universal best size. It depends on your embedding model, your documents, and the types of queries you expect.
General guidelines
- 200-500 tokens: Good for precise factual retrieval. "What is the melting point of X?" The answer is in a specific sentence, and small chunks make it easier to find.
- 500-1000 tokens: Good for questions that need context. "Explain the approval process for capital expenditure." The answer spans several paragraphs.
- 1000-2000 tokens: Good for complex topics that need extensive context. Risk: larger chunks are less precise for simple factual questions.
The trade-off
| Smaller chunks | Larger chunks |
|---|---|
| More precise retrieval | More context per chunk |
| Higher recall (find the right passage) | Fewer chunks needed per query |
| Risk: lose surrounding context | Risk: dilute the signal with irrelevant text |
| More chunks to store and search | Fewer chunks but less granular |
Start with 500-800 tokens and test. Adjust based on retrieval quality for your specific queries.
Overlap strategies
Overlap means including some text from the end of one chunk at the beginning of the next. If your chunk size is 500 tokens and overlap is 50, the last 50 tokens of chunk N appear at the start of chunk N+1.
Why? Because important information often falls at chunk boundaries. Without overlap, a sentence split across two chunks might not be retrievable at all. Overlap ensures continuity.
Typical overlap: 10-20% of chunk size. Too much overlap wastes storage and slows search. Too little defeats the purpose.
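The 500/50 example above can be sketched as a sliding window. The token list here is synthetic; in practice it would come from your embedding model's tokenizer, and `chunk_with_overlap` is an illustrative name.

```python
def chunk_with_overlap(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Slide a window of `size` tokens forward by `size - overlap` each step."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_with_overlap(tokens, size=500, overlap=50)
print(len(chunks))                         # number of chunks
print(chunks[0][-50:] == chunks[1][:50])   # overlap is shared between neighbours
```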
Metadata enrichment
A chunk alone is just text. Metadata makes it useful:
- Source document. Which file or page this came from. Essential for citations.
- Section heading. What section of the document this belongs to. Helps users navigate to the full context.
- Page number. For PDFs and long documents. Lets users verify the answer against the original.
- Document type. Policy, SOP, manual, email, meeting notes. Enables filtering by document category.
- Date. When the document was published or last updated. Critical for time-sensitive information.
- Access level. Who should be able to see this chunk. Enables role-based access in multi-tenant systems.
Metadata also enables metadata-filtered search, sometimes loosely called hybrid search: combine semantic similarity with metadata filters. "Find the safety procedure (document type = SOP) for conveyor belts (semantic match) from the last 12 months (date filter)."
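The filter-then-rank idea in the example query above can be sketched as below. The `Chunk` fields, file names, and dates are made up for illustration, and `word_overlap` is a crude stand-in for embedding similarity.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Chunk:
    text: str
    source: str
    doc_type: str
    updated: date


def word_overlap(query: str, text: str) -> float:
    """Toy relevance score: fraction of query words present in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def filtered_search(chunks: list[Chunk], query: str, doc_type: str,
                    updated_since: date, top_k: int = 3) -> list[Chunk]:
    # Apply the metadata filters first, then rank what is left by relevance.
    candidates = [c for c in chunks
                  if c.doc_type == doc_type and c.updated >= updated_since]
    return sorted(candidates, key=lambda c: word_overlap(query, c.text),
                  reverse=True)[:top_k]


chunks = [
    Chunk("Stop the conveyor belt before clearing a jam.", "sop-12.pdf", "SOP", date(2024, 5, 1)),
    Chunk("Conveyor belt safety was discussed in the meeting.", "notes.docx", "meeting notes", date(2024, 6, 1)),
    Chunk("Conveyor belt guards must be inspected weekly.", "sop-03.pdf", "SOP", date(2021, 1, 15)),
]
results = filtered_search(chunks, "conveyor belt safety procedure", "SOP", date(2023, 7, 1))
print([c.source for c in results])
```

Most vector databases offer an equivalent filter option at query time, so the filtering usually happens inside the store rather than in application code.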
Evaluating chunk quality
The only real measure of chunking quality is retrieval quality. Does the system return the right chunks for a given query?
- Build a test set of 50-100 question-answer pairs with known source passages.
- Run the queries and check: is the correct passage in the top 3-5 retrieved chunks?
- Measure recall@k (the percentage of test questions whose correct passage appears in the top k results).
- Iterate: change chunk size, method, or overlap, re-run, and compare.
This is unglamorous work, but it's the difference between a RAG system that works and one that frustrates users.
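The evaluation loop above reduces to a small function. This sketch assumes each test question has exactly one known source chunk, identified by an ID; the IDs and results below are made up for illustration.

```python
def recall_at_k(results: dict[str, list[str]], expected: dict[str, str], k: int) -> float:
    """Fraction of questions whose expected chunk ID is in the top k results."""
    if not expected:
        return 0.0
    hits = sum(1 for question, gold in expected.items()
               if gold in results.get(question, [])[:k])
    return hits / len(expected)


# Expected source chunk per question, and the ranked chunk IDs the system returned.
expected = {"q1": "c1", "q2": "c7", "q3": "c9", "q4": "c2"}
results = {
    "q1": ["c1", "c4"],
    "q2": ["c3", "c7"],
    "q3": ["c5", "c6", "c9"],
    "q4": ["c8"],
}
print(recall_at_k(results, expected, k=1))
print(recall_at_k(results, expected, k=3))
```

Re-running the same test set after each chunking change gives you a like-for-like comparison.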
Practical tips
- Pre-process before chunking. Strip headers, footers, page numbers, and irrelevant boilerplate. These add noise to embeddings.
- Handle tables separately. Tables are structured data. Chunk them as complete units or convert to text descriptions. Don't let a table get split across chunks.
- Treat images and diagrams deliberately. Either exclude them (if text-only is enough) or use vision models to generate text descriptions that get chunked alongside the surrounding text.
- Different document types may need different strategies. SOPs chunk well by section. Emails chunk well by message. Legal contracts need clause-level chunking. Don't force one approach on everything.
- Re-chunk when you change embedding models. Different models have different optimal token ranges. If you switch from a model that handles 512 tokens well to one optimised for 8192, your chunking should change too.
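As one example of the pre-processing tip, the sketch below strips page-number lines and lines that repeat across pages (typical running headers and footers). The regex and the `min_repeats` value are assumptions to adapt to your documents; note that a content line consisting only of a number would also be removed.

```python
import re
from collections import Counter

PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s+of\s+\d+)?\s*$", re.IGNORECASE)


def strip_boilerplate(pages: list[str], min_repeats: int = 3) -> list[str]:
    """Remove page-number lines and lines that recur on many pages."""
    # Count on how many pages each distinct non-empty line appears.
    line_pages: Counter = Counter()
    for page in pages:
        line_pages.update({line.strip() for line in page.splitlines() if line.strip()})
    repeated = {line for line, n in line_pages.items() if n >= min_repeats}

    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines()
                if line.strip() not in repeated and not PAGE_NUMBER.match(line)]
        cleaned.append("\n".join(kept).strip())
    return cleaned


pages = [
    "ACME Safety Manual\nWear gloves.\nPage 1 of 3",
    "ACME Safety Manual\nIsolate power.\nPage 2 of 3",
    "ACME Safety Manual\nTag the switch.\n3",
]
print(strip_boilerplate(pages))
```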
FAQ
What's the best chunking method?
There isn't one. Document-structure-aware chunking performs best for well-structured documents (SOPs, policies, manuals). Semantic chunking works better for varied or loosely structured content. Recursive character splitting is a solid default when you're starting out.
Does chunk size affect LLM costs?
Yes. Larger chunks mean more tokens sent to the LLM per query, which means higher API costs. If you retrieve 5 chunks of 1000 tokens each, that's 5000 tokens of context per query. At scale, this adds up. Smaller chunks reduce per-query cost but may require retrieving more chunks for complex questions.
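The arithmetic can be made explicit. The query volume and the price per million tokens below are hypothetical placeholders, not any provider's real rates; substitute your own numbers.

```python
def daily_context_cost(chunks_per_query: int, tokens_per_chunk: int,
                       queries_per_day: int, usd_per_million_tokens: float) -> float:
    """Estimate the daily cost of retrieved context sent to the LLM."""
    tokens_per_query = chunks_per_query * tokens_per_chunk
    return tokens_per_query * queries_per_day / 1_000_000 * usd_per_million_tokens


# Hypothetical figures: 10,000 queries per day at $1.00 per million input tokens.
large = daily_context_cost(5, 1000, 10_000, 1.00)
small = daily_context_cost(5, 300, 10_000, 1.00)
print(f"1000-token chunks: ${large:.2f}/day")
print(f"300-token chunks:  ${small:.2f}/day")
```

This counts retrieved context only; the prompt template, the question, and the generated answer add further tokens.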
Can I use multiple chunk sizes?
Yes. Some systems index the same document at multiple granularities: sentence-level for precise factual retrieval and section-level for broader context. The retrieval system searches both and merges results. This "multi-scale" approach adds complexity but can improve retrieval quality for diverse query types.
How do I handle scanned PDFs?
Scanned PDFs need OCR (optical character recognition) before chunking. The quality of your OCR directly affects everything downstream. Use a good OCR engine, clean the output, and spot-check results. Poor OCR produces chunks full of garbled text that embeddings can't make sense of.
Key takeaways
- Chunking quality directly determines retrieval quality. Bad chunks produce bad answers.
- There is no universal best chunk size. It depends on your documents, your queries, and your embedding model.
- Semantic chunking (splitting by meaning) tends to produce more coherent chunks than fixed-size splitting, at a higher compute cost. Recursive character splitting remains a solid default.
- Metadata (source document, section heading, page number) is as important as the chunk text itself.