Document Chunking Strategies for RAG Applications
How you split documents into chunks is one of the most important decisions in building a RAG (Retrieval-Augmented Generation) application. Poor chunking causes relevant information to be split across chunk boundaries and retrieved incompletely, leading to low-quality answers. This guide covers the main chunking strategies, when to use each, and how to evaluate chunk quality.
Why Chunking Matters
Vector search retrieves the most semantically similar chunks to a query, not the most similar complete documents. If a chunk contains only part of the answer to a query, the retrieved context will be incomplete and the model's response will be inaccurate. Chunking must balance two competing goals: small chunks are retrieved more precisely (a single paragraph about the relevant topic) but may lack surrounding context; large chunks contain more context but are matched less precisely (a 2,000-token chunk may match a query even if only 50 tokens are relevant). The optimal chunk size depends on your document type, query patterns, and the embedding model used.
Fixed-Size Chunking
Fixed-size chunking splits documents at every N tokens regardless of content structure. Typical chunk sizes range from 256 to 1,024 tokens. This approach is simple, predictable, and works well as a baseline. The main drawback is that it frequently cuts sentences or concepts in half at chunk boundaries. Fixed-size chunking is the right starting point for a new project because it requires no linguistic analysis and is language-agnostic. To mitigate boundary effects, add an overlap (10–20% of the chunk size) so each chunk shares some tokens with its neighbours — a 512-token chunk with 64-token overlap means that content near a boundary appears in two chunks and is never completely stranded.
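The overlap scheme above can be sketched in a few lines. This is a minimal illustration that uses whitespace-split words as a stand-in for real tokens; a production pipeline would tokenize with a model-specific tokenizer (e.g. tiktoken) instead.

```python
def chunk_fixed(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size chunks, each sharing
    `overlap` tokens with its predecessor."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # final chunk reached the end of the document
    return chunks

# Words stand in for tokens here purely for illustration.
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_fixed(tokens, chunk_size=512, overlap=64)
```

Because the window advances by `chunk_size - overlap` tokens, the last 64 tokens of each chunk reappear as the first 64 tokens of the next, so content near a boundary is always fully contained in at least one chunk.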
Semantic Chunking
Semantic chunking uses an embedding model to detect topic boundaries by measuring the cosine similarity between adjacent sentences. When the similarity drops below a threshold, a new chunk begins. This produces chunks that correspond to natural topic units rather than arbitrary token counts. Semantic chunks vary in size — a complex technical section may be 800 tokens while a simple definition may be 100 tokens. The approach produces higher retrieval precision but requires running the embedding model over the entire document during indexing, which is slower and more expensive than fixed-size chunking. Use semantic chunking for high-quality corpora (legal documents, technical documentation) where precision matters more than indexing cost.
Recursive and Structure-Aware Chunking
Recursive chunking attempts to split at natural boundaries in order of preference: first at paragraph breaks, then at sentence endings, then at token count. This preserves as much semantic coherence as possible while staying within the target chunk size. Structure-aware chunking extends this concept to documents with explicit structure — HTML, Markdown, code. For Markdown, split at H2 or H3 headings to keep each section in its own chunk. For code, split at function or class boundaries. For PDFs with multiple sections, split at section breaks identified by font size or whitespace patterns. Structure-aware chunking typically produces the highest retrieval precision for well-structured documents.
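The preference order described above (paragraphs, then sentences, then a hard token cut) can be sketched as a small recursive function. This is a simplified illustration: it uses whitespace words as token stand-ins, naive `". "` sentence splitting, and does not merge small adjacent pieces back together, which a production splitter typically would.

```python
def recursive_split(text, max_tokens=512):
    """Split at paragraph breaks first, then sentence endings,
    then hard token cuts, keeping every piece within max_tokens."""
    def too_long(piece):
        return len(piece.split()) > max_tokens

    if not too_long(text):
        return [text]
    # Try separators from coarsest to finest.
    for sep in ("\n\n", ". "):
        parts = [p for p in text.split(sep) if p.strip()]
        if len(parts) > 1:
            out = []
            for part in parts:
                out.extend(recursive_split(part, max_tokens))
            return out
    # Last resort: cut at the token limit regardless of content.
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

text = "a b c d e\n\nf g h i j k l m n o"
pieces = recursive_split(text, max_tokens=6)
```

The short first paragraph survives intact, while the longer second paragraph (which has no sentence breaks) falls through to the hard cut.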
Chunk Metadata and Enrichment
The chunk embedding carries the semantic meaning of the text, but additional metadata is essential for filtering and ranking. Store with each chunk: the source document name and URL; the section or chapter heading; the page number (for PDFs); the chunk's position within the document (for context ranking); and a summary of the chunk generated by an LLM at indexing time. The summary is especially valuable — searching against a metadata summary rather than raw chunk text often improves precision for queries that use different vocabulary than the source document. Rich metadata enables hybrid search: filter by document type or date range, then rank by embedding similarity.
Evaluating Chunk Quality
The best way to evaluate chunking quality is to measure end-to-end RAG performance against a test set of question-answer pairs from your document corpus. For each question, retrieve the top K chunks and check whether the correct answer is present. Retrieval Recall@K (the percentage of questions where the answer appears in the top K chunks) is the primary metric. An alternative evaluation is "context precision": the percentage of retrieved tokens that are actually relevant to the query. High precision and recall together indicate good chunking. Compare chunking strategies on your specific documents and query patterns — the best strategy varies significantly by domain and document type.
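Recall@K as defined above is straightforward to compute once you have, for each test question, the ranked list of retrieved chunk IDs and the set of chunk IDs known to contain the answer. A minimal sketch:

```python
def recall_at_k(results, answer_chunk_ids, k=5):
    """Fraction of questions for which at least one answer-bearing
    chunk appears in the top-K retrieved chunks.

    results:          per-question ranked lists of retrieved chunk IDs
    answer_chunk_ids: per-question lists of IDs that contain the answer
    """
    hits = 0
    for retrieved, answers in zip(results, answer_chunk_ids):
        if any(cid in retrieved[:k] for cid in answers):
            hits += 1
    return hits / len(results)

# Three test questions; the answer chunk lands in the top 2 for two of them.
results = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
answers = [[2], [6], [7]]
score = recall_at_k(results, answers, k=2)  # 2 of 3 hit
```

Running the same harness with different chunking strategies (and different K) is the most direct way to compare them on your own corpus.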
FAQ
- What chunk size works best for most applications?
- 512 tokens with a 10% overlap is a good default starting point. For question-answering over dense technical documents, try 256 tokens for higher precision. For documents with long multi-paragraph explanations, 1,024 tokens may perform better. Always benchmark against your specific query patterns.
- Should I chunk by tokens or by characters?
- Chunk by tokens when using the chunks with an LLM (tokens directly determine cost and context limits). Chunk by characters when using a non-LLM embedding model with character-based input limits. Most production RAG stacks use token-based chunking.
- How do I handle tables and images in documents?
- Tables are best kept as complete chunks regardless of size, since splitting a table breaks its structure. Images should be described using a vision model at indexing time and the description treated as a text chunk. Alternatively, skip images for RAG and note that visual information is not indexed.
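The table-handling advice above — keep tables whole, group ordinary paragraphs up to a token budget — can be sketched for Markdown input. This is a simplified illustration (whitespace words stand in for tokens, and a block counts as a table if every line starts with `|`):

```python
def chunk_keep_tables(text, max_tokens=256):
    """Chunk Markdown text, emitting each table as its own chunk
    regardless of size."""
    chunks, current, count = [], [], 0
    for block in text.split("\n\n"):
        if not block.strip():
            continue
        is_table = all(line.lstrip().startswith("|")
                       for line in block.splitlines() if line.strip())
        n = len(block.split())
        if is_table:
            if current:  # flush accumulated prose first
                chunks.append("\n\n".join(current))
                current, count = [], 0
            chunks.append(block)  # table kept intact, whatever its size
        elif count + n > max_tokens and current:
            chunks.append("\n\n".join(current))
            current, count = [block], n
        else:
            current.append(block)
            count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

text = ("Intro paragraph here.\n\n"
        "| a | b |\n| 1 | 2 |\n\n"
        "Closing paragraph.")
parts = chunk_keep_tables(text)
```

The table becomes its own chunk, with the surrounding prose chunked separately on either side.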