Process a Long Document with Claude's 200K Context Window

Claude 3.5 Sonnet and Claude 3.5 Haiku both offer a 200,000-token context window, roughly 150,000 words or a 600-page book. This makes Claude well suited to tasks that require processing entire books, legal contracts, codebases, or lengthy research papers in a single request, without chunking. This example shows how to structure a request that pairs a long document with precise extraction instructions.

For long-context tasks, document placement within the request matters. Research shows that LLMs perform best when the question or task appears both before and after the document rather than only at the end. The "lost in the middle" phenomenon, where models recall information in the middle of very long contexts less reliably than information at the edges, is less pronounced in Claude than in many other models, but it is still worth mitigating: place your most important instructions at the start of the system prompt and repeat critical constraints in the final user message.

At 200K tokens, prompt caching becomes economically critical. If you run many questions against the same document, Claude's prompt caching stores the document tokens and charges only $0.30/MTok for cache hits versus $3.00/MTok for regular input processing. Structure requests with the static document before the dynamic question to maximize cache hit rates.
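The ordering described above can be sketched as a small helper that assembles the request body: static document first, dynamic question last, with the critical constraints stated in the system prompt and repeated after the document. The function name and constraint text are illustrative, not part of the Anthropic SDK; the resulting dict follows the Messages API shape.

```python
def build_long_doc_request(document: str, question: str, constraints: str) -> dict:
    """Assemble a request body that sandwiches the document between the
    task instructions and a repeated statement of the critical constraints."""
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        # Most important instructions go first, in the system prompt.
        "system": f"You are a document analyst. {constraints}",
        "messages": [
            {
                "role": "user",
                "content": [
                    # Static document first, so a cached prefix can be
                    # reused across questions about the same document.
                    {"type": "text", "text": document},
                    # Dynamic question last, with constraints repeated to
                    # mitigate "lost in the middle" degradation.
                    {"type": "text",
                     "text": f"{question}\n\nRemember: {constraints}"},
                ],
            }
        ],
    }
```

With the official `anthropic` SDK, the dict can be passed directly as `client.messages.create(**build_long_doc_request(...))`.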

Example
Document: Annual Report 2023 (estimated 45,000 tokens)
System prompt: 680 tokens
User question: 120 tokens
Expected response: 800 tokens

Context window: 200,000 tokens
Model: claude-3-5-sonnet-20241022
Caching: enabled for document content

Calculate: remaining capacity, cache savings at 100 queries, cost comparison vs chunking approach
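The requested figures follow directly from the token counts above. This sketch uses the Sonnet rates quoted earlier ($3.00/MTok input, $0.30/MTok cache reads) and, for simplicity, ignores the one-time cache-write premium and the cost of the smaller system/question/response tokens in the caching comparison.

```python
CONTEXT = 200_000
doc, system, question, response = 45_000, 680, 120, 800

# Remaining capacity after one fully loaded request.
remaining = CONTEXT - (doc + system + question + response)

# Cost of the document tokens across 100 queries, with and without caching.
INPUT_RATE = 3.00 / 1_000_000       # $ per input token
CACHE_READ_RATE = 0.30 / 1_000_000  # $ per cached token read
queries = 100

uncached = queries * doc * INPUT_RATE  # document re-sent in full every time
cached = doc * INPUT_RATE + (queries - 1) * doc * CACHE_READ_RATE
savings = uncached - cached

print(f"remaining capacity: {remaining:,} tokens")
print(f"document cost uncached: ${uncached:.2f}, cached: ${cached:.2f}, "
      f"saved: ${savings:.2f}")
```

At these rates the document alone costs $13.50 uncached over 100 queries versus about $1.47 with caching, leaving roughly 153,400 tokens of headroom per request.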

FAQ

Should I chunk documents or use the full 200K context?
For tasks that require holistic understanding (contract review, legal analysis, book summarization), use the full context; chunking loses relationships between distant parts of the document. For retrieval tasks (finding specific facts), chunking with vector search is more cost-effective and scales to arbitrarily large documents.
What is the "lost in the middle" problem?
LLMs tend to recall information at the beginning and end of their context window more reliably than information in the middle. For long documents, put critical instructions at both the start and end of the context, and prefer specific questions that reference explicit document sections.
How does prompt caching work with long documents?
Claude's prompt caching stores the KV cache for a request prefix. Mark your document with cache_control: {type: "ephemeral"} to cache it for 5 minutes. Subsequent requests with the same cached prefix pay only the cache read price, dramatically reducing cost for repeated queries against the same document.
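As a minimal sketch, marking the document block for caching looks like this. The content-block structure and `cache_control` field follow the Messages API; the document string and question are placeholders.

```python
document_text = "...full 45,000-token report text..."  # placeholder

content = [
    {
        "type": "text",
        "text": document_text,
        # Everything up to and including this block becomes the cached
        # prefix, stored for (by default) 5 minutes.
        "cache_control": {"type": "ephemeral"},
    },
    # The question sits after the cache breakpoint, so it can change
    # between requests without invalidating the cached document.
    {"type": "text", "text": "What were the main revenue drivers?"},
]
```

Each subsequent request whose prefix matches the cached block byte-for-byte pays the cache read rate for those tokens instead of the full input rate.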

Related Examples