Token Counting for LLMs: The Complete Guide

Token counting is the foundation of working efficiently with large language models. Every input and output you send to a model is measured in tokens — not words, not characters — and understanding the difference directly affects your costs, context usage, and response quality. This guide covers how tokenisation works, why it varies between models, and practical strategies for managing token budgets in production applications.

What is a Token?

A token is a chunk of text that a language model processes as a single unit. Tokens are not the same as words: a single word may be one token, two tokens, or more, depending on the model's vocabulary and whether the word is common or rare. In English, common short words like "the" and "is" are typically one token, while longer or uncommon words are split across multiple tokens. Punctuation, spaces, and special characters are often their own tokens. As a rough heuristic, 1 token ≈ 0.75 words in English, so 1,000 tokens is approximately 750 words.
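The 0.75-words-per-token heuristic is easy to turn into a quick estimator. A minimal sketch for English prose only — the function name and the companion 4-characters-per-token rule of thumb are illustrative, not an official API:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English prose.

    Averages two common rules of thumb: ~0.75 words per token
    and ~4 characters per token.
    """
    if not text:
        return 0
    by_words = len(text.split()) / 0.75
    by_chars = len(text) / 4
    return round((by_words + by_chars) / 2)

print(estimate_tokens("The quick brown fox jumps over the lazy dog."))
```

For billing-critical paths, replace this with the model's real tokeniser; the heuristic exists for dashboards and quick sanity checks.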

How Tokenisation Works

Modern LLMs use subword tokenisation algorithms such as Byte-Pair Encoding (BPE) or SentencePiece. These algorithms start with individual characters and iteratively merge the most frequently occurring adjacent pairs, producing a vocabulary of common subword units. The result is that frequent words and word stems are single tokens, while rare words are split into common component substrings. OpenAI's GPT models use tiktoken (a fast BPE implementation), Claude uses its own Anthropic tokeniser, and Gemini uses SentencePiece. Each tokeniser has a different vocabulary, so the same text produces different token counts on different models — sometimes varying by 10–20%.
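The merge loop at the heart of BPE fits in a few lines. This toy version (illustrative names, whitespace-split words only) learns merge rules from a tiny corpus; production tokenisers like tiktoken operate on bytes with large pre-trained vocabularies, but the core idea is the same:

```python
from collections import Counter

def learn_bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a list of words.

    Each word starts as a sequence of characters; every step merges
    the most frequent adjacent symbol pair into a new symbol.
    """
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        for w in words:
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

# Frequent substrings such as "lo" and "low" become single symbols.
print(learn_bpe_merges(["low", "low", "lower", "lowest", "low"], 3))
```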

Context Windows and Why They Matter

A model's context window is the maximum number of tokens it can process in a single request, counting both input and output. GPT-4o has a 128,000-token context window; Claude 3.5 Sonnet supports 200,000 tokens. When your input plus the requested output exceeds the context window, the API rejects the request with an error; chat interfaces that compensate by silently dropping the oldest messages can produce incoherent responses. In practice, the effective usable context is lower than the advertised limit: performance often degrades as the context fills, and models tend to recall information buried in the middle of very long contexts less reliably than material near the start or end.
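A pre-flight check against the context window is cheap to wire in before every API call. A minimal sketch using the 128,000-token GPT-4o window mentioned above; the function name and safety margin are illustrative choices, not a provider API:

```python
def fits_in_context(input_tokens: int, max_output_tokens: int,
                    context_window: int = 128_000,
                    safety_margin: float = 0.05) -> bool:
    """Return True if input plus expected output fits in the window.

    A small safety margin absorbs tokeniser differences and the
    formatting overhead the API wraps around your messages.
    """
    budget = context_window * (1 - safety_margin)
    return input_tokens + max_output_tokens <= budget

print(fits_in_context(100_000, 4_000))   # comfortably inside
print(fits_in_context(125_000, 8_000))   # would overflow
```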

How Token Counts Affect Costs

LLM APIs charge separately for input tokens and output tokens, with output tokens typically 3–5x more expensive than input tokens. For GPT-4o, input costs $2.50 per million tokens and output costs $10 per million tokens (as of early 2026). For a typical chatbot message of 500 input tokens and 200 output tokens, the cost per message is ($0.00125 + $0.002) = $0.00325. At scale, this adds up quickly: 100,000 daily messages would cost $325 per day, or roughly $10,000 per month. Understanding token counts lets you design prompts that minimise unnecessary verbosity while preserving quality.
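The arithmetic above generalises to a small helper. Default prices are the GPT-4o rates quoted in the text and will drift over time; the function name is illustrative:

```python
def message_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float = 2.50,
                 output_price_per_m: float = 10.00) -> float:
    """Cost in dollars for one request at per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

per_message = message_cost(500, 200)
print(f"per message: ${per_message:.5f}")                      # $0.00325
print(f"per day at 100k messages: ${per_message * 100_000:.2f}")  # $325.00
```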

Counting Tokens in Practice

The most accurate way to count tokens is to use the model provider's official tokeniser library. For OpenAI, the tiktoken library in Python gives exact counts: import tiktoken; enc = tiktoken.encoding_for_model("gpt-4o"); len(enc.encode(text)). For Claude, the Anthropic SDK provides client.messages.count_tokens(). For quick estimates without code, browser-based token counters rely on word- or character-count heuristics that are typically within 10–15% of the true count — close enough for cost planning, but not for hard context-window budgeting. Always measure actual token counts before deploying a new prompt template to production.
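In production you may want exact counts when the tokeniser library is available and a graceful fallback when it is not. A sketch assuming the tiktoken package from the text; the 4-characters-per-token fallback is a rough rule of thumb, not an exact count:

```python
def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Exact token count via tiktoken, or a character-based estimate."""
    try:
        import tiktoken
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))
    except ImportError:
        # Rough fallback: ~4 characters per token for English text.
        return max(1, len(text) // 4) if text else 0

print(count_tokens("Token counting is the foundation of working with LLMs."))
```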

Strategies for Reducing Token Usage

The most effective strategies for reducing token usage are: (1) compress system prompts by removing filler words and using bullet lists instead of prose; (2) request structured output such as JSON instead of free-form explanations — the saving comes from eliminating verbose prose, since JSON's braces and quotes carry their own token overhead; (3) cache frequently used context with the provider's prompt caching feature (available for GPT-4o and Claude), which reduces the effective cost of repeated system prompts; (4) implement retrieval-augmented generation (RAG) to fetch only the relevant document chunks rather than including entire documents in every request; (5) set max_tokens on API calls to cap runaway output costs.
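Strategy (4) can be sketched with a naive relevance score. This toy retriever ranks chunks by word overlap with the query and stops adding chunks once the token budget is spent; real systems score with embeddings, but the budget logic is the same. All names, the sample chunks, and the overlap metric are illustrative:

```python
def select_chunks(query: str, chunks: list[str], token_budget: int) -> list[str]:
    """Greedily pick the most query-relevant chunks that fit the budget.

    Relevance is naive word overlap; token costs use the ~4 chars
    per token heuristic.
    """
    q_words = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)
    selected, used = [], 0
    for chunk in scored:
        cost = max(1, len(chunk) // 4)
        if used + cost <= token_budget:
            selected.append(chunk)
            used += cost
    return selected

chunks = [
    "Refund policy: refunds are issued within 14 days of purchase.",
    "Shipping times vary by region and carrier.",
    "Our office hours are 9am to 5pm on weekdays.",
]
print(select_chunks("how do refunds work", chunks, token_budget=20))
```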

FAQ

Why does the same text produce different token counts on different models?
Each model uses its own tokenisation vocabulary, so the same word may map to one token on GPT-4o but two tokens on Claude. The difference is usually 5–15% for typical English text. Always use the correct tokeniser for the model you are using.
Do code snippets cost more tokens than prose?
It depends on the language. Python code with many common English-like keywords is fairly token-efficient. Languages with many special characters (Regex, SQL with many parentheses, complex JSON) tend to be more token-expensive because special characters are often individual tokens.
How many tokens is a typical PDF page?
A typical text-dense PDF page is approximately 400–600 tokens. A 10-page document is roughly 4,000–6,000 tokens, well within the context windows of modern models. Image-heavy pages contribute fewer tokens since images are handled separately as vision inputs.
What is prompt caching and how much does it save?
Prompt caching lets providers store and reuse the computation for a fixed system prompt across multiple requests. Anthropic offers a 90% discount on cached input tokens. For applications with large, stable system prompts (legal documents, code bases), caching can reduce token costs by 60–80%.
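The savings described above can be computed directly. A sketch using the 90% cached-read discount quoted; the function name is illustrative, and it assumes (simplifying) that the system prompt is written to the cache once at the full input price and read at the discounted rate on every later request:

```python
def cached_input_cost(system_tokens: int, user_tokens: int,
                      requests: int, price_per_m: float = 2.50,
                      cache_discount: float = 0.90) -> tuple[float, float]:
    """(cost without caching, cost with caching) in dollars.

    Only input-token costs are modelled; per-message user text is
    never cached, the shared system prompt is cached after request 1.
    """
    per_token = price_per_m / 1_000_000
    without = (system_tokens + user_tokens) * requests * per_token
    with_cache = (system_tokens * per_token                    # first write
                  + system_tokens * (requests - 1) * per_token
                  * (1 - cache_discount)                       # cached reads
                  + user_tokens * requests * per_token)        # fresh user text
    return without, with_cache

base, cached = cached_input_cost(system_tokens=50_000, user_tokens=500,
                                 requests=1_000)
print(f"without caching: ${base:.2f}, with caching: ${cached:.2f}")
```

The larger and more stable the system prompt relative to per-message text, the closer the saving approaches the discount rate itself.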
