LLM Cost Optimization: A Practical Guide
LLM API costs can grow rapidly as your application scales. A chatbot that costs $50 per day in development can cost $5,000 per day at production scale. This guide covers the full range of cost optimization strategies, from choosing the right model to implementing caching and batching, with realistic estimates of the savings each approach provides.
Model Selection: The Biggest Lever
Choosing the right model is the most impactful cost decision. GPT-4o costs $2.50 per million input tokens; GPT-4o mini costs $0.15 per million, nearly a 17x difference. For many tasks the smaller model is sufficient: summarization, classification, data extraction, and simple question answering. Reserve frontier models for tasks that genuinely require advanced reasoning: complex code generation, multi-step planning, nuanced analysis. A common architecture runs GPT-4o mini for routing, classification, and simple tasks, escalating to GPT-4o only for requests the cheaper model flags as complex. This hybrid approach typically cuts costs by 50–80% compared with always using the frontier model.
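The two-tier routing described above can be sketched as follows. The model names match the prices quoted in this guide; the complexity labels and function names are illustrative assumptions, and in practice the label would come from a cheap classification call.

```python
# Illustrative input prices in dollars per million tokens.
PRICES_PER_M_INPUT = {"gpt-4o": 2.50, "gpt-4o-mini": 0.15}

def route(complexity: str) -> str:
    """Pick a model based on a complexity label from the cheap classifier."""
    return "gpt-4o" if complexity == "complex" else "gpt-4o-mini"

def blended_cost_per_m_tokens(complex_fraction: float) -> float:
    """Average input price per million tokens for a given traffic mix."""
    return (complex_fraction * PRICES_PER_M_INPUT["gpt-4o"]
            + (1 - complex_fraction) * PRICES_PER_M_INPUT["gpt-4o-mini"])
```

With 20% of traffic routed to the frontier model, the blended input price is $0.62 per million tokens, a roughly 75% saving versus always using GPT-4o, consistent with the 50–80% range above.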
Prompt Caching
Prompt caching stores the transformer's key-value state for a prefix of the context, avoiding recomputation on repeated requests. Anthropic bills cached input tokens at 10% of the normal price (a 90% discount), and OpenAI applies automatic caching, typically at a 50% discount, for prompts over 1,024 tokens. To benefit from caching, keep your system prompt stable and put it at the beginning of the context. Cache hit rates of 80%+ are achievable for production chatbots with consistent system prompts. At $3/million for uncached Claude input tokens and $0.30/million for cached, a 100k-token system prompt with an 80% cache hit rate drops the effective input cost from $0.30 to roughly $0.084 per request (0.8 × $0.03 + 0.2 × $0.30).
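The cache economics work out as a simple expected-value calculation. A minimal helper, using the Claude input prices quoted above as default assumptions:

```python
def effective_input_cost(tokens: int, hit_rate: float,
                         price_per_m: float = 3.00,
                         cached_price_per_m: float = 0.30) -> float:
    """Expected input cost per request in dollars, given a cache hit rate.

    Defaults assume the Claude rates quoted in this guide: $3/M uncached,
    $0.30/M for cache reads (10% of the normal price)."""
    full = tokens / 1e6 * price_per_m
    cached = tokens / 1e6 * cached_price_per_m
    return hit_rate * cached + (1 - hit_rate) * full
```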
Reducing Input Token Count
Trim your prompts aggressively. Common sources of wasted tokens include: verbose role descriptions that could be tightened; repeated instructions present in both the system prompt and user message; full JSON objects when only 2-3 fields are needed by the task; long conversation history when a summary would suffice; and boilerplate examples that could be reduced from five to two. Run your most common prompts through a token counter and benchmark quality against shorter versions. A 20% prompt reduction is usually achievable with no quality loss. For RAG applications, semantic similarity scoring of retrieved chunks and keeping only the top 3-5 is often more effective than including the top 10.
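For the RAG case, keeping only the top-scoring chunks is a few lines. This sketch assumes similarity scores already exist from whatever scorer the retrieval pipeline uses; the function name is illustrative.

```python
def top_chunks(chunks: list[str], scores: list[float], k: int = 5) -> list[str]:
    """Keep only the k highest-scoring retrieved chunks.

    Sorting on the score (pair[0]) avoids comparing chunk strings."""
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```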
Output Token Management
Per token, output typically costs 3–5x more than input. Setting max_tokens on every API call prevents runaway costs from unexpectedly long responses. For structured output tasks, output length is predictable: a JSON extraction task with a fixed schema produces roughly the same number of output tokens regardless of input length. For open-ended generation tasks, instruct the model to be concise: "Respond in 3 bullet points" or "Maximum 150 words" reduces output tokens without requiring max_tokens truncation (which can cut off responses mid-sentence). Streaming does not reduce costs but does reduce perceived latency.
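One way to apply this systematically is a per-task-type cap passed as max_tokens on every call. The task names and limits below are hypothetical; tune them against your own output-length logs.

```python
# Hypothetical caps: structured tasks get tight limits; open-ended
# generation gets a generous ceiling that acts only as a safety net.
MAX_TOKENS_BY_TASK = {
    "json_extraction": 300,
    "classification": 10,
    "open_ended": 1024,
}

def max_tokens_for(task_type: str, default: int = 512) -> int:
    """Return the max_tokens cap for a task, with a conservative default."""
    return MAX_TOKENS_BY_TASK.get(task_type, default)
```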
Batching with the Batch API
OpenAI and Anthropic offer batch APIs that process requests asynchronously at a 50% discount. Batch requests are processed within 24 hours, making them suitable for data pipelines, evaluation runs, content generation at scale, and any task that does not require real-time responses. For a workload of 10,000 daily AI-assisted document analyses that do not need immediate results, switching from synchronous to batch reduces the bill by 50%. Implement a queue-based architecture where time-insensitive tasks are routed to batch endpoints and time-sensitive tasks use the synchronous API.
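The queue-based split can be sketched as a single routing function. The request shape and flag name are assumptions; in production the queues would be backed by a job store rather than in-memory lists.

```python
def route_request(req: dict, sync_queue: list, batch_queue: list) -> None:
    """Send time-insensitive work to the batch queue (50% discount);
    everything else goes to the synchronous path. Requests default to
    time-sensitive so nothing silently waits up to 24 hours."""
    if req.get("time_sensitive", True):
        sync_queue.append(req)
    else:
        batch_queue.append(req)
```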
Monitoring and Alerting
Without monitoring, cost anomalies can go undetected for days. Log the token counts (input, output, cache hit/miss) and model used for every API call. Compute cost-per-request for each task type and track it over time. Set budget alerts at 80% of your monthly limit. Identify the top 10% of requests by cost — they often reveal prompt engineering opportunities or unexpected use patterns. Build a cost dashboard that shows daily spend, average cost per user, cost per task type, and cache hit rate. A well-monitored LLM application typically reduces costs by an additional 20–30% simply by identifying and fixing the most expensive outlier requests.
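The per-call cost computation and the outlier search described above can be sketched as follows. The prices are illustrative (GPT-4o-style rates with a 50% cached-input discount assumed); substitute your provider's published rates.

```python
# Illustrative prices in dollars per million tokens.
PRICE = {"input": 2.50, "cached_input": 1.25, "output": 10.00}

def request_cost(input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Cost of one call in dollars, splitting cached from uncached input."""
    uncached = input_tokens - cached_tokens
    return (uncached * PRICE["input"]
            + cached_tokens * PRICE["cached_input"]
            + output_tokens * PRICE["output"]) / 1e6

def top_costly(logs: list[dict], fraction: float = 0.1) -> list[dict]:
    """Return the most expensive `fraction` of logged requests,
    the ones most likely to reveal prompt engineering opportunities."""
    ranked = sorted(logs, key=lambda r: r["cost"], reverse=True)
    return ranked[:max(1, int(len(ranked) * fraction))]
```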
FAQ
- When should I fine-tune a model instead of engineering the prompt?
- Fine-tuning makes sense when you have a large volume of requests with a consistent task pattern (thousands per day), the task is simple enough that a smaller fine-tuned model can match the frontier model's quality, and the upfront fine-tuning cost is recouped within a few months. For most applications, prompt engineering and model selection achieve 90% of the quality at a fraction of the complexity.
- Does reducing prompt quality to save tokens always reduce output quality?
- Not necessarily. Redundant instructions and verbose context can actually reduce quality by diluting the most important parts of the prompt. Shorter, more focused prompts often produce better outputs than long, repetitive ones.
- How do I reduce costs for a high-traffic chatbot?
- Implement conversation summarization to replace long histories, use prompt caching for the system prompt, route simple FAQ questions to a retrieval-only path (no LLM call), and use a smaller model for intent classification before routing to the full model.
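The history-compaction idea in the answer above can be sketched as follows. This is a minimal outline: the placeholder string stands in for a real summary, which would come from a cheap summarization call (not shown), and the message shape assumes the usual role/content format.

```python
def compact_history(messages: list[dict], keep_recent: int = 6) -> list[dict]:
    """Replace older turns with a summary placeholder, keeping the
    most recent `keep_recent` turns verbatim."""
    if len(messages) <= keep_recent:
        return messages
    summary = {"role": "system",
               "content": f"[Summary of {len(messages) - keep_recent} earlier turns]"}
    return [summary] + messages[-keep_recent:]
```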