Clean and Format a Messy Prompt
Prompts written in word processors or copied from websites often contain invisible formatting issues that silently degrade LLM performance. Smart quotes (“”) instead of straight quotes (""), em dashes (—) instead of double hyphens, non-breaking spaces, and extra blank lines all affect how models parse instructions, and some can trigger encoding errors when the prompt is embedded in an API request body that is not handled as UTF-8. This example shows a typical messy prompt copied from a Google Doc and demonstrates how the formatter normalizes it to clean ASCII text ready for API submission.

The most common issue is smart quotes. When a prompt contains role-play instructions like Tell the model to respond as a “helpful assistant”, the curly quotes may tokenize as unexpected characters depending on the model’s tokenizer. In code generation prompts, a function name like ‘parseJSON’ wrapped in curly single quotes can cause the generated code to fail if the model copies those characters into a string literal. The formatter replaces all curly quotes with their straight equivalents.

Extra whitespace is the second most common issue. Multiple consecutive spaces are collapsed, trailing spaces at line ends are removed, and runs of blank lines are reduced to a single blank line. This produces a compact, predictable prompt that tokenizes consistently across runs.
You are a helpful assistant that answers questions about software development. When the user asks a question, respond with a clear explanation. Always use bullet points for lists, and wrap code in backticks like `this`. Do NOT include unnecessary preamble — get straight to the answer. If you don’t know the answer, say “I don’t know” rather than guessing.
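The normalization described above can be sketched in a few lines of Python. This is an illustrative implementation, not the actual formatter: the `REPLACEMENTS` table and the `clean_prompt` name are hypothetical, and the blank-line rule collapses any run of blank lines to a single one.

```python
import re

# Hypothetical character map: curly quotes, dashes, and invisible
# whitespace replaced with plain ASCII equivalents.
REPLACEMENTS = {
    "\u201c": '"', "\u201d": '"',  # curly double quotes -> straight
    "\u2018": "'", "\u2019": "'",  # curly single quotes -> straight
    "\u2014": "--",                # em dash -> double hyphen
    "\u2013": "-",                 # en dash -> hyphen
    "\u2026": "...",               # ellipsis -> three dots
    "\u00a0": " ",                 # non-breaking space -> space
    "\u200b": "",                  # zero-width space -> removed
}

def clean_prompt(text: str) -> str:
    """Normalize a messy prompt to compact ASCII text."""
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    # Remove trailing spaces at the end of each line.
    text = "\n".join(line.rstrip() for line in text.splitlines())
    # Collapse runs of spaces or tabs within a line to one space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Reduce any run of blank lines to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Running the messy prompt above through `clean_prompt` would yield straight quotes around “helpful assistant” and “I don’t know”, with no stray whitespace.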
FAQ
- Why do smart quotes cause problems in prompts?
- Smart quotes are Unicode characters (“”‘’) that can tokenize unexpectedly. They can also cause encoding errors if the prompt is embedded in an API request body, since they fall outside the ASCII range and depend on the body being correctly encoded as UTF-8.
- Does extra whitespace affect LLM output quality?
- Extra spaces and blank lines increase token count and can disrupt the model's attention patterns, especially for instruction-following tasks. Clean, consistent formatting produces more reliable results.
- What other characters should I watch out for?
- Em dashes (—), en dashes (–), ellipses (…), non-breaking spaces (\u00A0), and zero-width spaces (\u200B) are all common culprits when copying from documents or web pages.
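A quick way to audit a prompt for the characters listed above before sending it to an API is a simple lookup pass. The `SUSPECT` table and `audit_prompt` function below are an illustrative sketch, not part of any real library:

```python
# Hypothetical audit table covering the common culprits named above.
SUSPECT = {
    "\u201c": "left double quote",
    "\u201d": "right double quote",
    "\u2018": "left single quote",
    "\u2019": "right single quote",
    "\u2014": "em dash",
    "\u2013": "en dash",
    "\u2026": "ellipsis",
    "\u00a0": "non-breaking space",
    "\u200b": "zero-width space",
}

def audit_prompt(text: str) -> list[tuple[int, str, str]]:
    """Return (index, character, description) for each suspect character."""
    return [(i, ch, SUSPECT[ch])
            for i, ch in enumerate(text) if ch in SUSPECT]
```

For example, auditing the string `"don’t guess"` (with a curly apostrophe and a non-breaking space) reports both offending characters with their positions, which makes it easy to log or reject dirty prompts in a pipeline.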
Related Examples
- Split a Long Document for AI Processing
- Clean HTML-Heavy Text for AI Processing