Clean and Format a Messy Prompt
Prompts written in word processors or copied from websites often contain invisible formatting issues that silently degrade LLM performance. Smart quotes (“”) instead of straight quotes (""), em dashes (—) instead of double hyphens, non-breaking spaces, and extra blank lines all affect how models parse instructions, and some can trigger encoding errors when the prompt is embedded in an API request body that is not handled as UTF-8. This example shows a typical messy prompt copied from a Google Doc and demonstrates how the formatter normalizes it to clean ASCII text ready for API submission.

The most common issue is smart quotes. When a prompt contains role-play instructions like Tell the model to respond as a “helpful assistant”, the curly quotes may tokenize as unexpected characters depending on the model’s tokenizer. In code generation prompts, a function name like ‘parseJSON’ wrapped in curly single quotes can cause the generated code to fail if the model copies those characters into a string literal. The formatter replaces all curly quotes with their straight equivalents.

Extra whitespace is the second most common issue. Multiple consecutive spaces are collapsed, trailing spaces at line ends are removed, and runs of blank lines are reduced to a single blank line. This produces a compact, predictable prompt that tokenizes consistently across runs.
You are a helpful assistant that answers questions about software development. When the user asks a question, respond with a clear explanation. Always use bullet points for lists, and wrap code in backticks like `this`. Do NOT include unnecessary preamble — get straight to the answer. If you don’t know the answer, say “I don’t know” rather than guessing.
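The normalization described above can be sketched in a few lines of Python. This is an illustrative implementation, not the actual formatter: the `REPLACEMENTS` table and the `clean_prompt` name are hypothetical, and the blank-line rule collapses any run of blank lines to a single one.

```python
import re

# Hypothetical character map: curly quotes, dashes, and invisible
# whitespace replaced with plain ASCII equivalents.
REPLACEMENTS = {
    "\u201c": '"', "\u201d": '"',  # curly double quotes -> straight
    "\u2018": "'", "\u2019": "'",  # curly single quotes -> straight
    "\u2014": "--",                # em dash -> double hyphen
    "\u2013": "-",                 # en dash -> hyphen
    "\u2026": "...",               # ellipsis -> three dots
    "\u00a0": " ",                 # non-breaking space -> space
    "\u200b": "",                  # zero-width space -> removed
}

def clean_prompt(text: str) -> str:
    """Normalize a messy prompt to compact ASCII text."""
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    # Remove trailing spaces at the end of each line.
    text = "\n".join(line.rstrip() for line in text.splitlines())
    # Collapse runs of spaces or tabs within a line to one space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Reduce any run of blank lines to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Running the messy prompt above through `clean_prompt` would yield straight quotes around “helpful assistant” and “I don’t know”, with no stray whitespace.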
FAQ
- Why do smart quotes cause problems in prompts?
- Smart quotes are Unicode characters (“”‘’) that can tokenize unexpectedly. They can also cause encoding errors if the prompt is embedded in an API request body, since they fall outside the ASCII range and depend on the body being correctly encoded as UTF-8.
- Does extra whitespace affect LLM output quality?
- Extra spaces and blank lines increase token count and can disrupt the model's attention patterns, especially for instruction-following tasks. Clean, consistent formatting produces more reliable results.
- What other characters should I watch out for?
- Em dashes (—), en dashes (–), ellipses (…), non-breaking spaces (\u00A0), and zero-width spaces (\u200B) are all common culprits when copying from documents or web pages.
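A quick way to audit a prompt for the characters listed above before sending it to an API is a simple lookup pass. The `SUSPECT` table and `audit_prompt` function below are an illustrative sketch, not part of any real library:

```python
# Hypothetical audit table covering the common culprits named above.
SUSPECT = {
    "\u201c": "left double quote",
    "\u201d": "right double quote",
    "\u2018": "left single quote",
    "\u2019": "right single quote",
    "\u2014": "em dash",
    "\u2013": "en dash",
    "\u2026": "ellipsis",
    "\u00a0": "non-breaking space",
    "\u200b": "zero-width space",
}

def audit_prompt(text: str) -> list[tuple[int, str, str]]:
    """Return (index, character, description) for each suspect character."""
    return [(i, ch, SUSPECT[ch])
            for i, ch in enumerate(text) if ch in SUSPECT]
```

For example, auditing the string `"don’t guess"` (with a curly apostrophe and a non-breaking space) reports both offending characters with their positions, which makes it easy to log or reject dirty prompts in a pipeline.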
Related Examples
- Split a Long Document for AI Processing
- Clean HTML-Heavy Text for AI Processing