Normalize Text with Smart Quotes and Em Dashes for LLM Input

Text copied from word processors, PDFs, and websites often contains Unicode typography characters that look identical to their ASCII equivalents but are encoded differently: smart quotes (“”‘’) instead of straight quotes (" and '), em dashes (—) instead of double hyphens (--), non-breaking spaces (U+00A0) instead of regular spaces, ellipsis characters (…) instead of three dots (...), and various other typographic substitutions.

These characters cause subtle failures in string matching, tokenization, and downstream text processing. LLMs tokenize smart quotes and straight quotes differently, which affects token counts and can change how models interpret quoted text in few-shot examples. Non-breaking spaces break word-boundary regex patterns, and em dashes cause tokenization variance across models. Normalizing to ASCII equivalents before sending text to an LLM produces more predictable tokenization and makes your prompts behave consistently regardless of where the input originated.

The normalizer converts the full set of Unicode typographic characters to their ASCII equivalents and flags any remaining non-ASCII characters for review. It also collapses multiple spaces, trims leading and trailing whitespace, and normalizes line endings to LF, the three most common whitespace issues in text piped through copy-paste workflows.
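The behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation; the character table here covers only the most common substitutions, and the real normalizer handles a larger set.

```python
import re

# Hypothetical subset of the typographic substitutions described above.
TRANSLATIONS = str.maketrans({
    "\u201c": '"',   # left double smart quote
    "\u201d": '"',   # right double smart quote
    "\u2018": "'",   # left single smart quote
    "\u2019": "'",   # right single smart quote
    "\u2014": "--",  # em dash -> double hyphen
    "\u2013": "-",   # en dash -> hyphen
    "\u00a0": " ",   # non-breaking space -> regular space
    "\u2026": "...", # ellipsis character -> three dots
})

def normalize(text: str) -> tuple[str, set[str]]:
    """Normalize typography to ASCII; return (text, remaining non-ASCII chars)."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # line endings -> LF
    text = text.translate(TRANSLATIONS)                    # typography -> ASCII
    text = re.sub(r"[ \t]+", " ", text)                    # collapse space runs
    text = "\n".join(line.strip() for line in text.split("\n"))
    remaining = {ch for ch in text if ord(ch) > 127}       # flag for review
    return text.strip(), remaining
```

For example, `normalize("The system\u2019s \u201cerror\u201d \u2014 caf\u00e9\u2026")` returns the ASCII string `The system's "error" -- café...` and flags `é` as a remaining non-ASCII character, since accented letters are deliberately left untouched.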

Example
The system’s response was “Unexpected error” — which wasn’t helpful at all…

Key findings from the Q3 report:
• Revenue increased 12% year-over-year
• Operating costs remained “within budget” per the CFO’s note
• The new feature—launched in July—drove 40% of growth
[ open in AI Text Normalizer → ]

FAQ

Why do smart quotes cause problems in LLM prompts?
Smart quotes are multi-byte UTF-8 characters that tokenize differently from ASCII quotes. In few-shot examples, a smart-quoted string may tokenize into a different number of tokens than the straight-quoted version, confusing the model about the intended pattern.
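The multi-byte point is easy to verify with plain Python, no tokenizer required (actual token counts depend on the model's tokenizer, which is not shown here):

```python
smart = "\u201chello\u201d"   # “hello” with smart quotes
straight = '"hello"'          # same text with ASCII quotes

# Identical length in code points, different length in UTF-8 bytes:
# each smart quote encodes to 3 bytes, each ASCII quote to 1.
assert len(smart) == len(straight) == 7
assert len(smart.encode("utf-8")) == 11
assert len(straight.encode("utf-8")) == 7
```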
Does this normalizer affect text in languages other than English?
The normalizer targets only punctuation characters and whitespace; it does not change any alphabetic characters, including accented letters and non-Latin scripts, so it is safe to use on multilingual text.
Should I normalize before or after chunking for RAG?
Normalize before chunking. Consistent characters produce consistent embeddings, which improves retrieval accuracy. Normalizing after chunking can shift chunk boundaries and invalidate cached chunk hashes.
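The cache-invalidation point follows directly from how content-addressed chunk caches work: the hash is computed over the chunk's bytes, so changing even one character changes the hash. A minimal sketch (the `chunk_hash` helper is hypothetical, standing in for whatever keying scheme your RAG pipeline uses):

```python
import hashlib

def chunk_hash(chunk: str) -> str:
    # Content-addressed key for a cached chunk/embedding.
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()

raw = "per the CFO\u2019s note"          # smart apostrophe, pre-normalization
norm = raw.replace("\u2019", "'")        # same text after normalization

# One substituted character is enough to produce a different cache key,
# so normalizing after chunking orphans every previously cached hash.
assert chunk_hash(raw) != chunk_hash(norm)
```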

Related Examples