AI Text Normalizer
Normalize smart quotes, dashes, ligatures, and accented characters for consistent LLM input.
Convert typographic characters to their ASCII equivalents for consistent LLM tokenization. The tool replaces smart quotes, em and en dashes, ellipses, typographic ligatures (ﬁ, ﬂ, ﬀ), and common accented characters, and reports a count of all replacements made.
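The replacement step described above can be sketched in Python as a lookup table plus a counter. This is a minimal illustration, not the tool's actual implementation: the `REPLACEMENTS` table, the function name `normalize_typography`, and the choice to map an em dash to `--` are all assumptions made for the example.

```python
from typing import Tuple

# Hypothetical mapping; the tool's real table is not shown in the source.
REPLACEMENTS = {
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u2014": "--",   # em dash (assumed ASCII form)
    "\u2013": "-",    # en dash
    "\u2026": "...",  # horizontal ellipsis
    "\ufb01": "fi",   # fi ligature
    "\ufb02": "fl",   # fl ligature
    "\ufb00": "ff",   # ff ligature
}


def normalize_typography(text: str) -> Tuple[str, int]:
    """Return the ASCII-normalized text and the number of characters replaced."""
    out = []
    count = 0
    for ch in text:
        if ch in REPLACEMENTS:
            out.append(REPLACEMENTS[ch])
            count += 1
        else:
            out.append(ch)
    return "".join(out), count


text, count = normalize_typography("\u201cWait\u2026 it\u2019s \ufb01ne,\u201d she said.")
print(text)   # "Wait... it's fine," she said.
print(count)  # 5
```

Counting one replacement per source character (rather than per output character) keeps the reported total easy to check against the input.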
Related Tools
Clean and sanitize text for LLM input by stripping HTML, normalizing Unicode, and collapsing whitespace.
Full preprocessing pipeline for LLM input: trim, normalize, strip HTML, collapse whitespace, and truncate to context window.
Remove invisible Unicode, escape injection keywords, and strip dangerous content from LLM input.
Remove duplicate and near-duplicate lines from text using exact matching and Jaccard similarity.
Learn More
FAQ
- Why do smart quotes and dashes cause problems for LLMs?
- Smart quotes (“”‘’) and typographic dashes (—–) are distinct Unicode characters, each three bytes long in UTF-8, and they tokenize differently from their ASCII equivalents. The same sentence can therefore map to different token sequences depending on which quote or dash style it uses, and token counts can be slightly inflated. Normalizing them ensures consistent processing.
- What are typographic ligatures and why remove them?
- Ligatures like ﬁ (fi), ﬂ (fl), and ﬀ (ff) are single Unicode code points that represent combined letter pairs. Copy-pasted text from PDFs often contains them. They tokenize as rare characters rather than common letter pairs, so replacing them improves tokenization.
- Does this tool remove all accented characters?
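- It does not delete them. It converts accented characters, such as those common in loanwords (café, naïve, résumé), to their ASCII base letters (cafe, naive, resume). It uses NFKD decomposition followed by removal of combining marks, so it covers any character that decomposes into a base letter plus diacritics.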
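The NFKD-plus-combining-mark technique mentioned in the last answer can be sketched with Python's standard `unicodedata` module. The function name `strip_accents` is hypothetical, and this is an illustration of the general technique rather than the tool's own code.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose with NFKD, then drop combining marks (the diacritics)."""
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns a nonzero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("café, naïve, résumé"))  # cafe, naive, resume
```

Because NFKD is a compatibility decomposition, it also expands ligatures such as U+FB01 into "fi". Letters that have no decomposition, such as ø, pass through unchanged and would need an explicit mapping.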