AI Text Normalizer

Normalize smart quotes, dashes, ligatures, and accented characters for consistent LLM input.

FAQ

Why do smart quotes and dashes cause problems for LLMs?
Smart quotes (“”‘’) and typographic dashes (—–) are non-ASCII Unicode characters, each three bytes long in UTF-8, and they tokenize differently from their ASCII equivalents. The same sentence can therefore produce different token sequences depending on where it was copied from, and token counts can rise slightly. Normalizing these characters to ASCII gives the model one consistent form.
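As a rough illustration of the point above, the following Python sketch maps smart quotes and dashes to ASCII and compares UTF-8 byte lengths before and after. The mapping table and the name `normalize_punctuation` are hypothetical; the tool's actual replacement table is not shown here, and mapping both dashes to a single hyphen is only one possible choice.

```python
# Hypothetical mapping table; the tool's real table may differ.
PUNCT_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u2014": "-",  # em dash (assumed to map to one hyphen)
    "\u2013": "-",  # en dash
})

def normalize_punctuation(text: str) -> str:
    """Replace smart quotes and typographic dashes with ASCII characters."""
    return text.translate(PUNCT_MAP)

sample = "\u201cHello\u201d \u2014 it\u2019s fine"
cleaned = normalize_punctuation(sample)

# Each typographic character occupies three bytes in UTF-8,
# while its ASCII replacement occupies one.
print(cleaned)
print(len(sample.encode("utf-8")), len(cleaned.encode("utf-8")))
```

Byte length is only a proxy here. Actual token counts depend on the tokenizer in use.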
What are typographic ligatures and why replace them?
Ligatures like ﬁ (fi), ﬂ (fl), and ﬀ (ff) are single Unicode code points that each stand for a sequence of letters. Text copy-pasted from PDFs often contains them. Tokenizers treat them as rare characters rather than common letter pairs, so expanding them into plain letters improves tokenization.
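One way to expand these ligatures in Python is through their Unicode compatibility decompositions. This is a minimal sketch, not necessarily how the tool does it: the function name `expand_ligatures` is hypothetical, and the range is limited to the five Latin f-ligatures (U+FB00 to U+FB04) as an assumption.

```python
import unicodedata

def expand_ligatures(text: str) -> str:
    """Expand the Latin f-ligatures U+FB00..U+FB04 into plain letters."""
    return "".join(
        # NFKD turns a ligature code point into its component letters.
        unicodedata.normalize("NFKD", ch) if "\ufb00" <= ch <= "\ufb04" else ch
        for ch in text
    )

print(expand_ligatures("\ufb01nal \ufb02ow o\ufb00er"))
```

Restricting the decomposition to the ligature range leaves every other character, including accented letters, untouched.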
Does this tool remove all accented characters?
It converts accented characters, such as those found in loanwords (café, naïve, résumé), to their ASCII base letters. It applies NFKD decomposition and then removes the combining marks, which covers any character that decomposes into a base letter plus diacritics.
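The NFKD-plus-mark-removal approach can be sketched with Python's standard `unicodedata` module. The function name `strip_accents` is hypothetical, and the tool's own implementation may differ in detail.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose with NFKD, then drop characters with a combining class."""
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns 0 for base characters and a
    # non-zero class for combining marks such as U+0301 (acute accent).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("caf\u00e9 na\u00efve r\u00e9sum\u00e9"))
```

Letters that have no Unicode decomposition, such as ø, pass through this step unchanged, so a normalizer that wants to convert them needs an explicit mapping in addition.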

Convert typographic characters to their ASCII equivalents for consistent LLM tokenization. The tool replaces smart quotes, em and en dashes, ellipses, typographic ligatures (ﬁ, ﬂ, ﬀ), and common accented characters, and it shows a count of all replacements made.
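The replacement count could be produced along the following lines. This is a sketch under stated assumptions: the table `REPLACEMENTS` and the function `normalize_with_count` are hypothetical, and counting one replacement per changed input character is a guess at how the tool tallies its total.

```python
import unicodedata

# Hypothetical explicit table for characters NFKD does not map to ASCII.
REPLACEMENTS = {
    "\u201c": '"', "\u201d": '"',  # double smart quotes
    "\u2018": "'", "\u2019": "'",  # single smart quotes
    "\u2014": "-", "\u2013": "-",  # em dash, en dash
    "\u2026": "...",               # horizontal ellipsis
}

def normalize_with_count(text: str) -> tuple[str, int]:
    """Return the normalized text and the number of characters changed."""
    out = []
    count = 0
    for ch in text:
        if ch in REPLACEMENTS:
            new = REPLACEMENTS[ch]
        else:
            # NFKD expands ligatures and splits accented letters;
            # combining marks are then dropped.
            decomposed = unicodedata.normalize("NFKD", ch)
            new = "".join(c for c in decomposed if not unicodedata.combining(c))
        if new != ch:
            count += 1
        out.append(new)
    return "".join(out), count

text, n = normalize_with_count("\u201cna\u00efve\u201d \ufb01les\u2026")
print(text, n)
```

Note that NFKD also rewrites other compatibility characters, such as superscript digits and fullwidth letters, so this sketch changes more than the five categories listed above.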