Question 1

Why do smart quotes and dashes cause problems for LLMs?

Accepted Answer

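Smart quotes (“”‘’) and typographic dashes (—–) are distinct Unicode characters that each take three bytes in UTF-8, versus one byte for the ASCII quote, apostrophe, and hyphen. Tokenizers typically treat them as different from their ASCII equivalents, so the same sentence can produce different token sequences depending on which punctuation style it uses, and token counts can rise slightly. Normalizing them to ASCII gives the model one consistent form to process.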
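As a rough illustration of that normalization, here is a minimal Python sketch. The mapping table and the function name `normalize_punctuation` are assumptions made for this example, not the tool's actual implementation; in particular, replacing both dashes with a single hyphen is just one possible choice.

```python
# Hypothetical replacement table; the tool's real mapping is not documented here.
PUNCT_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (also used as apostrophe)
    "\u2014": "-",  # em dash (assumed replacement: one hyphen)
    "\u2013": "-",  # en dash (assumed replacement: one hyphen)
})


def normalize_punctuation(text: str) -> str:
    """Replace smart quotes and typographic dashes with ASCII equivalents."""
    return text.translate(PUNCT_MAP)


if __name__ == "__main__":
    sample = "\u201cHi\u201d \u2014 it\u2019s fine"
    print(normalize_punctuation(sample))
    # Each of these characters occupies three bytes in UTF-8.
    print(len("\u201c".encode("utf-8")))
```

A translation table keeps the replacement to a single pass over the string, and new characters can be added to the table without touching the function.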

Question 2

What are typographic ligatures and why remove them?

Accepted Answer

Ligatures such as ﬁ (fi), ﬂ (fl), and ﬀ (ff) are single Unicode code points that each stand for a pair of letters. Text copied from PDFs often contains them. A tokenizer sees a ligature as one rare character rather than two common letters, so a word like "ﬁle" no longer matches the usual tokens for "file". Replacing each ligature with its plain letters restores the expected tokenization.
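The replacement can be sketched in Python as follows. The table below and the name `expand_ligatures` are assumptions for illustration; it covers the three ligatures named above plus ﬃ and ﬄ, and the tool's real list may differ.

```python
# Hypothetical ligature table; the tool's actual list is not given in this FAQ.
LIGATURE_MAP = str.maketrans({
    "\ufb00": "ff",   # LATIN SMALL LIGATURE FF
    "\ufb01": "fi",   # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",   # LATIN SMALL LIGATURE FL
    "\ufb03": "ffi",  # LATIN SMALL LIGATURE FFI
    "\ufb04": "ffl",  # LATIN SMALL LIGATURE FFL
})


def expand_ligatures(text: str) -> str:
    """Replace each ligature code point with its plain ASCII letters."""
    return text.translate(LIGATURE_MAP)


if __name__ == "__main__":
    print(expand_ligatures("\ufb01le \ufb02ow o\ufb00er"))
```

Unicode NFKC normalization also expands these ligatures, but it changes many other compatibility characters at the same time, so an explicit table is the narrower option when only ligatures should be touched.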

Question 3

Does this tool remove all accented characters?

Accepted Answer

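It converts accented characters, such as those in the loanwords café, naïve, and résumé, to their ASCII base letters. It does this with NFKD decomposition, which splits each accented character into a base letter followed by combining marks, and then removes the combining marks. Because the method works on the decomposition rather than a fixed list of words or characters, it covers accented letters in general, not only the common loanword cases.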
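The NFKD-plus-mark-removal approach can be sketched with Python's standard `unicodedata` module. The function name `strip_accents` is an assumption for this example, and the tool's own code may differ in detail.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose with NFKD, then drop combining marks (the diacritics)."""
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns 0 for base characters and a
    # non-zero combining class for marks such as U+0301 (acute accent).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_accents("caf\u00e9 na\u00efve r\u00e9sum\u00e9"))
```

Letters that have no Unicode decomposition, such as ø or ß, pass through this sketch unchanged, so handling them would need an explicit mapping in addition to this step.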

AI Text Normalizer
