
UTF-8 vs Unicode — Encoding vs Standard Explained

UTF-8 and Unicode are frequently conflated but they describe different things: Unicode is a character standard that assigns code points to characters, while UTF-8 is a specific encoding of those code points into bytes. Confusing the two leads to misunderstandings about character encoding, byte sizes, and string handling in programming languages. This comparison clarifies what each does and why the distinction matters for developers.

Comparison Table

Aspect | UTF-8 | Unicode
------ | ----- | -------
What it is | An encoding format that serializes Unicode code points as bytes | A standard that assigns a unique number to every character
Byte representation | 1–4 bytes per character; ASCII is 1 byte | Defines code points (numbers), not bytes
ASCII compatibility | Fully backward-compatible with ASCII (0x00–0x7F) | Includes ASCII as its first 128 code points
Alternative encodings | One of several Unicode encodings | UTF-8, UTF-16, and UTF-32 all encode Unicode
Web usage | The dominant encoding for HTML, JSON, and URLs on the web | The standard for all text; not an encoding itself
JavaScript strings | Used for I/O and storage; JS source files should be UTF-8 | JavaScript strings are internally UTF-16 (code units)

When to Use UTF-8

UTF-8 is the correct encoding for virtually all web development — HTML files, JSON, XML, source code, database text columns, and API payloads should all be UTF-8. Its ASCII compatibility means English text is stored as 1 byte per character, and all other Unicode characters are 2–4 bytes. UTF-8 is the default encoding for the web and most modern systems.
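The variable byte widths above can be checked directly with the standard TextEncoder API (available in browsers and Node.js). A minimal sketch:

```javascript
// Byte length of a string when encoded as UTF-8.
const utf8Bytes = (s) => new TextEncoder().encode(s).length;

console.log(utf8Bytes("A"));  // 1 byte  (ASCII, U+0041)
console.log(utf8Bytes("é"));  // 2 bytes (U+00E9)
console.log(utf8Bytes("€"));  // 3 bytes (U+20AC)
console.log(utf8Bytes("😀")); // 4 bytes (U+1F600)
```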

When to Use Unicode

"Unicode" is not an encoding you choose — it is a standard your system supports. When you choose UTF-8, you are choosing how to encode Unicode code points as bytes. When people say "use Unicode" they typically mean "use an encoding that supports the full Unicode range" rather than ASCII-only or a legacy single-byte encoding like ISO-8859-1.
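The code-point-versus-bytes distinction can be made concrete in a few lines of JavaScript: the code point is a number assigned by the Unicode standard, while the bytes depend on the encoding you picked.

```javascript
const s = "é";

// The Unicode code point: assigned by the standard, independent of encoding.
console.log(s.codePointAt(0).toString(16)); // "e9" → U+00E9

// The UTF-8 encoding of that code point: the bytes actually stored or sent.
console.log(Array.from(new TextEncoder().encode(s))); // [195, 169] → 0xC3 0xA9
```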


FAQ

Why do JavaScript strings behave oddly with emoji?
JavaScript strings are sequences of UTF-16 code units, not Unicode code points. Characters above U+FFFF (like most emoji) are stored as surrogate pairs — two UTF-16 code units. This means a single emoji has a .length of 2 in JavaScript, even though it is one visible character. Use the spread operator ([...str]) or Array.from(str) to get an array of actual characters (code points).
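The surrogate-pair behavior described above is easy to observe; a quick sketch:

```javascript
const emoji = "😀"; // U+1F600, above U+FFFF

console.log(emoji.length);      // 2 — UTF-16 code units (a surrogate pair)
console.log([...emoji].length); // 1 — actual code points

// The two surrogate code units that make up the pair:
console.log(emoji.charCodeAt(0).toString(16)); // "d83d" (high surrogate)
console.log(emoji.charCodeAt(1).toString(16)); // "de00" (low surrogate)
console.log(emoji.codePointAt(0).toString(16)); // "1f600" — the real code point
```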
What is the difference between UTF-8 and UTF-16?
UTF-8 uses 1–4 bytes per character and is ASCII-compatible. UTF-16 uses 2 bytes for most characters (Basic Multilingual Plane) and 4 bytes for characters above U+FFFF. UTF-8 is more efficient for ASCII-heavy text; UTF-16 is more efficient for Asian scripts. Windows internal APIs and JavaScript use UTF-16; the web uses UTF-8.
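The size trade-off can be compared directly in Node.js, whose Buffer.byteLength reports the encoded size of a string (this uses the Node-specific Buffer API, so it will not run in a browser):

```javascript
// Node.js-specific: compare the encoded size of a string in both encodings.
const sizes = (s) => ({
  utf8: Buffer.byteLength(s, "utf8"),
  utf16: Buffer.byteLength(s, "utf16le"),
});

console.log(sizes("hello")); // { utf8: 5, utf16: 10 } — ASCII text favors UTF-8
console.log(sizes("漢字"));  // { utf8: 6, utf16: 4 }  — CJK text favors UTF-16
```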
What is a BOM (Byte Order Mark) and should I use it?
The BOM is the Unicode character U+FEFF placed at the start of a file to indicate encoding and byte order. UTF-8 with BOM is not recommended for web files (HTML, JSON) because it can cause parsing errors. UTF-8 does not need a BOM because it has no byte-order ambiguity. Only use UTF-8 BOM when specifically required by a tool.
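When a tool does emit a BOM, it appears as the three bytes EF BB BF at the start of the UTF-8 data. A minimal sketch of detecting and stripping it (stripUtf8Bom is an illustrative helper, not a standard API):

```javascript
// The UTF-8 encoding of U+FEFF: the bytes EF BB BF.
const BOM = [0xef, 0xbb, 0xbf];

function stripUtf8Bom(bytes) {
  const hasBom = bytes.length >= 3 && BOM.every((b, i) => bytes[i] === b);
  return hasBom ? bytes.slice(3) : bytes;
}

// U+FEFF at the start of a string encodes to the BOM bytes in UTF-8:
const withBom = new TextEncoder().encode("\uFEFF{}");
console.log(withBom.length);               // 5 — 3 BOM bytes + "{}"
console.log(stripUtf8Bom(withBom).length); // 2 — just "{}"
```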

