UTF-8 vs Unicode — Encoding vs Standard Explained
UTF-8 and Unicode are frequently conflated but they describe different things: Unicode is a character standard that assigns code points to characters, while UTF-8 is a specific encoding of those code points into bytes. Confusing the two leads to misunderstandings about character encoding, byte sizes, and string handling in programming languages. This comparison clarifies what each does and why the distinction matters for developers.
Comparison Table
| Aspect | UTF-8 | Unicode |
|---|---|---|
| What it is | An encoding format that serializes Unicode code points as bytes | A standard that assigns a unique number to every character |
| Byte representation | 1–4 bytes per character; ASCII is 1 byte | Defines code points (numbers), not bytes |
| ASCII compatibility | Fully backward-compatible with ASCII (0x00–0x7F) | Unicode includes ASCII as its first 128 code points |
| Alternative encodings | One of several Unicode encodings | UTF-8, UTF-16, UTF-32 all encode Unicode |
| Web usage | The dominant encoding for HTML, JSON, URLs on the web | The standard for all text; not an encoding itself |
| JavaScript strings | Used for I/O and storage; JS source files should be UTF-8 | JavaScript strings are internally UTF-16 (code units) |
When to Use UTF-8
UTF-8 is the correct encoding for virtually all web development — HTML files, JSON, XML, source code, database text columns, and API payloads should all be UTF-8. Because UTF-8 is ASCII-compatible, plain ASCII text (including most English prose) is stored at 1 byte per character, while all other Unicode characters take 2–4 bytes. UTF-8 is the default encoding for the web and most modern systems.
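Those per-character byte counts can be checked directly with the standard TextEncoder API (available in browsers and Node.js), which always encodes to UTF-8. A minimal sketch:

```javascript
// Count the UTF-8 bytes a string occupies.
const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

const utf8Bytes = (s) => encoder.encode(s).length;

console.log(utf8Bytes("A"));  // 1 byte  (ASCII, U+0041)
console.log(utf8Bytes("é"));  // 2 bytes (U+00E9)
console.log(utf8Bytes("€"));  // 3 bytes (U+20AC)
console.log(utf8Bytes("😀")); // 4 bytes (U+1F600)
```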
When to Use Unicode
"Unicode" is not an encoding you choose — it is a standard your system supports. When you choose UTF-8, you are choosing how to encode Unicode code points as bytes. When people say "use Unicode" they typically mean "use an encoding that supports the full Unicode range" rather than ASCII-only or a legacy single-byte encoding like ISO-8859-1.
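The two sides of that distinction are visible in standard JavaScript: codePointAt returns the Unicode number the standard assigns (the standard's side), while TextEncoder shows how UTF-8 serializes that number as bytes (the encoding's side). A short sketch:

```javascript
const euro = "€";

// Unicode side: the standard assigns this character the code point U+20AC.
console.log(euro.codePointAt(0).toString(16)); // "20ac"

// Encoding side: UTF-8 serializes that single code point as three bytes.
const bytes = new TextEncoder().encode(euro);
console.log([...bytes].map((b) => b.toString(16))); // ["e2", "82", "ac"]
```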
Convert Between UTF-8 and Unicode
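In JavaScript, the standard TextEncoder and TextDecoder APIs convert between a Unicode string (a sequence of code points) and its UTF-8 byte representation. A minimal round-trip sketch:

```javascript
// Unicode string -> UTF-8 bytes.
const bytes = new TextEncoder().encode("héllo");
console.log(bytes); // Uint8Array of 6 bytes: "é" becomes 0xC3 0xA9

// UTF-8 bytes -> Unicode string.
const text = new TextDecoder("utf-8").decode(bytes);
console.log(text); // "héllo"
```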
FAQ
- Why do JavaScript strings behave oddly with emoji?
- JavaScript strings are sequences of UTF-16 code units, not Unicode code points. Characters above U+FFFF (including most emoji) are stored as surrogate pairs, that is, two UTF-16 code units. As a result an emoji has a .length of 2 in JavaScript even though it is a single code point. Use the spread operator ([...str]) or Array.from(str) to get an array of actual code points.
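A quick sketch of that behavior in plain JavaScript, using the 😀 emoji (U+1F600):

```javascript
const face = "😀"; // U+1F600, above U+FFFF

console.log(face.length);      // 2  (UTF-16 code units: a surrogate pair)
console.log([...face].length); // 1  (spread iterates by code point)
console.log(face.codePointAt(0).toString(16)); // "1f600"

// charCodeAt exposes the individual surrogate halves:
console.log(face.charCodeAt(0).toString(16)); // "d83d" (high surrogate)
console.log(face.charCodeAt(1).toString(16)); // "de00" (low surrogate)
```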
- What is the difference between UTF-8 and UTF-16?
- UTF-8 uses 1–4 bytes per character and is ASCII-compatible. UTF-16 uses 2 bytes for characters in the Basic Multilingual Plane and 4 bytes (a surrogate pair) for characters above U+FFFF. UTF-8 is more compact for ASCII-heavy text; UTF-16 is often more compact for CJK-heavy text. Windows internal APIs and JavaScript strings use UTF-16; the web overwhelmingly uses UTF-8.
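The efficiency trade-off can be sketched in JavaScript. A string's UTF-16 size is its .length in code units times 2 bytes (assuming no unpaired surrogates); the UTF-8 size comes from TextEncoder:

```javascript
const utf8Size  = (s) => new TextEncoder().encode(s).length;
const utf16Size = (s) => s.length * 2; // 2 bytes per UTF-16 code unit

// ASCII-heavy text: UTF-8 is smaller.
console.log(utf8Size("hello"), utf16Size("hello")); // 5 vs 10 bytes

// CJK text: UTF-16 is smaller (BMP CJK characters take 3 bytes in UTF-8, 2 in UTF-16).
console.log(utf8Size("日本語"), utf16Size("日本語")); // 9 vs 6 bytes
```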
- What is a BOM (Byte Order Mark) and should I use it?
- The BOM is the Unicode character U+FEFF placed at the start of a file to indicate encoding and byte order. UTF-8 with BOM is not recommended for web files (HTML, JSON) because it can cause parsing errors. UTF-8 does not need a BOM because it has no byte-order ambiguity. Only use UTF-8 BOM when specifically required by a tool.
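At the byte level, the UTF-8 BOM is the three bytes 0xEF 0xBB 0xBF (the UTF-8 encoding of U+FEFF). A sketch showing how TextDecoder handles it — by default a leading BOM is stripped:

```javascript
// A UTF-8 byte stream that starts with a BOM, followed by "hi".
const withBom = new Uint8Array([0xef, 0xbb, 0xbf, 0x68, 0x69]);

// Default behavior: the leading BOM is recognized and removed.
console.log(new TextDecoder("utf-8").decode(withBom)); // "hi"

// With ignoreBOM: true, the BOM is kept as the character U+FEFF.
console.log(new TextDecoder("utf-8", { ignoreBOM: true }).decode(withBom)); // "\uFEFFhi"
```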