What is Unicode? — The Universal Character Standard
Definition
Unicode is an international character encoding standard that assigns a unique number, called a code point, to every character in every writing system used by humans. It covers over 149,000 characters from more than 150 scripts, including Latin, Cyrillic, Arabic, Chinese, Japanese, Korean, emoji, and mathematical symbols. A Unicode code point is written as U+ followed by a hexadecimal number, such as U+0041 for the letter A or U+1F600 for the grinning face emoji.
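Most programming languages expose code points directly. A minimal Python sketch of the mapping between characters and code points (the `U+XXXX` formatting here is just for display):

```python
# ord() returns a character's code point; chr() goes the other way.
print(f"U+{ord('A'):04X}")    # the code point of 'A'
print(f"U+{ord('😀'):04X}")   # the code point of the grinning face emoji
print(chr(0x4E2D))            # the character at code point U+4E2D
```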
How It Works
Unicode defines code points but not how they are stored in memory or files — that is the role of encodings like UTF-8 and UTF-16. UTF-8 is the most widely used encoding: it uses 1 byte for ASCII characters (U+0000 to U+007F), 2 bytes for common European and Middle Eastern scripts, 3 bytes for most Asian scripts, and 4 bytes for rare scripts and emoji. UTF-16 uses 2 bytes for most characters and 4 bytes (surrogate pairs) for those outside the Basic Multilingual Plane. UTF-8 is preferred on the web because it is backward-compatible with ASCII and efficient for English text.
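The varying byte lengths described above can be observed by encoding a character and counting the result. A short Python sketch, using one character from each UTF-8 length class:

```python
# UTF-8 byte length grows with the code point value; UTF-16 uses
# 2 bytes inside the BMP and 4 bytes (a surrogate pair) outside it.
for ch in ["A", "é", "中", "😀"]:
    utf8_len = len(ch.encode("utf-8"))
    utf16_len = len(ch.encode("utf-16-le"))  # -le: no BOM prepended
    print(f"{ch}: {utf8_len} bytes in UTF-8, {utf16_len} bytes in UTF-16")
```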
Common Use Cases
- Displaying text in any human language on a web page without special handling
- Storing and processing multilingual user data in databases and APIs
- Exchanging emoji and special symbols between applications reliably
- Building internationalized (i18n) and localized (l10n) software
- Encoding source code comments and string literals in non-ASCII languages
Example
U+0041 → A (Latin capital letter A)
U+00E9 → é (Latin small letter e with acute)
U+4E2D → 中 (CJK unified ideograph)
U+1F600 → 😀 (grinning face emoji)
UTF-8 for "é": 0xC3 0xA9 (2 bytes)
UTF-8 for "😀": 0xF0 0x9F 0x98 0x80 (4 bytes)
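The UTF-8 byte sequences above can be reproduced in Python, where `str.encode` returns the raw bytes:

```python
# Encode to UTF-8 and show the bytes as space-separated hex.
print("é".encode("utf-8").hex(" "))    # c3 a9
print("😀".encode("utf-8").hex(" "))   # f0 9f 98 80
```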
FAQ
- What is the difference between Unicode and UTF-8?
- Unicode is the standard that defines code points (numbers for characters). UTF-8 is one encoding of those code points into bytes. There are other encodings of Unicode, such as UTF-16 and UTF-32. When people say "Unicode" in a programming context, they often mean UTF-8 specifically.
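The distinction shows up in practice: the same code point produces different bytes under different encodings, and decoding with the wrong encoding corrupts the text. A Python sketch:

```python
s = "é"  # one code point, U+00E9
print(s.encode("utf-8"))      # 2 bytes: C3 A9
print(s.encode("utf-16-le"))  # 2 bytes: E9 00 (different layout)
print(s.encode("utf-32-le"))  # 4 bytes: every code point is fixed-width

# Decoding UTF-8 bytes as Latin-1 produces mojibake:
print(b"\xc3\xa9".decode("latin-1"))  # 'Ã©'
```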
- Why are emoji 4 bytes in UTF-8?
- Emoji have code points above U+FFFF, the upper boundary of the Basic Multilingual Plane. Such code points require 4 bytes in UTF-8 and a surrogate pair (two 16-bit code units) in UTF-16. This is why emoji can break string length calculations that assume one code unit per character.
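The length mismatch can be shown directly. Python's `len` counts code points, while languages with UTF-16 strings (such as JavaScript's `String.length`) count 16-bit code units; a Python sketch of both views:

```python
emoji = "😀"  # U+1F600, outside the BMP
print(len(emoji))                           # 1 code point
print(len(emoji.encode("utf-16-le")) // 2)  # 2 UTF-16 code units (surrogate pair)
print(len(emoji.encode("utf-8")))           # 4 UTF-8 bytes
```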
- What is a BOM (Byte Order Mark)?
- The BOM is the character U+FEFF placed at the start of a file to indicate byte order and encoding. UTF-8 with BOM is not recommended for web use because it can cause parsing issues. UTF-8 without BOM is the standard for HTML and JSON files.
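Python exposes the BOM-writing variant as the `utf-8-sig` codec. A sketch showing the three extra bytes it prepends, and that decoding with the same codec strips the BOM again:

```python
import codecs

text = "hello"
with_bom = text.encode("utf-8-sig")
print(with_bom[:3])                  # EF BB BF — U+FEFF encoded in UTF-8
print(with_bom.decode("utf-8-sig"))  # 'hello' — BOM stripped on decode
print(codecs.BOM_UTF8)               # the same three bytes, as a named constant
```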