Unicode and Emoji Encoding

Unicode assigns a unique code point (U+XXXX) to characters from the world's writing systems, plus thousands of symbols and emoji. In source code and databases you may encounter characters in their native form, as Unicode escapes, or as UTF-8 byte sequences. This collection shows the relationship between the visual character, its Unicode code point, its UTF-8 encoding, and its JavaScript/Python escape representation. The Unicode encoder converts any text to its escape sequences and back, which is useful for debugging encoding issues in APIs and databases.

Unicode is a universal character catalog; version 15.0 defines over 149,000 characters across 161 scripts. Each character has a code point, a unique integer from 0 to 1,114,111 (U+10FFFF). Code points are written as U+ followed by four to six uppercase hex digits: U+0041 is the letter A, U+1F600 is the grinning face emoji. Unicode itself doesn't specify how code points are stored in bytes; that's the job of encoding schemes like UTF-8, UTF-16, and UTF-32.

UTF-8 is the dominant encoding for the web and most modern systems. It's variable-width: ASCII characters (U+0000 to U+007F) use one byte and are identical to their ASCII values, which makes UTF-8 backward-compatible with ASCII. Code points from U+0080 to U+07FF use two bytes, covering accented Latin letters, Greek, Cyrillic, Hebrew, and Arabic. Code points from U+0800 to U+FFFF use three bytes, covering most other scripts in everyday use, including Chinese, Japanese, and Korean. Code points from U+10000 to U+10FFFF, including most emoji, use four bytes. For example, a string of four such emoji occupies 16 bytes when encoded as UTF-8.

JavaScript strings are sequences of UTF-16 code units, not UTF-8 bytes. Code points from U+0000 to U+FFFF are stored as a single 16-bit code unit whose value equals the code point. Code points above U+FFFF, including most emoji, require two 16-bit units called a surrogate pair. The grinning face emoji U+1F600 is stored as the surrogate pair \uD83D\uDE00.
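
The UTF-8 byte widths and the surrogate pair for U+1F600 can be reproduced with a short Python sketch. The helper names utf8_width and surrogate_pair are hypothetical, chosen for illustration; the loop checks the hand-written width table against Python's built-in UTF-8 encoder.

```python
from typing import Tuple


def utf8_width(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for one code point."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


def surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


# A (ASCII), e-acute, Hebrew alef, a CJK ideograph, grinning face.
for ch in ["A", "\u00E9", "\u05D0", "\u4E2D", "\U0001F600"]:
    cp = ord(ch)
    encoded = ch.encode("utf-8")
    # The table above should agree with the real encoder.
    assert len(encoded) == utf8_width(cp)
    hex_bytes = " ".join(f"{byte:02X}" for byte in encoded)
    print(f"U+{cp:04X}: {len(encoded)} byte(s) -> {hex_bytes}")

high, low = surrogate_pair(0x1F600)
print(f"U+1F600 -> \\u{high:04X}\\u{low:04X}")
```

The surrogate arithmetic subtracts 0x10000 and splits the remaining 20 bits into two 10-bit halves, which is why every code point above U+FFFF needs exactly two UTF-16 code units.
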
The surrogate pair explains why '😀'.length === 2 in JavaScript: the String.length property counts UTF-16 code units, not visual characters. To count code points, use the spread operator: [...emoji].length. To count user-perceived characters, which matters for sequences such as ❤️ (U+2764 U+FE0F) that span several code points, use the Intl.Segmenter API.

Python 3 strings are sequences of Unicode code points, so there are no surrogate pairs to account for. In string literals, the escape \u takes exactly four hex digits (U+0000 to U+FFFF) and \U takes exactly eight hex digits (the full range, including emoji). In Python, len('😀') == 1 because len counts code points, not bytes or UTF-16 code units.

Common encoding bugs: MySQL's utf8 charset (an alias for utf8mb3) is a 3-byte subset of UTF-8 that cannot store 4-byte characters such as most emoji; use utf8mb4 instead. JSON that contains \uXXXX escapes is valid: the JSON spec allows Unicode escapes as an alternative to the literal characters, and code points above U+FFFF are escaped as a surrogate pair. When question marks (?) replace characters in database output, the connection charset or column charset usually doesn't support the inserted characters. To debug, compare the hex byte values at each stage of the data pipeline to find where the bytes change.
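
The counting rules, the JSON escapes, and the hex comparison described above can be sketched in Python. The helpers utf16_units and hex_bytes are hypothetical names, and the lossy ASCII conversion at the end is only an assumed stand-in for a pipeline stage that cannot represent the character; a real database may lose data at a different step.

```python
import json


def utf16_units(text: str) -> int:
    """Count UTF-16 code units, the quantity JavaScript's String.length reports."""
    return len(text.encode("utf-16-le")) // 2


def hex_bytes(text: str, encoding: str = "utf-8") -> str:
    """Render the encoded bytes as spaced hex for side-by-side comparison."""
    return " ".join(f"{byte:02X}" for byte in text.encode(encoding))


grinning = "\U0001F600"   # eight hex digits: code point above U+FFFF
heart = "\u2764\uFE0F"    # four hex digits each: heart plus variation selector

for label, text in [("grinning face", grinning), ("red heart", heart)]:
    print(label,
          "code points:", len(text),
          "UTF-16 units:", utf16_units(text),
          "UTF-8 bytes:", len(text.encode("utf-8")))

# json.dumps escapes non-ASCII by default, using a surrogate pair above U+FFFF.
escaped = json.dumps(grinning)
print(escaped)
assert json.loads(escaped) == grinning  # the escape round-trips losslessly

# Hex at two stages: intact UTF-8 versus an assumed lossy ASCII conversion.
print(hex_bytes(grinning))
lossy = grinning.encode("ascii", errors="replace")
print(lossy)
```

If the hex at one stage shows F0 9F 98 80 and a later stage shows 3F (the byte for ?), the conversion between those two stages is the likely culprit.
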

Example

# Emoji with Unicode code points
😀 = U+1F600 = \uD83D\uDE00 (JS surrogate pair) = \U0001F600 (Python)
❤️  = U+2764 U+FE0F = \u2764\uFE0F
🌍 = U+1F30D = \uD83C\uDF0D
✅ = U+2705 = \u2705

# Common symbols
© = U+00A9 = \u00A9
→ = U+2192 = \u2192
∞ = U+221E = \u221E
° = U+00B0 = \u00B0

# Latin extended
é = U+00E9 = \u00E9 (UTF-8: 0xC3 0xA9)
ñ = U+00F1 = \u00F1 (UTF-8: 0xC3 0xB1)
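
The rows above can be regenerated for any string with a few lines of Python. This is a minimal sketch; code_points, js_escape, python_escape, and utf8_hex are hypothetical helper names, not part of the encoder tool.

```python
def code_points(text: str) -> str:
    """Format each code point in U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def js_escape(text: str) -> str:
    """Escape every UTF-16 code unit, as JavaScript and JSON escapes do."""
    data = text.encode("utf-16-be")
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return "".join(f"\\u{unit:04X}" for unit in units)


def python_escape(text: str) -> str:
    """Escape every code point with a 4-digit or 8-digit Python escape."""
    return "".join(
        f"\\u{ord(ch):04X}" if ord(ch) <= 0xFFFF else f"\\U{ord(ch):08X}"
        for ch in text
    )


def utf8_hex(text: str) -> str:
    """List the UTF-8 bytes in 0xNN form."""
    return " ".join(f"0x{byte:02X}" for byte in text.encode("utf-8"))


# Grinning face, red heart with variation selector, e-acute.
for sample in ["\U0001F600", "\u2764\uFE0F", "\u00E9"]:
    print(f"{sample} = {code_points(sample)} = {js_escape(sample)} (JS) "
          f"= {python_escape(sample)} (Python) (UTF-8: {utf8_hex(sample)})")
```

For code points up to U+FFFF the JavaScript and Python escapes are identical; they differ only above U+FFFF, where JavaScript needs a surrogate pair and Python uses a single eight-digit escape.
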

FAQ

What is the difference between Unicode and UTF-8?: Unicode is the standard that assigns code points to characters. UTF-8 is one of several encodings that define how those code points are stored as bytes. UTF-8 uses 1–4 bytes per character and is backward-compatible with ASCII.
Why do some emoji show as two characters in JavaScript?: JavaScript strings are UTF-16 encoded. Code points above U+FFFF (including most emoji) require two 16-bit code units called a surrogate pair. As a result, the length of such an emoji in JavaScript is 2 even though it displays as one character.
How do I safely store emoji in a MySQL database?: Use the utf8mb4 character set in MySQL. The older utf8 charset only supports UTF-8 sequences of up to 3 bytes and cannot store most emoji, which require 4 bytes. Set both the column charset and the connection charset to utf8mb4.
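
To check in advance whether a given string would be affected by the 3-byte limit, one option is to look for code points above U+FFFF. This is a minimal sketch; needs_utf8mb4 is a hypothetical helper, and it only inspects the text itself, not any database configuration.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any code point needs four UTF-8 bytes (above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)


samples = {
    "grinning face": "\U0001F600",  # U+1F600, four bytes in UTF-8
    "check mark": "\u2705",         # U+2705, three bytes in UTF-8
    "plain ASCII": "hello",
}

for label, text in samples.items():
    print(f"{label}: needs utf8mb4 = {needs_utf8mb4(text)}")
```

Emoji in the Basic Multilingual Plane, such as the check mark U+2705, fit in three bytes, so only strings with code points above U+FFFF trip this check.
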
