UTF-8 vs Unicode — Encoding vs Standard Explained
UTF-8 and Unicode are frequently conflated but they describe different things: Unicode is a character standard that assigns code points to characters, while UTF-8 is a specific encoding of those code points into bytes. Confusing the two leads to misunderstandings about character encoding, byte sizes, and string handling in programming languages. This comparison clarifies what each does and why the distinction matters for developers.
Comparison Table
| Aspect | UTF-8 | Unicode |
|---|---|---|
| What it is | An encoding format that serializes Unicode code points as bytes | A standard that assigns a unique number to every character |
| Byte representation | 1–4 bytes per character; ASCII is 1 byte | Defines code points (numbers), not bytes |
| ASCII compatibility | Fully backward-compatible with ASCII (0x00–0x7F) | Unicode includes ASCII as its first 128 code points |
| Alternative encodings | One of several Unicode encodings | UTF-8, UTF-16, UTF-32 all encode Unicode |
| Web usage | The dominant encoding for HTML, JSON, URLs on the web | The standard for all text; not an encoding itself |
| JavaScript strings | Used for I/O and storage; JS source files should be UTF-8 | JavaScript strings are internally UTF-16 (code units) |
When to Use UTF-8
UTF-8 is the correct encoding for virtually all web development — HTML files, JSON, XML, source code, database text columns, and API payloads should all be UTF-8. Because UTF-8 is ASCII-compatible, plain ASCII text (including most English prose) is stored at 1 byte per character, while all other Unicode characters take 2–4 bytes. UTF-8 is the default encoding for the web and most modern systems.
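Those per-character byte counts can be checked directly with the standard TextEncoder API (available in browsers and Node.js), which always encodes to UTF-8. A minimal sketch:

```javascript
// Count the UTF-8 bytes a string occupies.
const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

const utf8Bytes = (s) => encoder.encode(s).length;

console.log(utf8Bytes("A"));  // 1 byte  (ASCII, U+0041)
console.log(utf8Bytes("é"));  // 2 bytes (U+00E9)
console.log(utf8Bytes("€"));  // 3 bytes (U+20AC)
console.log(utf8Bytes("😀")); // 4 bytes (U+1F600)
```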
When to Use Unicode
"Unicode" is not an encoding you choose — it is a standard your system supports. When you choose UTF-8, you are choosing how to encode Unicode code points as bytes. When people say "use Unicode" they typically mean "use an encoding that supports the full Unicode range" rather than ASCII-only or a legacy single-byte encoding like ISO-8859-1.
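The two sides of that distinction are visible in standard JavaScript: codePointAt returns the Unicode number the standard assigns (the standard's side), while TextEncoder shows how UTF-8 serializes that number as bytes (the encoding's side). A short sketch:

```javascript
const euro = "€";

// Unicode side: the standard assigns this character the code point U+20AC.
console.log(euro.codePointAt(0).toString(16)); // "20ac"

// Encoding side: UTF-8 serializes that single code point as three bytes.
const bytes = new TextEncoder().encode(euro);
console.log([...bytes].map((b) => b.toString(16))); // ["e2", "82", "ac"]
```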
Convert Between UTF-8 and Unicode
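In JavaScript, the standard TextEncoder and TextDecoder APIs convert between a Unicode string (a sequence of code points) and its UTF-8 byte representation. A minimal round-trip sketch:

```javascript
// Unicode string -> UTF-8 bytes.
const bytes = new TextEncoder().encode("héllo");
console.log(bytes); // Uint8Array of 6 bytes: "é" becomes 0xC3 0xA9

// UTF-8 bytes -> Unicode string.
const text = new TextDecoder("utf-8").decode(bytes);
console.log(text); // "héllo"
```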
FAQ
- Why do JavaScript strings behave oddly with emoji?
- JavaScript strings are sequences of UTF-16 code units, not Unicode code points. Characters above U+FFFF (including most emoji) are stored as surrogate pairs, that is, two UTF-16 code units. As a result an emoji has a .length of 2 in JavaScript even though it is a single code point. Use the spread operator ([...str]) or Array.from(str) to get an array of actual code points.
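A quick sketch of that behavior in plain JavaScript, using the 😀 emoji (U+1F600):

```javascript
const face = "😀"; // U+1F600, above U+FFFF

console.log(face.length);      // 2  (UTF-16 code units: a surrogate pair)
console.log([...face].length); // 1  (spread iterates by code point)
console.log(face.codePointAt(0).toString(16)); // "1f600"

// charCodeAt exposes the individual surrogate halves:
console.log(face.charCodeAt(0).toString(16)); // "d83d" (high surrogate)
console.log(face.charCodeAt(1).toString(16)); // "de00" (low surrogate)
```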
- What is the difference between UTF-8 and UTF-16?
- UTF-8 uses 1–4 bytes per character and is ASCII-compatible. UTF-16 uses 2 bytes for characters in the Basic Multilingual Plane and 4 bytes (a surrogate pair) for characters above U+FFFF. UTF-8 is more compact for ASCII-heavy text; UTF-16 is often more compact for CJK-heavy text. Windows internal APIs and JavaScript strings use UTF-16; the web overwhelmingly uses UTF-8.
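The efficiency trade-off can be sketched in JavaScript. A string's UTF-16 size is its .length in code units times 2 bytes (assuming no unpaired surrogates); the UTF-8 size comes from TextEncoder:

```javascript
const utf8Size  = (s) => new TextEncoder().encode(s).length;
const utf16Size = (s) => s.length * 2; // 2 bytes per UTF-16 code unit

// ASCII-heavy text: UTF-8 is smaller.
console.log(utf8Size("hello"), utf16Size("hello")); // 5 vs 10 bytes

// CJK text: UTF-16 is smaller (BMP CJK characters take 3 bytes in UTF-8, 2 in UTF-16).
console.log(utf8Size("日本語"), utf16Size("日本語")); // 9 vs 6 bytes
```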
- What is a BOM (Byte Order Mark) and should I use it?
- The BOM is the Unicode character U+FEFF placed at the start of a file to indicate encoding and byte order. UTF-8 with BOM is not recommended for web files (HTML, JSON) because it can cause parsing errors. UTF-8 does not need a BOM because it has no byte-order ambiguity. Only use UTF-8 BOM when specifically required by a tool.
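At the byte level, the UTF-8 BOM is the three bytes 0xEF 0xBB 0xBF (the UTF-8 encoding of U+FEFF). A sketch showing how TextDecoder handles it — by default a leading BOM is stripped:

```javascript
// A UTF-8 byte stream that starts with a BOM, followed by "hi".
const withBom = new Uint8Array([0xef, 0xbb, 0xbf, 0x68, 0x69]);

// Default behavior: the leading BOM is recognized and removed.
console.log(new TextDecoder("utf-8").decode(withBom)); // "hi"

// With ignoreBOM: true, the BOM is kept as the character U+FEFF.
console.log(new TextDecoder("utf-8", { ignoreBOM: true }).decode(withBom)); // "\uFEFFhi"
```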