XML in a Nutshell, 3rd Edition

by Elliotte Rusty Harold, W. Scott Means
September 2004
Intermediate to advanced
712 pages
24h 45m
English
O'Reilly Media, Inc.

UTF-8

UTF-8 is a variable-length encoding of Unicode. Characters 0 through 127, that is, the ASCII character set, are encoded in one byte each, exactly as they would be in ASCII. In ASCII, the byte with value 65 represents the letter A. In UTF-8, the byte with the value 65 also represents the letter A. There is a one-to-one identity mapping from ASCII characters to UTF-8 bytes. Thus, pure ASCII files are also acceptable UTF-8 files.
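
The identity mapping described above can be checked with a short Python sketch. The sample string is an arbitrary illustration, not something from the book; any pure ASCII text should behave the same way.

```python
# Encode the same pure-ASCII text with both codecs and compare the bytes.
text = "A plain ASCII string"  # illustrative sample, assumed for this sketch

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For characters 0 through 127 the two encodings produce identical bytes.
assert ascii_bytes == utf8_bytes

# The letter A is the single byte 65 in UTF-8, just as in ASCII.
letter_a = "A".encode("utf-8")
print(list(letter_a))  # [65]

# Going the other way, ASCII bytes decode cleanly as UTF-8.
round_trip = ascii_bytes.decode("utf-8")
assert round_trip == text
```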

UTF-8 represents the characters from 128 to 2,047, a range that covers the most common non-ideographic scripts, in two bytes each. Characters from 2,048 to 65,535, mostly from Chinese, Japanese, and Korean, are represented in three bytes each. Characters with code points above 65,535 are represented in four bytes each. For a file that’s mostly Latin text, in which nearly every character falls in the ASCII range, UTF-8 effectively halves the file size from what it would be in UCS-2, which spends two bytes on every character. However, for a file that’s primarily Japanese, Chinese, Korean, or one of the languages of the Indian subcontinent, the file size can grow by 50%, because each character takes three bytes instead of two. For most other living languages, the file size is close to the same as it would be in UCS-2.
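
The byte counts and size comparisons above can be sketched in Python. The helper names `utf8_length` and `sizes` are hypothetical, chosen for this illustration. Python has no codec named UCS-2, so the sketch assumes `utf-16-be` as a stand-in; the two agree for characters up to 65,535, which is all the size comparison uses.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses for one code point, per the ranges in the text."""
    if code_point < 0x80:      # 0 through 127: one byte (ASCII)
        return 1
    if code_point < 0x800:     # 128 through 2,047: two bytes
        return 2
    if code_point < 0x10000:   # 2,048 through 65,535: three bytes
        return 3
    return 4                   # above 65,535: four bytes


# Cross-check the ranges against Python's own encoder at the boundaries.
samples = [0x41, 0xE9, 0x7FF, 0x800, 0x65E5, 0xFFFF, 0x10000, 0x1F600]
for cp in samples:
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))


def sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 size, two-bytes-per-character size) in bytes."""
    # Assumption: utf-16-be stands in for UCS-2 (identical up to U+FFFF).
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


english = "Hello, world"            # ASCII only
japanese = "\u65e5\u672c\u8a9e"     # three CJK characters
greek = "\u03b1\u03b2\u03b3"        # three Greek letters (two-byte range)

print(sizes(english))   # (12, 24): UTF-8 is half the size
print(sizes(japanese))  # (9, 6):   UTF-8 is 50% larger
print(sizes(greek))     # (6, 6):   about the same
```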

UTF-8 is probably the most broadly supported encoding of Unicode. For instance, it’s how Java .class files store strings (in a slightly modified form), it’s the native encoding of the BeOS, and it’s the default encoding an XML processor assumes unless told otherwise by a byte-order mark or an encoding declaration. Chances are pretty good that if a program tells you it’s saving Unicode, it’s really saving UTF-8.
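
The XML default can be observed with Python's standard `xml.etree.ElementTree` module, which is built on the Expat parser. This is a minimal sketch; the `greeting` element and its content are made up for illustration.

```python
import xml.etree.ElementTree as ET

content = "caf\u00e9 \u65e5\u672c\u8a9e"  # "café" plus three CJK characters

# No byte-order mark and no encoding declaration: the parser assumes UTF-8.
utf8_doc = f"<greeting>{content}</greeting>".encode("utf-8")
root = ET.fromstring(utf8_doc)
assert root.text == content

# The same kind of document saved as Latin-1 without a declaration is
# rejected, because the byte 0xE9 is not valid UTF-8 here.
latin1_body = "<greeting>caf\u00e9</greeting>".encode("latin-1")
try:
    ET.fromstring(latin1_body)
    rejected = False
except ET.ParseError:
    rejected = True
assert rejected

# An encoding declaration tells the parser otherwise, and parsing succeeds.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_body
declared_text = ET.fromstring(declared).text
assert declared_text == "caf\u00e9"
```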


ISBN: 0596007647