book

Unicode Explained

Name: Unicode Explained
Author: Jukka K. Korpela
ISBN: 9780596101213

June 2006

Beginner

688 pages

26h 18m

English

O'Reilly Media, Inc.

Table of Contents
Preface
Audience
Assumptions and Approach
Contents of This Book
Self-Assessment Test
Conventions Used in This Book
Using Code Examples
Safari® Enabled
How to Contact Us
Acknowledgments
Part I. Working with Characters
Chapter 1. Characters as Data
Introduction to Characters and Unicode
Why Unicode?
Unicode Can Be Easy
What’s in a Character?
Why Do We Need to Know About Characters?
Characters as Units of Text
Characters as abstractions
Variation of appearance or different characters?
Variation in shape turned into a character difference
Characters and “abstract characters”
Characters and other units of text
Characters Versus Images
Processing of Characters
Giving Identity to Characters
Definitions of characters in standards
Annotations used to emphasize differences
The representative glyphs
The number and the Unicode name as identifiers
Unicode is more explicit
Spelling of names and the U+nnnn convention
Unicode Definitions of Characters
Definitions of Characters Elsewhere
What’s in a Name?
Should We Be Strict About the Meanings of Characters?
Ambiguity Among Characters
How Do I Find My Character?
Which Characters Does Each Language Use?
Variation of Writing Systems
Glyphs and Fonts
Allowed Variation of Glyphs
Fonts and Their Properties
Font Variation Versus Characters
Fonts in Implementations
Failures to Display a Character
Font Embedding
Definitions of Character Repertoires
Formally Defined Repertoires
Practical Repertoires
Numbering Characters
Hexadecimal Notation
Numbers as Indexes
Making Use of Character Numbers
Encoding Characters as Octet Sequences
Plain Text and Other Formats for Text
Bytes and Octets
Character Encodings
Single-Octet Encodings
Multi-Octet Encodings
The “Character Set” Confusion
Working with Encodings
Selecting the Encoding When Saving
How Encodings Should Be Detected
Setting the Encoding Manually
Sending Unicode Email
Viewing Web Pages in Different Encodings
Common Confusion: Encoding Versus Language
Working with Fonts
Installing Additional Support
Font Support in Web Browsers
Font Substitution: a Solution and a Problem
Printer Fonts
Finding Fonts
Fonts in Web Authoring
The fallback problem
Effects of browser settings
Summaries
Summary of Definitions
Summary of Concept Levels
Chapter 2. Writing Characters
Method Varieties
A Simple Way or a Universal Way?
An Overview of Methods
Choosing Fonts
Keyboard Variation and Settings
Typing Characters—Just Pressing a Key?
Keyboard Limitations and Variation
Auxiliary Keys
Dead Keys
Virtual Keyboards
A Keyboard on Screen
Virtual Keys for Character Input in Forms
Program Commands
Copying via the Clipboard
Menu Commands
Insertion menu in Thunderbird
Symbol (character) insertion menu in MS Word
The Show Formatting (Show ¶) tool
Methods Using the Alt Key on Windows
The Alt-0n method
The code page–specific Alt-n method
The Unicode-based Alt-n method
The Alt-X method
The Alt-+n method
Ctrl-Q and Other Methods in Emacs
Character Maps
Character Map in MS Word
Windows Character Map
Replacements on the Fly
Default Replacements in MS Word
Viewing and changing the rules
Language dependency
Autoformatting in MS Word
Example: quotation marks
Defining Your Own Shortcuts
Special Techniques
Combining Diacritic Marks
Spacing Between Characters
Inputting East Asian Characters
Escape Sequences
Examples of Escape Notations
CSS
PostScript
RTF
TeX
Notations for Human Readers
Explanations to Human Readers
HTML, SGML, and XML Notations for Characters
Character and entity references in web authoring
The role and use of character and entity references
Definition: character reference
Definition: entity reference
Entity references in HTML
Character entities in XML
Specialized Editors
BabelPad
UniPad
Exercise
Chapter 3. Character Sets and Encodings
Good Old ASCII
American Origin
The ASCII Repertoire
The ASCII Encoding
ISO 646 and National Variants of ASCII
Subsets of ASCII for Safety
The Misnomer “8-bit ASCII”
ISO 8859 Codes
ISO 8859-1 (ISO Latin 1)
Names of Encodings
Other ISO 8859 Codes
Windows Latin 1 and Other Windows Codes
Windows Latin 1
Other Windows Character Codes
Other 8-bit Codes
DOS Code Pages
Mac Encodings
EBCDIC
The Cyrillic KOI8 Encodings
Ad Hoc “8-bit Codes” Defined by Fonts
Unicode and UTF-8
The Conceptual Model: Levels of Coding
The Internet (IAB) model
The four-level Unicode model
Transfer Encoding Syntax
Encodings for Unicode
Saving as Unicode
Encodings for East Asian Language
Vietnamese 8-bit Codes
Encodings for Chinese
Encodings for Japanese
Encodings for Korean
Converters and Transcoding
Transcoding Tools
Free Recode
The iconv Converter
Using Character Codes
Repertoire Requirements
Encodings and the Internet
Encoding in Offline Data
Common Choices of Encoding
Sources of Information
Exercises
Testing encodings
“Deciphering” text
Part II. A Systematic Look at Unicode
Chapter 4. The Structure of Unicode
Design Principles
Goals: Universality, Efficiency, Unambiguity
The 10 Design Principles
Unification
Conformance Requirements
Unicode and ISO 10646
Why Go Beyond 16 Bits?
Does Unicode Contain All Characters in the World?
Identity of Characters
Characters as elementary units of text
Unicode numbers
Unicode names of characters
Using the names
Characters used in character names
Case of letters in names
Notational issues
UCS Sequence Identifiers (USI) and named character sequences
Versions of Unicode
Coding Space
Planes
Allocation Areas
Rows and Blocks
Unicode as Extension of ISO-8859-1
Internal Structure of Blocks
Noncharacter Code Points
Classification of Code Points
Surrogates
Unassigned Code Points and Private Use
Unicode Terms
Deprecated and Obsolete Characters
Digraphs
Text Elements
Unicode Strings
Guide to the Unicode Standard
Accessing the Unicode Versions
What Material Constitutes the Unicode Standard?
Viewing the Standard Online
The Chapters of the Standard
How Do I Find All the Information About a Character?
The Zvon database
Using Unibook
Using the Unicode standard
Additional Reference Material
Unicode and Fonts
Unicode as Plain Text
Font Variants as Characters
Variation Selectors
Affecting Font Usage
Ligatures
Vowels as Marks
Operations on Glyphs
Unicode Versus Font Tricks
Criticism of Unicode
Overall Complexity
Inefficiency?
Is It Reasonable to Require Support for 100,000 Characters?
Cultural Bias
Lack of precomposed characters
East Asian languages
Favoring UTF-8
Excessive Unification
Semantic Disambiguation Frowned Upon
Misleading Names of Characters
Concepts and Definitions
Illogical Division into Blocks
Questions and Answers
Where Can I Find Tools for Using Unicode?
Why Do People Call Unicode a 16-Bit Code?
How Can I Have a Character Added to Unicode?
How Can I Check That I’ve Understood the Principles?
Chapter 5. Properties of Characters
Character Classification
The Purposes of Classification
General Category Values
Use of General Category in Programming
An Overview of Properties
Summary of Properties
Normative and Informative Properties
Structure of Database Files
Compositions and Decompositions
The Impact of Diacritic Marks
Precomposed and decomposed form
Combining marks: powerful, but still poorly supported
Features that are not diacritic marks
Compatibility Mappings and Canonical Mappings
Difference between canonical and compatibility mappings
Canonical and compatibility equivalence
The meaning of canonical mapping
Differences in glyphs for equivalent characters
How the mappings are defined
Canonical Decomposition and Compatibility Decomposition
Canonical decomposition
Canonical Ordering Behavior
Canonical equivalence
Compatibility decomposition and equivalence
Canonical and compatibility decomposable characters
Compatibility Characters
Compatibility Decomposable Characters
Avoiding Compatibility Characters
Compatibility Characters for Ligatures
Normalization
Normalization Versus Folding
Overview of Normalization Forms
Use of normalization forms
Invariance of Basic Latin characters
Normalization Form C
Normalization Form KC
Composition Exclusions
Definition of Compatibility Decomposable Character
W3C Normalization
Case Properties
Recognizing Uppercase, Lowercase, and Titlecase
Case Mappings
Case Folding in Unicode
Viewing the Mappings
Character Case Mappings Versus Visual Mappings
Collation and Sorting
Sorting Characters Versus Sorting Strings
Collation and Unicode
Layered Model of Collation
Code Point Order Versus Collating Order
Code point order is unnatural
Using code point order as a fallback in definitions
Code point order sorting for technical reasons
Problems of legacy software
Unicode Collation Algorithm
Text Boundaries
Directionality
Writing Direction of Text
Bidirectionality
Directionality and Character Codes
Directionality of Characters
Control Characters for Directionality
Bidi Mirroring
Directionality in HTML and CSS
Directionality of Formatting
Line-Breaking Properties
Conformance Criteria
Characters for Special Control over Line Breaking
Preventing line breaks
Suggesting line break opportunities
Limited support
Principles of Line Breaking
Emergency Breaks
Unicode Line-Breaking Rules
Values of the LineBreak property
The format of LineBreak.txt
The formal rules
Applying the rules
Pair table implementation
Tailoring
Some background and criticism
Unicode Conformance Requirements
An Informal Summary
Notations and Terms Used in the Requirements
Unassigned Code Points
Interpretation
Modification
Character Encoding Forms
Character Encoding Schemes
Bidirectional Text
Normalization Forms
Normative References
Unicode Algorithms
Default Casing Operations
Unicode Standard Annexes
Effects on Choosing Characters
Example: Some Mathematical Operators
Chapter 6. Unicode Encodings
Unicode Encodings in General
UTF-32 and UCS-4
UTF-16 and UCS-2
UCS-2 Is BMP Only
Surrogate Pairs in UTF-16
Some Properties of UTF-16
UTF-8
UTF-8 Encoding Algorithm
UTF-8 Versus ISO-8859-1
Some Properties of UTF-8
Byte Order
Conversions Between Unicode Encodings
Other Encodings
SCSU Compression
BOCU-1 Compression
CESU-8
Modified UTF-8
Base64 Encoding of Data
Quoted Printable Encoding
Uuencode
UTF-7
UTF-1
UTF-EBCDIC
GB 18030, “Chinese Unicode”
Punycode, Encoding for Domain Names
URL Encoding
Introduction: URL Encoding for form data
The original URL Encoding
To encode or not to encode?
Generalized URL Encoding
Modern, UTF-8-based URL Encoding
Auto-Detecting the Encoding
Choosing an Encoding
Storage Requirements
Efficiency of Processing
Specific Limitations
Favoring UTF-8 on the Internet
Part III. Advanced Unicode Topics
Chapter 7. Characters and Languages
Writing Systems and IT
Internationalization (i18n) and Related Issues
Aspects of Writing and Their IT Impact
Writing direction
What does a language setting really set?
Setting the Language in Word Processing
Automatic operations on punctuation
Spelling and grammar checks
Determining the language of text
Exercise
Setting Language Preferences in Browsers
Script = Writing System
Categories of Scripts
Need for script information
Scripts and spoofing
Codes and names for scripts
The Script property: the script of a character
Character Requirements of Languages
The Impact of Character Repertoire
Languages and Characters
What constitutes a character?
Does Unicode support all languages?
Attempts at technical definitions of character requirements
Which characters does a language need?
Language Coverage of ISO Latin Alphabets
Example: Spanish
Example: French
Transliteration and Transcription
Solutions to Readers, Problems to Implementers
Transliteration Converts Letters
Transcription Converts Sounds
Phonetic Transcription in IPA
Transcription Inside a Script?
Language Metadata
Need for Language Information
Methods of Determining Language
Language Markup
Attributes for language in HTML and XML
The impact of language markup
Granularity of markup
Language Codes
The confusion of codes
ISO 639
Language codes on the Internet
Language codes and user interfaces
Language Tags in Unicode
Languages and Fonts
Example: Shape of the Acute Accent
Chinese Characters and Language Information
Chapter 8. Character Usage
Basics of Character Usage
Orthography Sets Rules for Writing
Typography Is About Appearance
Liberal in What You Accept
Conservative in What You Send
ASCII (Basic Latin)
Names of ASCII Characters
Alphanumeric Characters
Parentheses
Other Graphic Characters
Ampersand & (U+0026)
Apostrophe ' (U+0027)
Asterisk * (U+002A)
Circumflex accent ^ (U+005E)
Colon : (U+003A)
Comma , (U+002C)
Dollar sign $ (U+0024)
Commercial at @ (U+0040)
Equals sign = (U+003D)
Exclamation mark ! (U+0021)
Full stop “.” (U+002E)
Grave accent ` (U+0060)
Greater-than sign > (U+003E)
Hyphen-minus “-” (U+002D)
Less-than sign < (U+003C)
Low line _ (U+005F)
Number sign # (U+0023)
Percent sign % (U+0025)
Plus sign + (U+002B)
Question mark ? (U+003F)
Quotation mark " (U+0022)
Reverse solidus \ (U+005C)
Semicolon ; (U+003B)
Solidus / (U+002F)
Space “ ” (U+0020)
Tilde ~ (U+007E)
Vertical line | (U+007C)
ASCII Control Characters (C0 Controls)
Control characters or control codes?
Types of control characters
Visible symbols for control characters
Summary of C0 Controls
Latin-1 Supplement (ISO 8859-1)
Diacritic Marks and Letters with Them
Other Letters
Superscript Digits (¹ ² ³) and Vulgar Fractions (¼ ½ ¾)
Punctuation
Currency Symbols
Mathematical, Logical, and Physical Symbols
Specialized Characters
Other Latin Letters
Other European Alphabetic Scripts
Greek Script
Cyrillic Script
Armenian and Georgian Scripts
Diacritic Marks
Why Diacritic Marks?
Early Approaches
Coded Combinations
Combining Diacritic Marks
Variation in Appearance
Spacing Diacritic Marks
Letterlike Symbols
General Punctuation
Space Characters
Space
No-break space: use it!
Fixed-width spaces: rarely used
Adjusting spacing in other ways
Additional no-break space characters
A practical approach to thin spaces
Disallowing and allowing line breaks
Quotation Marks
Language-specific quotation marks
The apostrophe versus the single quotation mark
Hyphens and Dashes
Use of hyphens and dashes
The soft hyphen
MS Word specialties
Ellipsis
Angular brackets
Line Structure Control
Different Approaches to Line Structuring
Lines and Records
Methods of Coding Line Structure
Editors, Word Processors, and Data Transfer
Mathematical and Technical Symbols
Superscripts and Subscripts
The Number Forms Block
Roman numerals
Fractions
Characters in SI Notations
Conceptual levels of SI notations
Notes on individual characters
Letterlike symbols and the SI
Other Blocks
Spacing Modifier Letters
Currency Symbols
Phonetic Characters
Specials
Dingbats
Summary of Blocks
Chapter 9. The Character Level and Above
Levels of Text Representation and Processing
Plain Text, Rich Text, and Markup
Plain text
Rich text formats
Text with markup
Quasi-markup
Conversion to plain text
Example: Nonbreaking Hyphen
Example: Formatting in Word Processing
Example: HTML Markup and CSS
Linear Text Versus Mathematical Notations
Unicode and Mathematics
Characters Outside the Repertoire
Different workarounds
Using a character versus using a small image
Button-like symbols
Using an image for esthetic reasons
Selecting the Appropriate Level of Expression
Subscripts and Superscripts
Visual appearance of subscripts and superscripts
Replacement notations for superscripts and subscripts
Suggested policy on subscripting and superscripting
Characters and Accessibility
Characters in non-visual presentation
Understandability of characters
Explaining characters
Characters and Markup
Markup and Styling
Document-wide Versus Local Decisions
Unicode Versus Markup
Differences between markup and plain text
Characters that should not be used in marked-up text
Formatting characters that may be used in marked-up text
Characters with compatibility mappings
Preventing Line Breaks
Breaking the Flow of Text
Why Not Markup in Unicode?
Media Types for Text
The Type text
The Character Encoding
The text Type Versus the application Type
Subtypes of text
Chapter 10. Characters in Internet Protocols
Information About Encoding
What Happens Without Information About Encoding
Approaches to Specifying the Encoding
Practical Recommendations
Looking at the Headers
Characters in MIME
Media Types
Character Encoding (“charset”) Information
MIME Headers
Internet message format and MIME
Headers related to characters
Headers for transfer encoding
The Quoted-Printable (QP) transfer encoding
How MIME should work
Troubleshooting Examples
Character Encoding on the Web
Headers in HTTP
Specifying the encoding in HTTP headers
Which encodings can be used?
HTTP versus HTML
Checking the HTTP headers
Server configuration
Using a meta tag
Resolution of conflicts
The effect of XHTML
Heuristics of detecting encoding
Which encoding should I use?
Avoiding the encoding problem
The “Unicode Encoded” logo
Content Negotiation and Multilingual Sites
Introduction to Multilingual Web Sites
Parallel versions in different languages
Pages with a mix of languages
Language negotiation: automatic selection of version
Language versus country
Links to Language Versions
Writing Link Texts
Language Negotiation in the HTTP Protocol
Language Negotiation: the Server Side
Using Multiviews
Using type-map
When negotiation fails
Language Negotiation: the Browser Side
Notes on Multilingual Sites
Producing the translations
Translation or different content?
Indicating what is available in each language
Naming the versions
Language preferences and JavaScript
Making use of language preferences in CGI scripts
Types of Negotiation
Characters in Protocol Headers
The Signature Convention May Help
The Q Encoding
The B Encoding
Summary: Dealing with Non-ASCII Characters in Headers
Characters in Domain Names and URLs
Internationalized Domain Names (IDN)
The IDNA implementation
Security threats
Characters in URLs
Chapter 11. Characters in Programming
Characters in Computer Languages
Common Escape Notations
Characters in Markup Languages and CSS
Characters in HTML and XML
Problems in generating markup programmatically
Problems in using scripts inside HTML
Characters in CSS
Identifiers in CSS
Character and String Data
Constructs and Principles of Processing Characters
The FORTRAN Model: Hollerith Data
The C model
The character data type
Strings as arrays
8-bit characters and sign extension
The EOF indicator
The zero byte (NUL byte) convention
The null pointer
Confusion around NUL, NULL, and relatives
C and Unicode
Unicode with 8-bit Quantities?
Wide Characters
Win32 APIs
Multibyte Character Sets (MBCS) Versus Unicode
The Perl Model
Strings and characters in Perl
The catenation operator “.”
In Perl, double quotes mean evaluation
Notations for Unicode characters
Using properties of characters
ECMAScript (JavaScript)
String oriented
The ECMAScript standard
UTF-16 implied
The \u escape notation
PHP: Mostly Just 8 Bits
Java: Rich Support to Unicode
Characters, strings, objects, and methods
Encodings and escape notations
16-bit characters
Java identifiers
Library routines
The Preparedness Principle
Being Prepared for Amount of Data
Being Prepared for Content of Data
Methods of handling unexpected characters
Displaying unrecognized or undisplayable code points
Default ignorable code points
Table-Driven Versus Property-Driven Processing
Naïve Processing
Character Input and Output
Character-Oriented and Line-Oriented Processing
Perl I/O
Java File I/O
Buttons for Character Input
Processing Form Data
Decoding Form Data
Recognizing the Encoding
Avoid Oddities by Using UTF-8
Using UTF-8
Submitting a File
Identifiers, Patterns, and Regular Expressions
Identifiers
Identifiers: internal or external?
Traditional format of identifiers
Case sensitivity
The Unicode approach to identifiers
Patterns
Identifier and Pattern Characters
Identifier Syntax
Normalization
Case folding
Identifiers (names) in XML
Alternative Identifier Syntax
Pattern Syntax
Regular Expressions
Regexp use in programming
Regexp use by end users
Unicode regular expressions
Basic Unicode support
Examples
International Components for Unicode (ICU)
Using Locales
The Locale Concept
CLDR
CLDR versus Unix/Linux/POSIX locale concept
Using CLDR
Internationalization and Localization
CLDR Description and Data
Problems with Aspects of Localization
Appendix. Tables for Writing Characters
Additional Notes
Coverage
Ordering
Specific Notes
Mapping from Symbol Font to Unicode
Index

Content preview from Unicode Explained

Some Properties of UTF-8

Due to the algorithm, the octets appearing in UTF-8 are limited to certain ranges, as shown in Table 6-2. In particular, octets C0, C1, and F5 through FF never appear in UTF-8. Other octets may appear in specific contexts only. This means that if you have a large file that is not, in fact, character data in UTF-8 and you try to read it as UTF-8, it is very likely that errors will be signaled.

Table 6-2. Octet ranges in UTF-8

Code range            Octet 1   Octet 2   Octet 3   Octet 4
U+0000..U+007F        00..7F
U+0080..U+07FF        C2..DF    80..BF
U+0800..U+0FFF        E0        A0..BF    80..BF
U+1000..U+CFFF        E1..EC    80..BF    80..BF
U+D000..U+D7FF        ED        80..9F    80..BF
U+E000..U+FFFF        EE..EF    80..BF    80..BF
U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
U+100000..U+10FFFF    F4        80..8F    80..BF    80..BF
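
The octet ranges in Table 6-2 can be turned almost directly into a checker. The following is a minimal sketch in Python, not code from the book; the names `_ROWS` and `is_well_formed_utf8` are made up for this illustration. It accepts a byte string only if every sequence in it matches one row of the table.

```python
# Multi-octet rows of Table 6-2, written as
# (octet 1 low, octet 1 high, octet 2 low, octet 2 high, sequence length).
# Octets 3 and 4, when present, are always in the range 80..BF.
_ROWS = [
    (0xC2, 0xDF, 0x80, 0xBF, 2),
    (0xE0, 0xE0, 0xA0, 0xBF, 3),
    (0xE1, 0xEC, 0x80, 0xBF, 3),
    (0xED, 0xED, 0x80, 0x9F, 3),
    (0xEE, 0xEF, 0x80, 0xBF, 3),
    (0xF0, 0xF0, 0x90, 0xBF, 4),
    (0xF1, 0xF3, 0x80, 0xBF, 4),
    (0xF4, 0xF4, 0x80, 0x8F, 4),
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data consists only of sequences allowed by Table 6-2."""
    i = 0
    while i < len(data):
        first = data[i]
        if first <= 0x7F:  # single-octet row: 00..7F
            i += 1
            continue
        # Find the table row whose first-octet range contains this octet.
        row = next((r for r in _ROWS if r[0] <= first <= r[1]), None)
        if row is None:  # 80..BF as first octet, C0, C1, or F5..FF
            return False
        _, _, low2, high2, length = row
        seq = data[i:i + length]
        if len(seq) < length:  # sequence cut off at the end of the data
            return False
        if not low2 <= seq[1] <= high2:
            return False
        if any(not 0x80 <= octet <= 0xBF for octet in seq[2:]):
            return False
        i += length
    return True


print(is_well_formed_utf8("€".encode("utf-8")))  # True
print(is_well_formed_utf8(b"\xc0\x80"))          # False: C0 never appears
```

In practice a library decoder would normally do this job; the sketch only shows why arbitrary non-UTF-8 data is unlikely to pass, since each octet must fall into a narrow range that depends on its position.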

Similarly to UTF-16, UTF-8 does not allow direct access to the nth character of a string: because characters occupy a varying number of code units, finding the nth one requires scanning from the start. UTF-8 is robust, though: if a code unit is corrupted, other characters will still be processed correctly. The reason is that UTF-8 has been designed so that a code unit starting the representation of a character can be recognized as such, even if the preceding code unit is in error.
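
Both properties follow from one fact visible in Table 6-2: octets 80..BF occur only as the second or later octet of a sequence, so any other octet starts a character. The sketch below is an illustration in Python under that assumption; the function names `is_continuation`, `nth_char_offset`, and `next_char_start` are hypothetical, and `nth_char_offset` assumes its input is well-formed UTF-8.

```python
def is_continuation(octet: int) -> bool:
    """True for octets 80..BF, which never start a character."""
    return 0x80 <= octet <= 0xBF


def nth_char_offset(data: bytes, n: int) -> int:
    """Byte offset of the nth character (counting from 0).

    There is no way to compute the offset from n alone, so the
    function scans from the start and counts starting octets.
    """
    count = -1
    for offset, octet in enumerate(data):
        if not is_continuation(octet):
            count += 1
            if count == n:
                return offset
    raise IndexError(n)


def next_char_start(data: bytes, pos: int) -> int:
    """First offset at or after pos that can start a character.

    This is how a reader resynchronizes after a damaged code unit.
    """
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


data = "aé€z".encode("utf-8")    # octets: 61 | C3 A9 | E2 82 AC | 7A
print(nth_char_offset(data, 2))  # 3, the offset of the first octet of €

damaged = bytearray(data)
damaged[1] = 0xFF                # corrupt the first octet of é
print(next_char_start(bytes(damaged), 2))  # 3, so € is found again
# The characters after the damage still decode correctly.
print(bytes(damaged).decode("utf-8", errors="replace"))
```

The cost of `nth_char_offset` grows with the position requested, which is the practical meaning of "no direct access"; code that needs frequent random access often decodes to an array of code points first.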

Although the authoritative definition of UTF-8 is in the Unicode standard, with content as described here, there is also a description of UTF-8 as an Internet standard, STD 63. It is currently RFC 3629, “UTF-8, ...

Publisher Resources

ISBN: 059610121X
