applicable syntax. For example, the rules of a programming language might restrict the
character repertoire in identifier names to letters, digits, and one or two other charac-
ters. On the other hand, the underscore (low line) character _ is often usable in names,
and it normally works reliably.
The Misnomer “8-bit ASCII”
The phrase “8-bit ASCII” is used surprisingly often. It follows from the discussion in
the previous section that in reality ASCII is strictly and unambiguously a 7-bit code in
the sense that all code positions are in the range 0–127. It can be, and it usually is,
represented using 8-bit bytes, but with the most significant bit either always zero or used
for other purposes, so that it is not part of the encoded form of a character.
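As a minimal sketch of the 7-bit property, the following Python snippet checks that the most significant bit of each byte is zero. The helper name is_seven_bit and the sample strings are assumptions made for illustration only.

```python
def is_seven_bit(data: bytes) -> bool:
    """Return True if the most significant bit of every byte is zero."""
    return all(byte & 0x80 == 0 for byte in data)


# Hypothetical sample data, chosen only for this illustration.
ascii_bytes = "name_1".encode("ascii")   # one byte per ASCII character
high_byte = bytes([0xE9])                # 233 lies outside the range 0-127

print(is_seven_bit(ascii_bytes))  # True
print(is_seven_bit(high_byte))    # False
print(max(ascii_bytes) <= 127)    # True
```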
The misnomer “8-bit ASCII” most often denotes windows-1252, the 8-bit code defined
by Microsoft for use in the Western world. More generally, “8-bit ASCII” is used to refer
to various character codes that are extensions of ASCII and mutually more or less
incompatible. The character repertoire in such a code contains ASCII as a subset, the
code numbers are in the range 0–255, and the code numbers of ASCII characters equal
their ASCII codes.
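The following Python sketch illustrates both halves of that description: the extensions agree on the ASCII range but disagree above it. The three codecs compared (cp1252, cp437, and mac_roman, as named in Python's standard library) and the helper decode_in are choices made for this example; they are just some of the codes that might be called “8-bit ASCII.”

```python
def decode_in(byte_value: int, encodings):
    """Decode a single byte under each encoding; return {encoding: character}."""
    return {name: bytes([byte_value]).decode(name) for name in encodings}


# Assumed sample of 8-bit extensions of ASCII (Python codec names).
ENCODINGS = ["cp1252", "cp437", "mac_roman"]

low = decode_in(0x41, ENCODINGS)    # a code number inside the ASCII range
high = decode_in(0x80, ENCODINGS)   # a code number outside the ASCII range

print(low)    # the same letter, A, under every encoding
print(high)   # a different character under each encoding
```

Byte 0x41 decodes to the letter A in all three codes, whereas byte 0x80 is the euro sign in windows-1252 and a different accented letter in each of the other two.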
ISO 8859 Codes
ISO 8859—or more formally, ISO/IEC 8859—is a family of character code standards.
They were largely developed by Ecma, which distributes ECMA standards that are
equivalent to ISO 8859 standards. ISO 8859 standards are largely oriented toward lan-
guages of European origin.
ISO 8859 codes are widely used on different platforms and in different contexts. For
example, on the Web, ISO 8859-1 was long treated as the default encoding. On Win-
dows, ISO 8859 as such is not used that much, but the corresponding, somewhat ex-
tended Windows encodings are common. In Unix and Linux, ISO 8859 is very com-
mon.
Each ISO 8859 standard tries to address the needs of one or more specific languages
and cultural environments, within the fairly narrow framework of an 8-bit structure. This
means that in most cases, you cannot represent multilingual text using any single ISO
8859 encoding.
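To make the limitation concrete, here is a small Python sketch that tests which ISO 8859 encodings can represent a given string. The helper names and the sample strings are assumptions for illustration, and the codec names are those of Python's standard library, which has no codec for the unassigned part 12.

```python
def encodable_in(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# ISO 8859 parts available as Python codecs (part 12 was never published).
ISO_8859_PARTS = [n for n in range(1, 17) if n != 12]


def iso_8859_candidates(text: str):
    """List the ISO 8859 part numbers that can represent the whole text."""
    return [n for n in ISO_8859_PARTS if encodable_in(text, f"iso8859_{n}")]


greek = "\u03b1\u03b2\u03b3"       # Greek letters alpha, beta, gamma
cyrillic = "\u0430\u0431\u0432"    # Cyrillic letters a, be, ve
mixed = greek + " " + cyrillic

print(iso_8859_candidates(greek))     # includes 7 (Greek)
print(iso_8859_candidates(cyrillic))  # includes 5 (Cyrillic)
print(iso_8859_candidates(mixed))     # empty: no single part covers both
```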
ISO 8859-1 (ISO Latin 1)
The international standard ISO 8859-1 defines a character repertoire identified as Latin
alphabet No. 1, commonly called ISO Latin 1, as well as a character code for it. The
repertoire contains the ASCII repertoire as a subset, and the code numbers for those
characters are the same as in ASCII. The standard also specifies an encoding, which is
similar to that of ASCII: each code number is presented simply as one octet.
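A brief Python sketch of this encoding follows. It relies on the fact that the code numbers of ISO 8859-1 coincide with the first 256 Unicode code points, which is what Python's ord() returns; the sample word is an assumption chosen for illustration.

```python
# Hypothetical sample text: "cafe" with an acute accent on the final letter.
text = "caf\u00e9"

encoded = text.encode("iso8859_1")

# One octet per character: the byte count equals the character count.
print(len(encoded) == len(text))                 # True

# Each octet is simply the code number of the character.
print(list(encoded) == [ord(c) for c in text])   # True

# ASCII characters keep their ASCII code numbers in ISO 8859-1.
print("caf".encode("ascii") == encoded[:3])      # True
```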
124 | Chapter 3:Character Sets and Encodings
