case be displayed each as two or more characters, which have no direct relationship
with the real character in the data. This is because consecutive octets would be interpreted
as each indicating a character, instead of being treated according to the encoding
as a unit.
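The effect can be reproduced in a few lines. The following is a minimal sketch in Python, not taken from the original text; the sample string and the choice of windows-1252 as the wrongly assumed single-octet encoding are illustrative assumptions.

```python
# Encode a short text as UTF-8, then (wrongly) decode it one octet at a time
# using the single-octet encoding windows-1252 (Python codec name: cp1252).
text = "é€"
octets = text.encode("utf-8")        # 2 octets for é, 3 octets for €

misread = octets.decode("cp1252")    # every octet becomes a character of its own
correct = octets.decode("utf-8")     # octet sequences are treated as units

print(len(text), len(octets), len(misread))   # 2 5 5
print(misread)                                # Ã©â‚¬
print(correct)                                # é€
```

Two characters in the data turn into five displayed characters, none of which has anything to do with the original ones.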
The “Character Set” Confusion
Character encodings are often called character sets, and the abbreviation charset is used
in Internet protocols to denote a character encoding. This is confusing because people
often understand “set” as “repertoire.” However, character set means a very specific
internal representation of characters, and for the same repertoire, several different
“character sets” can be used. A character set implicitly defines a repertoire, though: the
collection of characters that can be represented using the character set.
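As a hedged illustration of both points, the sketch below encodes one character in several encodings and then checks whether a character belongs to the repertoire that an encoding implies. The helper name in_repertoire is hypothetical, and the encodings listed are just examples.

```python
ch = "é"   # U+00E9, LATIN SMALL LETTER E WITH ACUTE

# One character, several internal representations.
for name in ("latin-1", "cp1252", "utf-8", "utf-16-be"):
    print(name, ch.encode(name).hex())
# latin-1 e9
# cp1252 e9
# utf-8 c3a9
# utf-16-be 00e9

def in_repertoire(char, encoding):
    """Return True if char can be represented in the given encoding."""
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(in_repertoire("€", "latin-1"))   # False: ISO-8859-1 has no euro sign
print(in_repertoire("€", "cp1252"))    # True
```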
It is advisable to avoid the phrase “character set” when possible. The term character
code can be used instead when referring to a collection of characters and their code
numbers. The term character encoding is suitable when referring to a particular representation.
For example, the word “ASCII” can mean a certain collection of characters, or that
collection along with their code numbers 0–127 as assigned in the ASCII standard, or
even more concretely, those code numbers (and hence the characters) represented using
an 8-bit byte for each character.
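The three senses can be told apart in a short Python sketch. The variable names are illustrative only, and the word "ASCII" itself serves as the sample data.

```python
word = "ASCII"

# 1. Characters: members of the ASCII repertoire.
characters = list(word)

# 2. Characters with their code numbers, all in the range 0-127.
code_numbers = [ord(c) for c in word]

# 3. Code numbers represented using one 8-bit byte per character.
octets = word.encode("ascii")
bits = [format(b, "08b") for b in octets]

print(characters)     # ['A', 'S', 'C', 'I', 'I']
print(code_numbers)   # [65, 83, 67, 73, 73]
print(bits[0])        # 01000001
```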
Working with Encodings
When you use characters on a computer, some software will internally encode them in
binary format. Most users never need to know the details of this, still less need to
actually handle the encoding process, but it is essential to know that there are different
encodings, with different properties. In transferring data between applications and
computers, you may need to change the encoding or select a suitable encoding.
Selecting the Encoding When Saving
Text editors and many other programs typically have a File menu, with a Save function
for storing data onto disk. Normally, this function uses the file format and the character
encoding that is typical of the program. However, there is usually also a Save As function,
which lets the user select the format and encoding. This function is often used
because it lets you save an edited document under a different filename.
The Save As function is often the simplest way to convert between different encodings
(and file formats). You simply open a file and save it differently. For example, suppose
you have used Notepad to create a plain text file. If you use, for example, an English
version of Windows, the default encoding that Notepad uses is Windows Latin 1. Now
suppose that a friend has asked you to send your text in the UTF-8 encoding for some
reason. You simply open your file in Notepad, select File → Save As, and then choose
the UTF-8 encoding from the menu of encodings, as shown in Figure 1-13. The figure illustrates
the three basic things you can (and need to) specify in Save As dialogs: the filename,
the file format, and the encoding.
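The same conversion can also be scripted. The following is a minimal sketch in Python of what opening a file and saving it in another encoding amounts to; it is not a description of how Notepad works internally. The function name convert_encoding, the filenames, and the sample text are hypothetical.

```python
from pathlib import Path
import tempfile

def convert_encoding(src, dst, src_encoding="cp1252", dst_encoding="utf-8"):
    """Read src as src_encoding and write the same text to dst as dst_encoding."""
    text = Path(src).read_text(encoding=src_encoding)
    Path(dst).write_text(text, encoding=dst_encoding)

# Demonstration on a temporary file, so nothing on disk is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "note-ansi.txt"
    converted = Path(tmp) / "note-utf8.txt"
    original.write_bytes("naïve café".encode("cp1252"))

    convert_encoding(original, converted)

    print(original.read_bytes())    # b'na\xefve caf\xe9'
    print(converted.read_bytes())   # b'na\xc3\xafve caf\xc3\xa9'
```

As in the dialog, three things are specified: the filename, the format (here plain text), and the encoding. The source encoding must be known or guessed correctly, since the file itself does not record it.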
The list of possible encodings in a Save As dialog varies greatly, and the names of the
encodings are not always official names. For example, in Microsoft products, “ANSI”
often appears as denoting the character code that the system uses as its normal 8-bit
code, such as the Windows Latin 1 encoding, which should be called “windows-1252.”
The word “Unicode” may denote different encodings used for Unicode, typically
UTF-16. Use the UTF-8 encoding for Unicode text, unless you have a good reason for
doing otherwise.
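To make the naming issue concrete, the sketch below maps some dialog labels to Python codec names and compares the size of the same text in each. The LABELS table is a hypothetical illustration: what "ANSI" means depends on the system, and windows-1252 is assumed here only because it is the example used above.

```python
import codecs

# Hypothetical mapping from Save As dialog labels to Python codec names.
LABELS = {
    "ANSI": "windows-1252",   # assumption: a Western European Windows system
    "Unicode": "utf-16",      # typically UTF-16 in Microsoft products
    "UTF-8": "utf-8",
}

sample = "Résumé"
for label, codec in LABELS.items():
    info = codecs.lookup(codec)   # raises LookupError for unknown names
    print(label, info.name, len(sample.encode(codec)))
# ANSI cp1252 6
# Unicode utf-16 14
# UTF-8 utf-8 8
```

Python's "utf-16" codec writes a two-octet byte order mark before the text, which accounts for 2 of the 14 octets.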
When using a text-processing program, the situation is usually different. There is a file
format menu in the Save As dialog but often no encoding menu. The reason is that in
text processing, the overall format is crucial, and the encoding is often coupled with
the format.
In Microsoft Word, for example, the list of formats may contain alternatives as shown
in Figure 1-14, with options corresponding to the internal formats of different programs
and some plain text formats. Here, too, it may require some guesswork or study to
identify what the options really mean. On Windows systems, “*.txt” is associated with
several different encodings, and “*.ans” refers to ANSI (e.g., windows-1252). The notation
“*.asc” may suggest ASCII encoding, but in fact it refers to an old DOS encoding,
a code page, which is a single-octet encoding and may vary from one system to another.
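The practical consequence is that one and the same octet means different characters depending on which single-octet encoding is assumed. In the hedged sketch below, code pages 437 and 850 stand in for "an old DOS encoding"; which code page a given system actually uses is not specified here.

```python
octet = b"\xe9"

# The same octet decoded under three single-octet encodings.
for codec in ("cp1252", "cp437", "cp850"):
    print(codec, octet.decode(codec))
# cp1252 é
# cp437 Θ
# cp850 Ú

# Octets in the ASCII range 0-127 are not affected by the choice.
plain = b"plain text"
print(plain.decode("cp1252") == plain.decode("cp437") == plain.decode("ascii"))   # True
```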
Figure 1-13. An extract from a Save As dialog in Notepad
Figure 1-14. An extract from a Save As dialog in Microsoft Word
