Unicode Explained
Unicode Explained By Jukka K. Korpela
June 2006
Pages: 678

Table of Contents

Chapter 1: Characters as Data
Computers were originally built to process numbers. Over the last few decades, they have become steadily better at handling text as well, but the transition from human scribbling and beautiful typography to bits and bytes has been complicated. Moving from a paper document to a computerized representation of that document requires understanding how the computer handles text, which in turn means learning about characters, character codes, fonts, and encodings. Unicode provides a set of solutions for some of these problems, while retaining the flexibility to make text look the way we feel it should.
Introduction to Characters and Unicode
Computer programs use two basic data types in most of their processing: characters and numbers. These basic types are combined in various ways to create strings, arrays, records, and other data structures. (Inside the computer, characters are numbers, but these numbers are handled very differently from numbers meant for calculation.)
Early computers were largely oriented toward numerical computation. However, characters were used early on in administrative data processing, where names, addresses, and other data needed to be stored and printed as strings. Text processing on computers became more common much later, when computers had become so affordable that they replaced typewriters. At present, most text documents are produced and processed using computers.
Originally, character data on computers had limited types and uses. For economic and technical reasons, the repertoire of characters was very small, not much more than the letters, digits, and basic punctuation used in normal English. This constitutes but a tiny fraction of the different characters used in the world’s writing systems: about 100 characters out of literally myriads (tens of thousands) of characters. Thus, there was a growing need to present and handle a large character repertoire on computers; Unicode is the fundamental answer to that need.
Since you are reading this book, I assume you already have sufficient motivation to learn about Unicode. Nevertheless, a short presentation follows that explains the benefits of Unicode.
Computers internally work on numbers. This means that characters need to be coded as numbers. A typical arrangement is to use numbers from 0 to 255, because that range fits into a basic unit of data storage and transfer, called a (8-bit) byte
Additional content appearing in this section has been removed.
What’s in a Character?
We use characters daily: we type them, and we read them on screen or on paper. We use text-processing programs routinely, much like people used to use typewriters, pens, or other writing tools. How could characters create problems?
If English is your native language, you are accustomed to using a small set of characters, consisting of the letters A–Z and a–z, digits 0–9, and a few punctuation characters. Most novels, newspaper articles, and memos contain no other characters. Since you seem to be able to type these characters directly on a keyboard, why should you learn more about characters and get confused? To be honest, character issues are confusing.
Suppose you use a computer only to write and edit texts in English, perhaps as a secretary or a technical editor. You still have reasons to know about characters:
  • Computer technology has caused a decline in typography, and you can make a positive impression by using correct punctuation instead of typewriter-style punctuation. If you use a text-processing program, it probably takes care of using “smart” quotation marks instead of "straight" quotes, but you need to learn how to produce dashes—like this—and how to prevent bad line breaks.
  • Normal English texts may contain special characters occasionally. Someone may spell Caesar as Cæsar, or use a word like fiancé, rôle, or garçon the French way, or use the per mille sign ‰ or the euro sign €. Michael Everson writes: “Despite unfounded but widespread belief to the contrary (based doubtless on the prevalence of ASCII), diacritics (usually French ones) are often found in naturalized English words. Examples are: à la carte, abbé, Ægean, archæology, belovèd, café, décor, détente, éclair, façade, fête, naïve, naïvety (but cf. non-naturalized naïveté), noël, œsophagus, résumé, vicuña” (
Additional content appearing in this section has been removed.
Variation of Writing Systems
The most widely used writing systems, or scripts, can be classified as follows:
Alphabetic scripts
Denote sounds with letters, though usually not in a strict one-to-one manner. Examples: Latin, Greek, and Cyrillic scripts, each of which exists in different versions.
Consonant scripts, or abjads
Basically denote consonants, leaving vowels to be inferred; however, consonant scripts may have letters for long vowels, and in some situations even short vowels are written using small signs attached to consonants. Examples: Hebrew and Arabic scripts.
Abugida scripts
These use consonant letters that imply a particular vowel after the consonant, when used in the base form. Alternatives with other vowels or without any vowel are indicated by additional marks. Many South and Southeast Asian scripts belong to this category—e.g., the Devanagari script used for many Indic languages.
Syllabic scripts
Use basically one character for each syllable. Examples: the Hiragana and Katakana scripts, used for Japanese.
Ideographic scripts
Use basically one character for one (short) word. The most widely known ideographic script is Han, often known as Chinese script, though it is also used (in part) for other languages, especially Japanese and Korean, and is therefore often called “CJK.”
Additional content appearing in this section has been removed.
Glyphs and Fonts
It is important to distinguish the character concept from the glyph concept. A glyph is a presentation of a particular shape a character may have when rendered or displayed. It has even been said that any character is an abstract idea, whereas glyphs for the character are its different visible manifestations.
Each character we use in English normally has the same basic shape, and glyphs for it differ in typographic design only. It is obvious that “T” in the Times font represents the same character as “T” in the Arial font, for example. However, the letter “a” has two rather different shapes (compare “a” in normal Times font and “a” in Times italic). When you write literally by hand, you may draw characters differently in different positions of a word. For example, a word-final “s” may be quite different than a word-initial “s.” In typewritten or typeset text, or in text displayed or printed on computers, such distinctions are not made, even in so-called handwriting-style fonts.
In Greek writing, a word-final sigma (ς) is rather different from a normal small sigma (σ), although they are logically the same character. The first and last letter of the word σοφός (sophos, “wise”) are the same but are written differently. However, since this is a special case, character codes usually solve this by encoding them as two separate characters, and Unicode follows suit, even without defining any equivalence between them.
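The two sigmas can be inspected directly. The following is a minimal Python sketch, assuming the standard `unicodedata` module; the helper name `describe` is only illustrative. It suggests that the two forms have distinct code numbers and names, that compatibility normalization does not unify them, and that they nevertheless share one uppercase letter.

```python
import unicodedata

# The word "sophos" in Greek, written with escapes to avoid any ambiguity
# about which code points are meant.
word = "\u03c3\u03bf\u03c6\u03cc\u03c2"

first, last = word[0], word[-1]


def describe(ch: str) -> str:
    """Return the code point in U+XXXX notation followed by the Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Compatibility normalization leaves the final sigma as it is: Unicode
# defines no equivalence between the two forms.
final_sigma_unchanged = unicodedata.normalize("NFKC", last) == last

# Both forms do, however, map to the same uppercase letter.
same_uppercase = first.upper() == last.upper()

if __name__ == "__main__":
    print(describe(first))
    print(describe(last))
    print(final_sigma_unchanged, same_uppercase)
```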
In other writing systems, the variation can be much bigger, especially if the writing systems imitate handwriting. In Arabic, letters have two or four contextual forms, which can be quite different from each other. The accompanying figure shows the four forms of an Arabic letter, usually called “ba” or more exactly bāʾ, though the Unicode name is Arabic letter beh (U+0628). The forms are (from right to left!) for use as isolated, at the start of a word, in the middle of a word, and at the end of a word. As you can see, for example, the word-final form (on the left) has a part that helps in joining the character with the previous character. Each of these forms, in turn, can appear differently in different fonts.
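Unicode normally encodes an Arabic letter once and leaves the choice of contextual form to rendering software, but it also contains compatibility characters for the individual forms. The sketch below assumes Python's standard `unicodedata` module and looks at the four compatibility characters for beh (U+0628) in the Arabic Presentation Forms-B block; the function names are illustrative only.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the character normally used in text

# Compatibility characters for the contextual forms of beh.
FORMS = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]


def form_names() -> list[str]:
    """Return the Unicode names of the four presentation forms."""
    return [unicodedata.name(ch) for ch in FORMS]


def all_map_to_beh() -> bool:
    """Check that compatibility normalization folds each form back to beh."""
    return all(unicodedata.normalize("NFKC", ch) == BEH for ch in FORMS)


if __name__ == "__main__":
    for name in form_names():
        print(name)
    print(all_map_to_beh())
```

In ordinary text you would store only U+0628 and let the font and rendering engine pick the form.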
Additional content appearing in this section has been removed.
Definitions of Character Repertoires
The implementation of Unicode support is a long and mostly gradual process. Unicode can be supported by programs on any operating system, although some systems allow much easier implementation than others; this mainly depends on whether the system uses Unicode internally, so that support for Unicode is built in.
Even in circumstances where Unicode is supported in principle, the support usually does not cover all Unicode characters. For example, an available font may cover only the part of Unicode that is of practical importance in some area. When text data produced in one program is to be processed in another, we should be prepared for difficulties with any unusual characters. For data transfer, it is essential to know which Unicode characters the recipient is able to handle.
Thus, although Unicode contains a huge number of characters, not all of them can be used safely. Among the 100,000 or so characters, usually only a small subset can be used in a particular application and context without a serious risk of distorting information.
Each character code, by itself, defines a character repertoire: the collection of characters that can be represented in the code. In addition to this, subsets of such collections can be defined.
A character repertoire is any collection of characters, without implying any particular implementation even at the level of code numbers. However, in practice, the simplest way to define a character repertoire is to use Unicode as the basis and simply list the code numbers. Such a definition specifies a closed collection, which does not change if the Unicode standard is enhanced. In contrast, by listing a set of Unicode blocks you define an open collection, which is fixed at any given moment of time but will automatically expand if new characters are added to any of those blocks in a revision of the Unicode standard.
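The difference between a closed and an open collection can be sketched in a few lines of Python. The particular repertoires here (printable ASCII plus the euro sign, and two Unicode blocks) are assumptions chosen for illustration, not definitions taken from any standard profile.

```python
# Closed repertoire: an explicit list of code numbers. It never changes,
# whatever is later added to Unicode.
CLOSED = frozenset(range(0x20, 0x7F)) | {0x20AC}  # printable ASCII + euro sign

# Open repertoire: a list of Unicode blocks, given as inclusive ranges.
# Any character later assigned inside these ranges belongs automatically.
OPEN_BLOCKS = {
    "Basic Latin": (0x0000, 0x007F),
    "Greek and Coptic": (0x0370, 0x03FF),
}


def in_closed(ch: str) -> bool:
    """True if the character is listed explicitly in the closed repertoire."""
    return ord(ch) in CLOSED


def in_open(ch: str) -> bool:
    """True if the character's code number falls inside one of the blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in OPEN_BLOCKS.values())


if __name__ == "__main__":
    for ch in ("A", "\u20ac", "\u03c0"):
        print(f"U+{ord(ch):04X}", in_closed(ch), in_open(ch))
```

Note that `in_open` accepts every code number in a block, including positions that are unassigned today; that is exactly what makes the collection open.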
Additional content appearing in this section has been removed.
Numbering Characters
Definitions in character standards assign a number to each character. The numbers are unique in each standard, but different standards assign the numbers differently. Some commonly used standards are mutually compatible, in part: the numbers of characters in ASCII (ranging from 0 to 127) are the same as in the ISO 8859 standards, and the numbers of characters in ISO 8859-1 (ranging from 0 to 255) are the same as in Unicode.
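This compatibility can be checked mechanically. The following Python sketch relies only on the built-in `ascii` and `iso-8859-1` codecs; the function names are illustrative.

```python
def ascii_matches_latin1() -> bool:
    """Octets 0..127 decode to the same characters in ASCII and ISO 8859-1."""
    data = bytes(range(128))
    return data.decode("ascii") == data.decode("iso-8859-1")


def latin1_matches_unicode() -> bool:
    """Each ISO 8859-1 code number 0..255 equals the Unicode code number."""
    data = bytes(range(256))
    text = data.decode("iso-8859-1")
    return all(ord(ch) == i for i, ch in enumerate(text))


if __name__ == "__main__":
    print(ascii_matches_latin1(), latin1_matches_unicode())
```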
The numbers are nonnegative integers 0, 1, 2,…, but are not necessarily consecutive; there can be gaps in the assignment. For example, in ISO 8859 standards, numbers in the range 128 to 159 are unassigned; more specifically, they are reserved for control purposes, leaving it up to other standards to define them. Unicode contains a lot of gaps, due to the coding structure, partly in order to leave space for future extensions.
It might sound natural to use the first few code numbers for digits 0, 1,…, but character standards use different assignments. Don’t expect to find much logic in it. The code number of a character should be treated as fairly arbitrary, but fixed.
The number assigned to a character in a character standard has many different names: code number, code position, code value, code element, code point, code set value, as well as simply code. In the Unicode standard, the term “code point” is used both about a number and about a location in the coding space where a character could reside. Some code points are allocated for characters, a few have been explicitly designated as not corresponding to characters (now or ever), and most code points are still not assigned in any way.
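A rough way to see these kinds of code points is to look at the General Category property. The sketch below assumes Python's `unicodedata` module, whose answers reflect the Unicode version bundled with the interpreter; the category `Cn` covers both unassigned code points and noncharacters, so the function cannot tell those two apart. The labels returned by `status` are this example's own wording.

```python
import unicodedata


def status(cp: int) -> str:
    """Classify a code point by its General Category (a coarse sketch)."""
    category = unicodedata.category(chr(cp))
    if category == "Cn":
        return "unassigned or noncharacter"
    if category == "Cs":
        return "surrogate"
    if category == "Co":
        return "private use"
    return "assigned"


if __name__ == "__main__":
    for cp in (0x0041, 0x0085, 0xD800, 0xE000, 0xFFFE):
        print(f"U+{cp:04X}", status(cp))
```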
Since characters are internally represented by their code numbers, a character can also be treated as an integer. In fact, many old programming languages lack a data type for characters and use an integer type instead. However, the code numbers are usually not used in arithmetic operations, since they mostly lack numeric meaning. If a character’s number is smaller than another character’s number, this by no means implies a corresponding relation in alphabetic order. For some small regions of code numbers, the order actually corresponds to alphabetic order, though.
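A few of these points can be tried out in Python, where `ord` returns the code number of a character. This is only a small illustration; the sample words are arbitrary.

```python
def code_order_examples() -> dict:
    """Collect a few facts about code numbers and ordering."""
    return {
        # The digit zero does not have code number 0.
        "digit_zero": ord("0"),
        # Within a-z, code order matches alphabetic order.
        "abc": [ord(c) for c in "abc"],
        # All uppercase letters A-Z come before all lowercase letters a-z.
        "Z_before_a": ord("Z") < ord("a"),
        # Accented letters come after z, far from their base letters.
        "z_before_e_acute": ord("z") < ord("\u00e9"),
        # Sorting by code number is therefore not alphabetic sorting.
        "sorted": sorted(["apple", "Zebra", "\u00e9clair"]),
    }


if __name__ == "__main__":
    for key, value in code_order_examples().items():
        print(key, value)
```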
Additional content appearing in this section has been removed.
Encoding Characters as Octet Sequences
When we need to store character data on a computer, we might consider storing it in an exact visual shape. Some people would call this a very naive idea, but it is in fact quite feasible, even necessary—for some purposes. If you have an old manuscript to be stored digitally, you need to scan it with high resolution and store it in some image format. Sometimes you would do that for individual characters as well. On web pages, for example, it is common to use images containing text for logos, menu items, buttons, etc., in order to produce a particular visual appearance.
For most processing of texts on computers, however, we need a more abstract presentation. It would be highly impractical to work on scanned images of characters when storing and transferring text, not to mention comparing strings. We do not want to repeat the process of recognizing a character’s identity every time we use the character. Instead, we use characters as atoms of information, identified by their code numbers or in some other simple way. This is really what “abstract characters” are about.
Plain text is a technical term that refers to data consisting of characters only, with no formatting information such as font face, style, color, or positioning. However, formatting such as line breaks and simple spacing using space characters may be included, to the extent that it can be expressed using control characters only. Moreover, all characters are to be taken as such, without interpreting them as formatting instructions or tags. For example, HTML or XML is not plain text.
Plain text is a format that is readable by human beings when displayed as such. The reader needs to know the human language used in the text, of course. The display of plain text depends on the font that happens to be used. This can often be changed within a program, but such settings change the font of all text. (As an exception, if the font chosen does not contain all the characters used in the text, a clever program might use other fonts as backup for missing characters.)
Additional content appearing in this section has been removed.
Working with Encodings
When you use characters on a computer, some software will internally encode them in binary format. Most users never need to know the details of this, still less need to actually handle the encoding process, but it is essential to know that there are different encodings, with different properties. In transferring data between applications and computers, you may need to change the encoding or select a suitable encoding.
Text editors and many other programs typically have a File menu, with a Save function for storing data onto disk. Normally, this function uses the file format and the character encoding that is typical of the program. However, there is usually also a Save As function, which lets the user select the format and encoding. This function is often used because it lets you save an edited document under a different filename.
The Save As function is often the simplest way to convert between different encodings (and file formats). You simply open a file and save it differently. For example, suppose you have used Notepad to create a plain text file. If you use, for example, an English version of Windows, the default encoding that Notepad uses is Windows Latin 1. Now suppose that a friend has asked you to send your text in the UTF-8 encoding for some reason. You simply open your file in Notepad, select File → Save As and then choose the UTF-8 encoding from the menu of encodings, as shown in Figure 1-13. The figure illustrates the three basic things you can (and need to) specify in Save As dialogs: the filename, the file format, and the encoding.
Figure 1-13: An extract from a Save As dialog in Notepad
The list of possible encodings in a Save As dialog varies greatly, and the names of the encodings are not always official names. For example, in Microsoft products, “ANSI” often appears as denoting the character code that the system uses as its normal 8-bit code, such as the Windows Latin 1 encoding, which should be called “windows-1252.” The word “Unicode” may denote different encodings used for Unicode, typically UTF-16. Use the UTF-8 encoding for Unicode text, unless you have a good reason for doing otherwise.
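The same conversion can be scripted when many files are involved. The following is a minimal Python sketch, not a description of what Notepad does internally; the function names and default arguments are assumptions made for this example, while `windows-1252` and `utf-8` are codec names that Python recognizes.

```python
def reencode(data: bytes,
             from_encoding: str = "windows-1252",
             to_encoding: str = "utf-8") -> bytes:
    """Interpret octets in one encoding and re-express the text in another."""
    return data.decode(from_encoding).encode(to_encoding)


def convert_file(src: str, dst: str,
                 from_encoding: str = "windows-1252",
                 to_encoding: str = "utf-8") -> None:
    """Read src as raw octets and write the re-encoded octets to dst."""
    with open(src, "rb") as infile:
        data = infile.read()
    with open(dst, "wb") as outfile:
        outfile.write(reencode(data, from_encoding, to_encoding))


if __name__ == "__main__":
    # "cafe" with e acute, a space, and the euro sign, as Windows Latin 1 octets.
    print(reencode(b"caf\xe9 \x80"))
```

Working on raw octets keeps line endings untouched; the conversion fails with an exception if the input contains octets that are undefined in the source encoding.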
Additional content appearing in this section has been removed.
Working with Fonts
In a word processor like Microsoft Word, it is deceptively simple to change the overall font, or the font of some particular piece of text. You can paint a piece of text with the mouse and select a font for it from a drop-down menu. In web authoring, it is not much more difficult, especially if you use authoring software that resembles a word processor. However, things become difficult if the chosen font does not contain all the characters you need.
Each computer system is shipped with some repertoire of fonts, which may be insufficient for working with a large character repertoire even if the system is basically “Unicode enabled.”
For example, a typical Windows system might not have any font that is rich enough to present all the characters you need. Unfortunately, Windows has often been preinstalled without full “multilingual support.” You may therefore need to install additional fonts.
On Windows XP, you would do this as follows:
  1. Select Start → Control Panel → Regional and Language Options.
  2. In the “Languages” tab, there are checkboxes for two groups of languages, “complex scripts and right-to-left languages” and “East Asian languages.” Check either or both of them to install optional fonts and system support for these languages. You will be informed about disk requirements and asked to confirm. You might be prompted to insert the Windows CD-ROM or point to a network location where the files are located.
On older Windows systems, you may need to select Control Panel → Add/Remove Programs, click on Multilanguage Support, and then Details. Make sure a checkmark appears beside the language or languages you want to use, and then click on OK.
Additional content appearing in this section has been removed.
Summaries
The following summaries use very concise language, and they are hardly understandable in isolation. However, having read the text of this chapter, you may find them useful and return to them later. The terminology related to characters varies quite a lot, so the summaries help in checking out how this book names things.
Following is a list of terms you may come across:
Character
A basic unit of textual information, as an abstract concept, as opposed to stylistic and typographic variation between shapes that can be identified as the same character.
Character code
A mapping, often presented in tabular form, that defines a one-to-one correspondence between characters in a character repertoire and a set of nonnegative integers.
Character encoding
A method (algorithm) for presenting characters in digital form by mapping sequences of code numbers of characters into sequences of octets. Encodings have names, which can be registered.
Code number
The integer assigned to a character in a character code. Synonyms: code position, code value, code element, code point, code set value, code.
Character repertoire
A collection of distinct characters. No specific internal presentation in computers or data transfer is assumed. The repertoire per se does not even define an ordering for the characters; ordering for sorting and other purposes is to be specified separately. A character repertoire is usually defined by specifying names of characters and a sample (or reference) presentation of characters in visible form.
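The relationship between a character, its code number, and its encoded form can be seen in a short Python sketch. The helper name `octets` is illustrative; the codec names are ones Python's standard library provides.

```python
def octets(ch: str, encoding: str) -> str:
    """Return the octets that represent ch in the given encoding, in hex."""
    return ch.encode(encoding).hex(" ")


if __name__ == "__main__":
    a_grave = "\u00c0"  # LATIN CAPITAL LETTER A WITH GRAVE
    print(f"code number: U+{ord(a_grave):04X}")
    for encoding in ("iso-8859-1", "utf-8", "utf-16-be"):
        print(encoding, octets(a_grave, encoding))
```

One character with one code number yields different octet sequences depending on the encoding, which is why an encoding must be known before octets can be read as text.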
Additional content appearing in this section has been removed.
Chapter 2: Writing Characters
The practical difficulties of producing characters on normal computer keyboards are among the most serious obstacles to more widespread use of rich character repertoires. Most modern computers have rather good Unicode support, but people don’t make use of it, because they simply don’t know how to type special characters.
This chapter presents some common methods of entering characters. It is largely a collection of recipes, useful to people who work daily with texts containing “difficult” characters. Appendix A gives a quick reference for commonly needed characters.
The topic is also relevant to IT specialists who need to understand the possible input methods when designing applications and systems. The same applies to giving instructions on data entry, or simply asking someone to send you, in writing (on paper or in digital form), something that contains characters that are “special” to that person. It is not sufficient to know one way of typing characters, since users may not have the same methods at their disposal, or they might find them too awkward.
Method Varieties
There is no single answer to a question like “How do I write the character…?” The methods vary by program and equipment. In any given situation, there are usually several ways to write a character.
When you give individual instructions to someone, or you are solving your own problem with typing characters, you should normally try to find one way to input the characters, preferably the most convenient one. However, as usual, convenience is relative. It does not pay off to find a clever way of producing a character if you need it only once and you already know a general, if clumsy, way to input that character. When you give general instructions to many people, especially to people who work in different environments, you should try to explain a few alternatives. It is quite probable that different people need or like different methods.
There are many different methods for typing characters, often available in parallel. Some of them are very general, allowing even the insertion of any Unicode character. Some methods have been tailored for very special purposes, perhaps even for the entry of one particular character that would otherwise be difficult to produce. This chapter aims at clarifying things by explaining typical approaches. The multitude of methods can be divided into a few basic categories, to make things more understandable.
When you select methods to be explained to users of an application, it is usually best to aim at systematic ways rather than the fastest ways. That is, opt for a method that works for all the characters needed rather than an eclectic combination of tricks. The same may apply to your own use, e.g., when you need to type particular characters frequently.
Appendix A contains a collection of methods for some commonly needed characters. For casual use, pick up whatever works for you and suits you. For more regular use, it is better to analyze the needs and to make some choices.
Additional content appearing in this section has been removed.
Keyboard Variation and Settings
Understanding the effects of keyboard variation is essential, because you may need to work with different keyboards, or you may need to write instructions for people who use different keyboards. If you design computer applications for a potentially worldwide market, you need to make them work in a wide range of environments. Even a simple form on a web page might be a computer application in this sense.
Typing characters on a computer may appear deceptively simple: you press a key labeled “A,” and the character “A” appears on the screen. Well, you actually get uppercase “A” or lowercase “a” depending on whether you used the Shift key or not, but that’s common knowledge. You also expect “A” to be included into a disk file when you save what you are typing, you expect “A” to appear on paper if you print your text, and you expect “A” to be sent if you send your text by email or something like that. Moreover, you expect the recipient to see an “A.”
It has hopefully become clear from the previous discussion that the representation of a character in computer storage, on disk, or in data transfer may vary a lot. You have probably realized that if the character is not the common “A” but something more special, like an “A” with an accent (say, À), strange things might happen, especially if the data is not accompanied by adequate information about its encoding.
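What such strange things look like can be simulated in Python by encoding text in one code and decoding the octets in another. In this sketch, the pairing of Windows Latin 1 (`cp1252`) with the old DOS code page `cp437` is an assumption chosen for illustration; the function name `misread` is likewise hypothetical.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text in one encoding, then decode the octets in another."""
    return text.encode(written_as).decode(read_as)


if __name__ == "__main__":
    # The plain letter A survives, because both codes agree on it.
    print(misread("A", "cp1252", "cp437"))
    # A with grave becomes a box-drawing character.
    print(misread("\u00c0", "cp1252", "cp437"))
    # UTF-8 octets read as Windows Latin 1 turn one character into two.
    print(misread("\u00c0", "utf-8", "cp1252"))
```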
You might still be too confident. You probably expect that on your system at least things are simpler than that. If you use your very own very personal computer and press the key labeled “A” on its keyboard, then shouldn’t it be evident that in its storage and processor, on its disk, and on its screen it’s invariably “A”? Can’t you just ignore its internal character code and character encoding? Well, probably yes—with “A.” Don’t be so sure about À, for instance. On a typical PC, for example, try this: create a file containing À in Notepad and then open the command-line interface (DOS prompt) and display the file using the
Additional content appearing in this section has been removed.
Virtual Keyboards
In several systems, including MS Windows, it is possible to switch between different keyboard settings. This means that the effects of different keys do not necessarily correspond to the engravings in the key caps but to some other assignments. This way you can turn your keyboard to French, Greek, Russian, or another language just by clicking on an icon at the bottom of the screen and selecting a setting from a menu, as shown in Figure 2-3. What you need to do is to enable the “language support” you need. It is called language support, but the relevant part is keyboard settings. We will not go into details here, since the techniques depend on the version of Windows.
Figure 2-3: Changing keyboard settings from Finnish (code FI) to Spanish (code ES)
MS Windows has some keyboard shortcuts for switching between different keyboard layouts. If you right-click on the language indicator in the toolbar, you can access settings that control such shortcuts. They typically involve the Alt and Shift keys. It is convenient to be able to switch between two layouts simply by pressing Alt and Shift simultaneously, if you know how this works. It is less convenient to do such things by mistake and find yourself using an odd-behaving keyboard where, for example, pressing the “-” key produces “/” and you have no idea how to fix that. Therefore, avoid installing keyboard layout options on other people’s computers without informing them.
For example, if you write in English but frequently need Greek letters, you can install Greek keyboard settings. You would then learn how to switch between the settings—for example, by using Alt and Shift keys. To type the letter pi (π), you would do the switch, press the “P” key, and switch back to your normal keyboard settings.
Additional content appearing in this section has been removed.
Program Commands
We often need program-specific ways of entering characters from a keyboard, either because there is no key for a character we need or there is but it does not work. The program involved might be part of system software, or it might be an application program. We describe here some typical cases.
In typical computer systems, you can copy data from one program to another through an internal storage area called the clipboard. On Windows, you can usually highlight text with the mouse or select a piece of text otherwise, and then press Ctrl-C to copy, click on a location in another window, and press Ctrl-V to paste a copy of the text there. This also works inside a program of course, so you can use it to create copies of a character or a string.
This feature is well known by most users and often very convenient, though it cannot be the primary method of writing text. You can however copy characters from web pages or from text documents specifically designed for use as “cliptext.”
Often this technique has the property of copying text formatting along with the text. If you copy bold 16-point Verdana text from Excel to Word, you get 16-point Verdana text, not text in the normal font as defined by your Word settings or template. This might be desirable, but more often, it is a problem. Moreover, constructs like hypertext links may get copied along with the text. To make sure that only the plain text is inserted, you can first paste the text in Notepad, select it again there, press Ctrl-C, and paste in the desired destination.
Programs may have command menus for inserting characters, so that characters are identified by some names or glyphs. At the simplest, you just select a command and a subcommand from a menu. Usually it is more complicated, to allow the insertion of more characters than can conveniently be included into a command menu.
Character Maps
A character map as an input method is an array of images of characters where you can click on an image and have the character inserted into your data. A click might immediately insert the character in the current point of insertion in some window. More often, a click selects the character, and then you click on a button to do something with it. A character map acts as a selection table, a menu arranged as a table.
Different programs have different character maps, ranging from a simple one (typically containing just 256 positions) to a full Unicode table with many extra features. Of course, only a small range—e.g., 80 positions—of Unicode characters can be visible at any given moment.
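A character map can be thought of as a window onto a run of code points, laid out as a table. The following minimal sketch builds one visible page of 80 positions (5 rows of 16) and selects a cell from it. The function names `character_map_page` and `pick`, the grid shape, and the placeholder dot are assumptions made for illustration and do not come from any particular program.

```python
import unicodedata


def character_map_page(start, rows=5, cols=16):
    """Return one visible page of a character map as a list of rows."""
    page = []
    for r in range(rows):
        row = []
        for c in range(cols):
            ch = chr(start + r * cols + c)
            # Categories starting with C (control, unassigned, ...) or
            # Z (separators) have no visible glyph; show a placeholder.
            visible = unicodedata.category(ch)[0] not in "CZ"
            row.append(ch if visible else "·")
        page.append(row)
    return page


def pick(page, row, col):
    """Simulate clicking a cell: return the character shown there."""
    return page[row][col]


# A page starting at U+0390, inside the Greek and Coptic block.
page = character_map_page(0x0390)
for row in page:
    print(" ".join(row))
print(pick(page, 3, 0))  # π
```

A real character map would also filter by the selected font, since a cell is only useful if the font has a glyph for it.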
Old Windows systems have a rather primitive character map, which you can launch by selecting Start → Programs → Utilities → System utilities → Character Map. On newer systems, the character map is much more powerful, but the method of starting it is equally clumsy and hard to find if you do not know about it.
Let us first consider the character map in MS Word in an old Windows system. The system’s character map being primitive, Word offers more. As mentioned before, you launch the map using the command Insert → Symbol to invoke an auxiliary window, where the initial pane Symbols contains a map, as in Figure 2-11.
Figure 2-11: Character map in an old version of MS Word
Newer systems have more powerful character maps, but even the old interface has the basic functionality you need to insert any Unicode character:
  1. Select a font from the Font menu. This is essential because the map usually shows only those characters that appear in the chosen font. Arial Unicode MS contains a relatively large subset of Unicode.
Replacements on the Fly
A program may process your input so that it is immediately changed as you type. This is usually based on assumptions about what people really want to type but cannot type directly, due to keyboard limitations. Sometimes such features are very convenient, sometimes really frustrating, if the user does not know how to override or undo their effects.
Word processors often modify user input so that when you have typed, for example, the three characters (c), the program changes that string, both internally and visibly, to the single character ©. This substitution is often convenient, especially if you can add your own rules for modifications. On the other hand, it causes unpleasant surprises and problems when you actually meant what you wrote—e.g., you wanted to write letter “c” in parentheses.
Use Ctrl-Z as the immediate cure to an undesired on-the-fly conversion in MS Word. If you are uncertain of what happened, use Edit → Undo instead (since Word will show which operation will be undone).
In MS Word, there are several automatic conversions like the one described above. They can be modified: you can remove conversions that you regard as annoying, and you can add conversions of your own.
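The behavior described above (replace as you type, undo the replacement, add or remove rules) can be sketched in a few lines. This is a toy model under stated assumptions, not Word's actual mechanism: the class `AutoCorrect`, its method names, and the single-level undo are hypothetical.

```python
class AutoCorrect:
    """Toy on-the-fly replacement engine (hypothetical, single-level undo)."""

    def __init__(self, rules=None):
        self.rules = {}
        self.buffer = ""
        self._before = None  # text as it was before the last replacement
        default_rules = {"(c)": "©"}
        for typed, replacement in (default_rules if rules is None else rules).items():
            self.add_rule(typed, replacement)

    def add_rule(self, typed, replacement):
        if not typed:
            raise ValueError("the typed string of a rule must not be empty")
        self.rules[typed] = replacement

    def remove_rule(self, typed):
        self.rules.pop(typed, None)

    def type_char(self, ch):
        """Append one character, then apply the first rule that matches."""
        self.buffer += ch
        self._before = None
        for typed, replacement in self.rules.items():
            if self.buffer.endswith(typed):
                self._before = self.buffer
                self.buffer = self.buffer[: -len(typed)] + replacement
                break
        return self.buffer

    def undo(self):
        """Revert the most recent replacement, like an immediate Ctrl-Z."""
        if self._before is not None:
            self.buffer = self._before
            self._before = None
        return self.buffer


ac = AutoCorrect()
for ch in "(c)":
    ac.type_char(ch)
print(ac.buffer)  # ©
print(ac.undo())  # (c)
```

Because the undo only remembers the last replacement, it must be used immediately, which mirrors the advice to press Ctrl-Z right after an unwanted conversion.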

Viewing and changing the rules

There are many different settings in MS Word, and their organization is not always what we might expect. In the Tools menu, as shown in , the Customize and Options commands lead to various settings, but the automatic replacements are found via the command AutoCorrect. Having selected the command, you get a new window, where the first pane is as in .
Special Techniques
General techniques that let you type any Unicode character are often impractical when you need to write a large number of characters of some particular kind. More specialized techniques are often more convenient. Moreover, some characters cannot be written just b
