Glyphs and Fonts
It is important to distinguish the character concept from the glyph concept. A glyph
is a presentation of a particular shape a character may have when rendered or displayed.
It has even been said that any character is an abstract idea, whereas glyphs for the
character are its different visible manifestations.
Each character we use in English normally has the same basic shape, and glyphs for it
differ in typographic design only. It is obvious that T in the Times font represents
the same character as T in the Arial font, for example. However, the letter “a” has
two rather different shapes (compare “a” in normal Times font and “a” in Times italic).
When you write literally by hand, you may draw characters differently in different
positions of a word. For example, a word-final “s” may be quite different from a word-
initial “s.” In typewritten or typeset text, or in text displayed or printed on computers,
such distinctions are not made, even in so-called handwriting-style fonts.
In Greek writing, a word-final sigma (ς) is rather different from a normal small sigma
(σ), although they are logically the same character. The first and last letter of the word
σοφός (sophos, “wise”) are the same but are written differently. However, since this
is a special case, character codes usually handle it by encoding the two forms as
separate characters. Unicode follows suit and does not even define any equivalence
between them.
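The sigma case can be checked with a short Python sketch. It uses only the standard unicodedata module; the helper names (describe, equivalent_under) and the choice of the precomposed ό (U+03CC) in the sample word are assumptions made for illustration, not something the text prescribes.

```python
import unicodedata

FINAL_SIGMA = "\u03c2"  # ς, used at the end of a word
SMALL_SIGMA = "\u03c3"  # σ, used elsewhere

# The word "sophos", spelled with escapes so the code points are explicit.
WORD = "\u03c3\u03bf\u03c6\u03cc\u03c2"


def describe(ch):
    """Return the code point and the Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def equivalent_under(form, a, b):
    """True if a and b become identical under the given normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The first and last letters are logically the same letter
# but are coded as two different characters.
print(describe(WORD[0]))
print(describe(WORD[-1]))

# No normalization form maps one sigma to the other.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, equivalent_under(form, FINAL_SIGMA, SMALL_SIGMA))

# Both share the same capital letter, which is the only link shown here.
print(FINAL_SIGMA.upper() == SMALL_SIGMA.upper())
```

Note that this only probes normalization and uppercasing. Other Unicode operations, such as case folding, are a separate matter and are not covered by the sketch.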
In other writing systems, the variation can be much bigger, especially if the writing
systems imitate handwriting. In Arabic, letters have two or four contextual forms, which
can be quite different from each other. Figure 1-5 shows the four forms of an Arabic
letter, usually called “ba” or more exactly bāʾ, though the Unicode name is Arabic letter
beh (U+0628). The forms are (from right to left!) for use as isolated, at the start of a
word, in the middle of a word, and at the end of a word. As you can see, for example,
the word-final form (on the left) has a part that helps in joining the character with the
previous character. Each of these forms, in turn, can appear differently in different fonts.
In the ISO-8859-6 character code (Latin/Arabic), for example, each Arabic letter has
one code position only. This leaves it to rendering engines to determine the context
(position within a word) and to use the correct contextual form. Unicode, on the other
hand, contains both such characters (effectively, taken from ISO-8859-6) and each of
the contextual forms as a separately coded character. This lets you write Arabic so that
the rendering process can be very simple, at the cost of extra work in writing. However,
even using Unicode, you are normally supposed to use the more abstract Arabic letters.
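The two ways of coding the letter can be compared in a small Python sketch. It assumes the four separately coded forms of beh are the presentation-form characters U+FE8F through U+FE92 and uses compatibility normalization (NFKC) to map them back to the abstract letter. The names BEH, PRESENTATION_FORMS, to_abstract, and encodable_in_iso_8859_6 are made up for this illustration.

```python
import unicodedata

BEH = "\u0628"  # the abstract Arabic letter beh

# Separately coded contextual forms of beh (compatibility characters).
PRESENTATION_FORMS = {
    "isolated": "\ufe8f",
    "final": "\ufe90",
    "initial": "\ufe91",
    "medial": "\ufe92",
}


def to_abstract(text):
    """Replace compatibility characters by their abstract counterparts."""
    return unicodedata.normalize("NFKC", text)


def encodable_in_iso_8859_6(ch):
    """True if the character has a code position in ISO-8859-6."""
    try:
        ch.encode("iso8859_6")
        return True
    except UnicodeEncodeError:
        return False


for position, form in PRESENTATION_FORMS.items():
    # Each contextual form carries a compatibility mapping to U+0628.
    print(position, f"U+{ord(form):04X}", unicodedata.decomposition(form),
          to_abstract(form) == BEH)

# ISO-8859-6 has one code position for the letter and none for the forms.
print(BEH.encode("iso8859_6"))
print(encodable_in_iso_8859_6(PRESENTATION_FORMS["final"]))
```

Text written with the abstract letters survives conversion to ISO-8859-6, whereas text written with the contextual forms would first need a step like to_abstract. Whether NFKC is an appropriate tool for a given body of Arabic text is a separate question that this sketch does not settle.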
It is ultimately a matter of definition whether two graphic presentations are glyphs for
the same character or distinct characters. However, it is normally not an individual’s
Figure 1-5. The four contextual forms of the Arabic letter “ba”