Characters with ambiguous semantics have General Category values that are meant to
reflect their typical use in normal text. Thus, for example, hyphen-minus is classified
as “Punctuation, dash,” although it is often used as a mathematical symbol.
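Such classifications are easy to check with a Unicode-aware library. The following is a minimal sketch, assuming Python 3 and its standard unicodedata module (neither is used in this book's own examples). The helper name general_category is made up for the illustration. It prints the General Category of hyphen-minus next to two characters that are classified as mathematical symbols.

```python
import unicodedata

# Characters to inspect: hyphen-minus, plus sign, and the dedicated minus sign.
samples = ["\u002D", "\u002B", "\u2212"]

def general_category(ch):
    """Return the two-letter General Category abbreviation of one character."""
    return unicodedata.category(ch)

for ch in samples:
    # Show code point, character name, and General Category on one line.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {general_category(ch)}")
```

Hyphen-minus is reported as Pd (Punctuation, dash), whereas the plus sign and U+2212 MINUS SIGN are reported as Sm (Symbol, math).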
Use of General Category in Programming
To illustrate the use of this property in programming, let us consider the following
simple task: read a text file and print all lines that contain an uppercase (capital) letter.
Using a modern version of the Perl programming language, with Unicode support, you
can do this with a three-liner (which could be written as a one-liner if you like):
while (<>) {
    if (m/\p{Lu}/) { print; }
}
This program contains a loop that reads an input line and prints it if the condition m/…/
is true, i.e., if a substring of the input line matches the expression between the slashes.
The Unicode-specific part here is the expression \p{Lu}, which by definition matches any
character whose General Category value is Lu. This covers Latin uppercase letters with
or without diacritic marks (A, Â, etc.) as well as Greek, Cyrillic, and other uppercase
letters. An approach that uses the character properties is of course much simpler than
writing program code that tests all the different possibilities separately. Whether the
broad concept of “uppercase letter” corresponding to the General Category value Lu is
really adequate in a particular situation depends on the context and application.
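Not every regular expression engine understands \p{Lu}. As a rough equivalent for readers who do not use Perl, here is a sketch in Python 3. It assumes only the standard library, where the re module does not accept \p{…}, so it tests each character's General Category with unicodedata.category instead. The names has_uppercase and uppercase_lines, and the sample lines, are invented for this illustration.

```python
import unicodedata

def has_uppercase(line):
    """True if any character in the line has General Category Lu."""
    return any(unicodedata.category(ch) == "Lu" for ch in line)

def uppercase_lines(lines):
    """Yield only those lines that contain at least one Lu character."""
    for line in lines:
        if has_uppercase(line):
            yield line

# Hypothetical input; a real program would iterate over an open text file.
sample = ["no capitals here\n", "Ärger\n", "жук и Жук\n", "1234 -+\n"]
for line in uppercase_lines(sample):
    print(line, end="")
```

The sketch also shows why the adequacy of Lu depends on the application: a titlecase digraph such as U+01C5 has the General Category Lt, so has_uppercase("\u01C5") is false even though the character may look like a capital letter to a reader.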
An Overview of Properties
For overview and quick-reference purposes, we will present an alphabetic table of
properties here, followed by a list of explanations of the meanings of the properties.
Many of the concepts used there will be explained later; issues that are too specialized
to be discussed in this book need to be looked up in the Unicode material.
The word “property” can have several meanings. For example, the shape of a character
can be regarded as its property, and so can a statement about its use. However, in
Unicode, the word “property” normally refers to formally defined properties. Often the
definition is given as a table that lists characters and values of the property for each
character. The overall structure is described in the document “Unicode Character Database.”
The Unicode Character Database (UCD) itself is a collection of plain text files in fixed,
well-defined formats, which are suitable for automated processing. These files are available
at addresses that begin with, and they specify the values of properties for each character,
either by explicitly assigning a value or by implying a default value.
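To make "explicitly assigning a value or implying a default" concrete, here is a small sketch in Python 3. It assumes the semicolon-separated record layout of the UCD file UnicodeData.txt, in which the first field is the code point in hexadecimal and the third field is the General Category. The three records are an illustrative sample embedded in the code, not the full file, and the function names are made up for this example.

```python
# Three records in the semicolon-separated layout of UnicodeData.txt
# (illustrative sample only, not the full file).
SAMPLE = """\
002D;HYPHEN-MINUS;Pd;0;ES;;;;;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
"""

def parse_general_categories(text):
    """Map code point (int) to General Category from UnicodeData-style lines."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(";")
        code_point = int(fields[0], 16)  # field 0: code point in hexadecimal
        table[code_point] = fields[2]    # field 2: General Category
    return table

def lookup(table, code_point, default="Cn"):
    """Return the explicit value if listed, else the default Cn (unassigned)."""
    return table.get(code_point, default)

categories = parse_general_categories(SAMPLE)
print(lookup(categories, 0x0041))  # listed explicitly in the sample
print(lookup(categories, 0x0378))  # not listed, so the default applies
```

With only three records loaded, every other code point falls back to the default, so the lookup is meaningful only once the complete file has been parsed. The sketch also ignores the paired "First" and "Last" records that the real file uses for large ranges, which a full parser would have to expand.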