
Characters with ambiguous semantics have General Category values that are meant to
reflect their typical use in normal text. Thus, for example, hyphen-minus is classified
as “Punctuation, dash,” although it is often used as a mathematical symbol.
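The classification can be checked directly. The following is a minimal sketch in Python, not part of the original discussion; it assumes the standard unicodedata module, and the helper name general_category is chosen here only for illustration. It contrasts hyphen-minus with the dedicated minus sign (U+2212), which is classified as “Symbol, math.”

```python
import unicodedata


def general_category(ch: str) -> str:
    """Return the two-letter General Category value of a single character.

    Illustrative helper (hypothetical name); it simply wraps
    unicodedata.category from the Python standard library.
    """
    return unicodedata.category(ch)


if __name__ == "__main__":
    # Hyphen-minus is classified by its typical use in text (a dash),
    # while the dedicated minus sign is classified as a math symbol.
    for ch in ["-", "\u2212"]:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {general_category(ch)}")
    # U+002D HYPHEN-MINUS: Pd
    # U+2212 MINUS SIGN: Sm
```

The values reported depend on the version of the Unicode Character Database bundled with the Python interpreter, although the two characters shown here have long-stable classifications.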
Use of General Category in Programming
To illustrate the use of this property in programming, let us consider the following
simple task: read a text file and print all lines that contain an uppercase (capital) letter.
Using a modern version of the Perl programming language, with Unicode support, you
can do this with a three-liner (which could be written as a one-liner if you like):
while (<>) {
    if (m/\p{Lu}/) {
        print; }}
This program contains a loop that reads an input line and prints it if the condition m/…/
is true, i.e., if a substring of the input line matches the expression between the slashes.
The Unicode-specific part here is the expression \p{Lu}, which by definition matches any
character whose General Category value is Lu. This covers Latin uppercase letters with
or without diacritic marks (A, Â, etc.) as well as Greek, Cyrillic, and other uppercase
letters. An approach that uses the character properties is of course much simpler than
writing program code that tests all the different possibilities separately. Whether the
broad concept of “uppercase letter” corresponding to the General Category value Lu is
really adequate in a particular situation depends ...