Comparing and Sorting Unicode Text
When you use Perl’s built-in sort function
or cmp operator, strings are not compared alphabetically.
Instead, the numeric codepoint of each character in one string is
compared with the numeric codepoint of the corresponding character in
the other string. This doesn’t work so well on text where some letters
are shared between languages and other letters are peculiar to each
language. It’s not just letters that have misordered
codepoints. Digits and other supposedly contiguous sequences can be
out of order, too, because some were added to the character sets when they
were small, and others were added after the character sets grew, like
Topsy. For instance, the superscripts used for squares and cubes (² and ³)
were added to Latin-1 early on, along with ¹, while the remaining
superscript digits arrived later at much higher codepoints.
Notice how they sort early, too:
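    use v5.14;
    use utf8;                   # the source code contains UTF-8 literals
    use open qw(:std :utf8);    # encode output as UTF-8, avoiding "Wide character" warnings
    my @exes = qw( x⁷ x⁰ x⁸ x³ x⁶ x⁵ x⁴ x² x⁹ x¹ );
    @exes = sort @exes;
    say "@exes";
    # prints: x² x³ x¹ x⁰ x⁴ x⁵ x⁶ x⁷ x⁸ x⁹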
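The same codepoint-by-codepoint comparison misorders ordinary letters as well. Here is a minimal sketch, assuming a small made-up word list (the words and variable names are illustrative, not from the text): every uppercase ASCII letter has a lower codepoint than every lowercase one, and an accented letter such as é sits above all of them.

```perl
use v5.14;
use utf8;                   # the source code contains UTF-8 literals
use open qw(:std :utf8);    # encode output as UTF-8

# Hypothetical word list, chosen only to show the effect.
my @words  = qw( banana éclair Zebra apple );
my @sorted = sort @words;   # same as: sort { $a cmp $b } @words

say "@sorted";              # prints: Zebra apple banana éclair

# Show the codepoints that decide the order of the first characters.
printf "%s = U+%04X\n", $_, ord($_) for qw( Z a b é );
# prints: Z = U+005A, a = U+0061, b = U+0062, é = U+00E9 (one per line)
```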
Because codepoint order does not correspond to alphabetic order,
your data will come out in an order that, while not exactly random,
isn’t what someone looking for a lexicographic sort wants. The default
sort is good mostly for providing
fast access to an ordering that will be the same every time, even
though it isn’t usefully alphabetic. It’s just deterministic.
Sometimes that’s good enough, but other times…
Enter the standard Unicode::Collate module, which implements the Unicode Collation
Algorithm (UCA), a highly customizable, multilevel sort specifically designed for Unicode data. The module ...
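As a rough illustration of the difference, the sketch below sorts the superscript list from earlier with a default Unicode::Collate object, with no locale tailoring or custom options. The word list is a made-up example, and the variable names are assumptions for illustration only.

```perl
use v5.14;
use utf8;                   # the source code contains UTF-8 literals
use open qw(:std :utf8);    # encode output as UTF-8
use Unicode::Collate;

# Default collator: plain UCA ordering, no locale-specific tailoring.
my $collator = Unicode::Collate->new;

my @exes  = qw( x⁷ x⁰ x⁸ x³ x⁶ x⁵ x⁴ x² x⁹ x¹ );
my @words = qw( banana éclair Zebra apple );   # hypothetical word list

my @sorted_exes  = $collator->sort(@exes);
my @sorted_words = $collator->sort(@words);

say "@sorted_exes";     # prints: x⁰ x¹ x² x³ x⁴ x⁵ x⁶ x⁷ x⁸ x⁹
say "@sorted_words";    # prints: apple banana éclair Zebra
```

The collator object also provides a cmp method, so `$collator->cmp($a, $b)` can stand in for the built-in cmp inside a custom sort block.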