Locale Sorting
Although the default UCA works well for English and a lot of other languages—including Irish Gaelic, Indonesian, Italian, Georgian, Dutch, Portuguese, and German (except in phonebooks!)—it needs some modification to work the way speakers of many other languages expect their alphabets to sort. Or nonalphabets, as the case may be.
For example, the Nordic languages place some of their letters with diacritics after z instead of next to the regular letters. Even Spanish does things a little differently: the ñ isn’t considered a regular n with a tilde on it the way ã and õ are in Portuguese. Instead, it’s its own letter (named eñe, of course) that falls after n and before o in the Spanish alphabet. That means these words should sort in this order in Spanish: radio, ráfaga, ranúnculo, raña, rápido, rastrillo. Notice how ranúnculo comes before raña instead of after it.
The way to address locale-specific sorting of Unicode data is to use the Unicode::Collate::Locale module. It's part of the Unicode::Collate distribution, so it comes standard with v5.14, and installing either module from CPAN brings the other along with it.
The only difference between the two modules' APIs is that the Unicode::Collate::Locale constructor takes an extra parameter: the locale. As of this writing, 70 different locales are supported, including variants like German phonebook (umlauted vowels collate as though they were the regular vowel plus an e following them), traditional Spanish ...
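As a minimal sketch of that extra constructor parameter, the following sorts the Spanish word list from earlier twice: once with a plain Unicode::Collate object and once with a Unicode::Collate::Locale object built for the "es" locale. The variable names and the shuffled input order are illustrative assumptions, not taken from the text.

```perl
use v5.14;
use utf8;        # this source file contains literal UTF-8 characters
use warnings;

use Unicode::Collate;
use Unicode::Collate::Locale;

binmode(STDOUT, ":encoding(UTF-8)");

# The words from the Spanish example, deliberately out of order.
my @words = qw(rastrillo raña rápido radio ranúnculo ráfaga);

# Default UCA: ñ differs from n only at the secondary (accent) level.
my $default = Unicode::Collate->new;

# Same API, plus the locale parameter: here ñ is its own letter after n.
my $spanish = Unicode::Collate::Locale->new(locale => "es");

my @uca_order = $default->sort(@words);
my @es_order  = $spanish->sort(@words);

say "default UCA: @uca_order";
say "Spanish:     @es_order";
```

With the "es" collator, ranúnculo should come before raña, matching the order given above; the default collator should place them the other way around.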