
Code-Processing Tools
I encourage you to study the Perl code examples found in Appendix C to learn more about
how regexes can be used to manipulate CJKV text.
For more information on regular expressions, I highly recommend Jeffrey Friedl’s Mastering
Regular Expressions, Third Edition (O’Reilly Media, 2006).* X/Open Guide: Internationalisation
Guide (X/Open Company Limited, 1993) also includes a chapter on internationalized
regular expressions.
Search Engines
One important function of the Web is the ability to conduct searches for particular items.
While most of the popular search engines accept only ASCII characters (or regexes that
reflect ASCII text) for this task, there are now a number of CJKV-capable search engines
available.
The toughest issues faced by CJKV-capable search-engine developers include the
following:
• The proper handling of multiple-byte characters that appear in the search string, including
multiple-byte support for regexes.
• The proper handling of multiple encodings for both the search string and searched
text (for example, a user-entered Korean EUC-KR search string must be able to match
in documents encoded according to EUC-KR and ISO-2022-KR encodings), which
effectively means that the encodings for the search string and searched text must be
regularized (because the searched text may be large, it is much easier to regularize the
search string to m ...
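
To make the encoding issue concrete, here is a minimal Python sketch of one way to regularize encodings before matching. It is not taken from any particular search engine. It assumes that each document's encoding is already known, and it takes the simple route of decoding both the search string and the searched text to Unicode, which is practical only when the text is small enough to decode. The function name find_matches and the sample file names are hypothetical.

```python
# Sketch only: assumes every document's encoding is known in advance.
# "euc_kr" and "iso2022_kr" are codec names from Python's standard library.

def find_matches(query_bytes, query_encoding, documents):
    """Return the names of documents that contain the query.

    documents maps a (hypothetical) file name to a (raw bytes, encoding) pair.
    """
    # Regularize the search string to Unicode once.
    query = query_bytes.decode(query_encoding)
    hits = []
    for name, (data, encoding) in documents.items():
        # Regularize the searched text to Unicode, then compare characters
        # rather than bytes, so a match cannot straddle two characters.
        if query in data.decode(encoding):
            hits.append(name)
    return hits


# Sample data: the same Korean word stored under two different encodings.
documents = {
    "a.txt": ("한국어 사전".encode("euc_kr"), "euc_kr"),
    "b.txt": ("한국어 문서".encode("iso2022_kr"), "iso2022_kr"),
    "c.txt": ("일본어 사전".encode("euc_kr"), "euc_kr"),
}

# The user enters the search string in EUC-KR.
query_bytes = "한국어".encode("euc_kr")

print(find_matches(query_bytes, "euc_kr", documents))
```

A raw byte comparison would miss b.txt in this example, because ISO-2022-KR represents the same characters with 7-bit bytes between shift characters, whereas EUC-KR uses bytes with the high bit set.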