© Copyright IBM Corp. 2002. All rights reserved. 43
Chapter 9. Linguistic services
People fundamentally communicate ideas and concepts through the use of natural language.
The encoding of these ideas into specific languages such as English or Chinese is analogous
to the encoding of the text that represents them into character sets such as ISO 8859-1 or
Big5. We can view information processing at two distinct levels—the character-based
processing required to read and display text, and the linguistic-based processing that
interrogates text at the language level in order to identify the various properties of words.
Linguistic services are required by more sophisticated global applications to address the
challenges presented by the growing body of electronic multilingual data on both the World
Wide Web and corporate intranets. Some of the key global e-business solution components
are possible only through the application of linguistic services (for example, voice as
input/output and linguistically sensitive searches).
Linguistic services for a particular language generally require an extensive understanding of
that language. In most cases, processing is language-sensitive. Even though many
algorithms used are language-independent, they are driven by language-specific data. In
other cases, both the algorithms and data are language-specific. Situations also exist where a
particular technology is language-independent (for example, in clustering), although these
situations are rare and usually depend on lower-level services that are themselves
language-specific (such as segmentation). Finally, some types of languages have unique
linguistic features not shared by others, thus requiring specific support for those features.
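The split between a language-independent algorithm and the language-specific data that drives it can be illustrated with a minimal sketch. The lexicons, the function name `unknown_words`, and the whitespace tokenization below are all assumptions made for illustration. They are not taken from any particular product, and a real service would load far larger lexicons.

```python
# Hypothetical, tiny per-language lexicons (language-specific data).
LEXICONS = {
    "en": {"i", "read", "red", "the", "book", "last", "night"},
    "de": {"ich", "las", "das", "buch", "nacht"},
}


def unknown_words(text, language):
    """Language-independent lookup: return the tokens missing from the lexicon."""
    lexicon = LEXICONS[language]   # the only language-specific input
    # Naive whitespace segmentation. This is an assumption that holds for
    # English or German but not for languages written without spaces, such
    # as Chinese, which need a language-specific segmentation service.
    tokens = text.lower().split()
    return [token for token in tokens if token not in lexicon]


print(unknown_words("I red the bok", "en"))
print(unknown_words("Ich las das Buch", "de"))
```

The lookup code is identical for both languages and only the lexicon changes. Note that "red" passes the English check because it is a valid word, and that the tokenization step is the point where a lower-level, language-specific service would be needed.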
General low-level linguistic tools
Linguistic services encompass a variety of technologies, including:
• Spell checking, which verifies that words are spelled correctly. The concept of
misspelling does not apply to ideographic languages such as Chinese or Korean in the
same sense as to orthographic languages. For ideographic languages, input is controlled
by an Input Method Editor (IME). Any character generated by the IME is valid and
represents a real word. The issue for these languages is not one of orthographic validity,
but of identifying grammatical or semantic mistakes that arise when a word or character
is selected by mistake. This is similar to the situation in English, where it is
possible to misspell one word as another valid word so that only a grammar check or
statistical analysis against common mistakes will reveal that this orthographically correct
word is in fact a misspelling and thus contextually incorrect.
Spell checking technology is widely used in word processors.
Figure 9-1 A spell checker helps to verify the spelling of words
• Grammar checking, the process of verifying that sentence structure is valid according to a
set of rules. These rules form the grammar and are language-specific. As previously
discussed, grammar checking can be used to find combinations of characters that are
validly spelled but contextually incorrect. For example, someone might erroneously use a
character that looks or sounds similar to the intended character. An English example
might be “I red (meaning read) the book last night.” Today's word processors incorporate
this technology. Grammar checking plays a key role in spell checking for ideographic
languages. It is also used in the very important area of disambiguation. Closely related to
grammar checking is grammatical parsing, which is used to analyze queries in natural
language question-answering applications.
• Hyphenation, the common practice of using the hyphen symbol to compose words and
to mark the points where a word can be divided at a line break. Hyphenation is meaningless
in Chinese, Japanese, and Korean.
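The spell checking and grammar checking items above both describe mistakes in which one valid word is typed in place of another, as in “I red the book last night.” A minimal sketch of one way to flag such errors follows, using sets of easily confused words together with a small table of plausible word pairs. The confusion sets, the word-pair table, and the function name `suggest_corrections` are hypothetical and chosen only for illustration. A production checker would rely on a full grammar or on statistics gathered from a large corpus.

```python
# Hypothetical confusion sets: valid words that are often typed for one another.
CONFUSION_SETS = [{"red", "read"}, {"their", "there"}]

# Hypothetical context data: (previous word, word) pairs considered plausible.
PLAUSIBLE_PAIRS = {
    ("i", "read"),
    ("the", "red"),
    ("over", "there"),
    ("their", "book"),
}


def suggest_corrections(sentence):
    """Return (position, word, suggestion) for words that look contextually wrong."""
    tokens = sentence.lower().rstrip(".").split()
    suggestions = []
    for i in range(1, len(tokens)):
        previous, word = tokens[i - 1], tokens[i]
        for group in CONFUSION_SETS:
            # Only act when the word is confusable and its context is implausible.
            if word in group and (previous, word) not in PLAUSIBLE_PAIRS:
                for alternative in sorted(group - {word}):
                    if (previous, alternative) in PLAUSIBLE_PAIRS:
                        suggestions.append((i, word, alternative))
    return suggestions


print(suggest_corrections("I red the book last night."))
print(suggest_corrections("The red book"))
```

A dictionary lookup alone accepts both sentences, because "red" is a valid word. Only the context check distinguishes them, which is why this kind of error is found by grammar checking or statistical analysis rather than by orthographic validation.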
