CJKV Information Processing, 2nd Edition

Name: CJKV Information Processing, 2nd Edition
Author: Ken Lunde
ISBN: 9780596800925

December 2008

Intermediate to advanced

912 pages

33h 22m

English

O'Reilly Media, Inc.

Foreword
Preface
Chapter 1: CJKV Information Processing Overview
Writing Systems and Scripts
Character Set Standards
Encoding Methods
Data Storage Basics
Input Methods
Typography
Basic Concepts and Terminology FAQ
What Are All These Abbreviations and Acronyms?

What Are Internationalization, Globalization, and Localization?
What Are the Multilingual and Locale Models?
What Is a Locale?
What Is Unicode?
How Are Unicode and ISO 10646 Related?
What Are Row-Cell and Plane-Row-Cell?
What Is a Unicode Scalar Value?
Characters Versus Glyphs: What Is the Difference?
What Is the Difference Between Typeface and Font?
What Are Half- and Full-Width Characters?
Latin Versus Roman Characters
What Is a Diacritic Mark?
What Is Notation?
What Is an Octet?
What Are Little- and Big-Endian?
What Are Multiple-Byte and Wide Characters?
Advice to Readers
Chapter 2: Writing Systems and Scripts
Latin Characters, Transliteration, and Romanization
Chinese Transliteration Methods
Japanese Transliteration Methods
Korean Transliteration Methods
Vietnamese Romanization Methods
Zhuyin/Bopomofo
Kana
Hiragana
Katakana
The Development of Kana
Hangul
Ideographs
Ideograph Readings
The Structure of Ideographs
The History of Ideographs
Ideograph Simplification
Non-Chinese Ideographs
Japanese-Made Ideographs—Kokuji
Korean-Made Ideographs—Hanguksik Hanja
Vietnamese-Made Ideographs—Chữ Nôm
Chapter 3: Character Set Standards
NCS Standards
Hanzi in China
Hanzi in Taiwan
Kanji in Japan
Hanja in Korea
CCS Standards
National Coded Character Set Standards Overview
ASCII
ASCII Variations
CJKV-Roman
Chinese Character Set Standards—China
Chinese Character Set Standards—Taiwan
Chinese Character Set Standards—Hong Kong
Chinese Character Set Standards—Singapore
Japanese Character Set Standards
Korean Character Set Standards
Vietnamese Character Set Standards
International Character Set Standards
Unicode and ISO 10646
GB 13000.1-93
CNS 14649-1:2002 and CNS 14649-2:2003
JIS X 0221:2007
KS X 1005-1:1995
Character Set Standard Oddities
Duplicate Characters
Phantom Ideographs
Incomplete Ideograph Pairs
Simplified Ideographs Without a Traditional Form
Fictitious Character Set Extensions
Seemingly Missing Characters
CJK Unified Ideographs with No Source
Vertical Variants
Noncoded Versus Coded Character Sets
China
Taiwan
Japan
Korea
Information Interchange and Professional Publishing
Character Sets for Information Interchange
Character Sets for Professional and Commercial Publishing
Future Trends and Predictions
Emoji
Genuine Ideograph Unification
Advice to Developers
The Importance of Unicode
Chapter 4: Encoding Methods
Unicode Encoding Methods
Special Unicode Characters
Unicode Scalar Values
Byte Order Issues
BMP Versus Non-BMP
Unicode Encoding Forms
Obsolete and Deprecated Unicode Encoding Forms
Comparing UTF Encoding Forms with Legacy Encodings
Legacy Encoding Methods
Locale-Independent Legacy Encoding Methods
Locale-Specific Legacy Encoding Methods
Comparing CJKV Encoding Methods
Charset Designations
Character Sets Versus Encodings
Charset Registries
Code Pages
IBM Code Pages
Microsoft Code Pages
Code Conversion
Chinese Code Conversion
Japanese Code Conversion
Korean Code Conversion
Code Conversion Across CJKV Locales
Code Conversion Tips, Tricks, and Pitfalls
Repairing Damaged or Unreadable CJKV Text
Quoted-Printable Transformation
Base64 Transformation
Other Types of Encoding Repair
Advice to Developers
Embrace Unicode
Legacy Encodings Cannot Be Forgotten
Testing
Chapter 5: Input Methods
Transliteration Techniques
Zhuyin Versus Pinyin Input
Kana Versus Transliterated Input
Hangul Versus Transliterated Input
Input Techniques
The Input Method
The Conversion Dictionary
Input by Reading
Input by Structure
Input by Multiple Criteria
Input by Encoding
Input by Other Codes
Input by Postal Code
Input by Association
User Interface Concerns
Inline Conversion
Keyboard Arrays
Western Keyboard Arrays
Ideograph Keyboard Arrays
Chinese Input Method Keyboard Arrays
Zhuyin Keyboard Arrays
Kana Keyboard Arrays
Hangul Keyboard Arrays
Latin Keyboard Arrays for CJKV Input
Mobile Keyboard Arrays
Other Input Hardware
Pen Input
Optical Character Recognition
Voice Input
Input Method Software
CJKV Input Method Software
Chinese Input Method Software
Japanese Input Method Software
Korean Input Method Software
Chapter 6: Font Formats, Glyph Sets, and Font Tools
Typeface Design
How Many Glyphs Can a Font Include?
Composite Fonts Versus Fallback Fonts
Breaking the 64K Glyph Barrier
Bitmapped Font Formats
BDF Font Format
HBF Font Format
Outline Font Formats
PostScript Font Formats
TrueType Font Formats
OpenType—PostScript and TrueType in Harmony
Glyph Sets
Static Versus Dynamic Glyph Sets
CID Versus GID
Std Versus Pro Designators
Glyph Sets for Transliteration and Romanization
Character Collections for CID-Keyed Fonts
Ruby Glyphs
Generic Versus Typeface-Specific Ruby Glyphs
Host-Installed, Printer-Resident, and Embedded Fonts
Installing and Downloading Fonts
The PostScript Filesystem
Mac OS X
Mac OS 9 and Earlier
Microsoft Windows—2000, XP, and Vista
Microsoft Windows—Versions 3.1, 95, 98, ME, and NT4
Unix and Linux
X Window System
Font and Glyph Embedding
Cross-Platform Issues
Font Development Tools
Bitmapped Font Editors
Outline Font Editors
Outline Font Editors for Larger Fonts
AFDKO—Adobe Font Development Kit for OpenType
TTX/FontTools
Font Format Conversion
Gaiji Handling
The Gaiji Problem
SING—Smart INdependent Glyphlets
Ideographic Variation Sequences
XKP, A Gaiji Handling Initiative—Obsolete
Adobe Type Composer (ATC)—Obsolete
Composite Font Functionality Within Applications
Gaiji Handling Techniques and Tricks
Creating Your Own Rearranged Fonts
Acquiring Gaiji Glyphs and Gaiji Fonts
Advice to Developers
Chapter 7: Typography
Rules, Principles, and Techniques
JIS X 4051:2004 Compliance
GB/T 15834-1995 and GB/T 15835-1995
Typographic Units and Measurements
Two Important Points—Literally
Other Typographic Units
Horizontal and Vertical Layout
Nonsquare Design Space
The Character Grid
Vertical Character Variants
Dedicated Vertical Characters
Vertical Latin Text
Line Breaking and Word Wrapping
Character Spanning
Alternate Metrics
Half-Width Symbols and Punctuation
Proportional Symbols and Punctuation
Proportional Kana
Proportional Ideographs
Kerning
Line-Length Issues
Manipulating Symbol and Punctuation Metrics
Manipulating Inter-Glyph Spacing
JIS X 4051:2004 Character Classes
Multilingual Typography
Latin Baseline Adjustment
Proper Spacing of Latin and CJKV Characters
Mixing Latin and CJKV Typeface Designs
Glyph Substitution
Character and Glyph Variants
Ligatures
Annotations
Ruby Glyphs
Inline Notes—Warichu
Other Annotations
Typographic Applications
Page-Layout Applications
Graphics Applications
Advice to Developers
Chapter 8: Output Methods
Where Can Fonts Live?
Output via Printing
PostScript CJKV Printers
Genuine PostScript
Clone PostScript
Passing Characters to PostScript
Output via Display
Adobe Type Manager—ATM
SuperATM
Adobe Acrobat and PDF
Ghostscript
OpenType and TrueType
Other Printing Methods
The Role of Printer Drivers
Microsoft Windows Printer Drivers
Mac OS X Printer Drivers
Output Tips and Tricks
Creating CJKV Documents for Non-CJKV Systems
Advice to Developers
CJKV-Capable Publishing Systems
Some Practical Advice
Chapter 9: Information Processing Techniques
Language, Country, and Script Codes
CLDR—Common Locale Data Repository
Programming Languages
C/C++
Java
Perl
Python
Ruby
Tcl
Other Programming Environments
Code Conversion Algorithms
Conversion Between UTF-8, UTF-16, and UTF-32
Conversion Between ISO-2022 and EUC
Conversion Between ISO-2022 and Row-Cell
Conversion Between ISO-2022-JP and Shift-JIS
Conversion Between EUC-JP and Shift-JIS
Other Code Conversion Types
Java Programming Examples
Java Code Conversion
Java Text Stream Handling
Java Charset Designators
Miscellaneous Algorithms
Japanese Code Detection
Half- to Full-Width Katakana Conversion—in Java
Encoding Repair
Byte Versus Character Handling
Character Deletion
Character Insertion
Character Searching
Line Breaking
Character Attribute Detection Using C Macros
Character Sorting
Natural Language Processing
Word Parsing and Morphological Analysis
Spelling and Grammar Checking
Chinese-Chinese Conversion
Special Transliteration Considerations
Regular Expressions
Search Engines
Code-Processing Tools
JConv—Code Conversion Tool
JChar—Character Set Generation Tool
CJKV Character Set Server
JCode—Text File Examination Tool
Other Useful Tools and Resources
Chapter 10: OSes, Text Editors, and Word Processors
Viewing CJKV Text Using Non-CJKV OSes
AsianSuite X2—Microsoft Windows
NJStar CJK Viewer—Microsoft Windows
TwinBridge Language Partner—Microsoft Windows
Operating Systems
FreeBSD
Linux
Mac OS X
Microsoft Windows Vista
MS-DOS
Plan 9
Solaris and OpenSolaris
TRON and Chokanji
Unix
Hybrid Environments
Boot Camp—Run Windows on Apple Hardware
CrossOver Mac—Run Windows Applications on Mac OS X
GNOME—Linux and Unix
KDE—Linux and Unix
VMware Fusion—Run Windows on Mac OS X
Wine—Run Windows on Unix, Linux, and Other OSes
X Window System—Unix
Text Editors
Mac OS X Text Editors
Windows Text Editors
Vietnamese Text Editing
Emacs and GNU Emacs
vi and Vim
Word Processors
AbiWord
Haansoft Hangul—Microsoft Windows
Ichitaro—Microsoft Windows
KWord
Microsoft Word—Microsoft Windows and Mac OS X
Nisus Writer—Mac OS X
NJStar Chinese/Japanese WP—Microsoft Windows
Pages—Mac OS X
Online Word Processors
Adobe Buzzword
Google Docs
Advice to Developers
Chapter 11: Dictionaries and Dictionary Software
Ideograph Dictionary Indexes
Reading Index
Radical Index
Stroke Count Index
Other Indexes
Ideograph Dictionaries
Character Set Standards As Ideograph Dictionaries
Locale-Specific Ideograph Dictionaries
Vendor Ideograph Dictionaries and Ideograph Tables
CJKV Ideograph Dictionaries
Other Useful Dictionaries
Conventional Dictionaries
Variant Ideograph Dictionaries
Dictionary Hardware
Dictionary Software
Dictionary CD-ROMs
Frontend Software for Dictionary CD-ROMs
Dictionary Files
Frontend Software for Dictionary Files
Web-Based Dictionaries
Machine Translation Applications
Machine Translation Services
Free Machine Translation Services
Commercial Machine Translation Services
Language-Learning Aids
Chapter 12: Web and Print Publishing
Line-Termination Concerns
Email
Sending Email
Receiving Email
Email Troubles and Tricks
Email Clients
Network Domains
Internationalized Domain Names
The CN Domain
The HK Domain
The JP Domain
The KR Domain
The TW Domain
The VN Domain
Content Versus Presentation
Web Publishing
Web Browsers
Displaying Web Pages
HTML—HyperText Markup Language
Authoring HTML Documents
Web-Authoring Tools
Embedding CJKV Text As Graphics
XML—Extensible Markup Language
Authoring XML Documents
CGI Programming Examples
Print Publishing
PDF—Portable Document Format
Authoring PDF Documents
PDF Eases Publishing Pains
Where to Go Next?
Appendix A: Code Conversion Tables
Appendix B: Notation Conversion Table
Appendix C: Perl Code Examples
Appendix D: Glossary
Appendix E: Vendor Character Set Standards
Appendix F: Vendor Encoding Methods
Appendix G: Chinese Character Sets—China
Appendix H: Chinese Character Sets—Taiwan
Appendix I: Chinese Character Sets—Hong Kong
Appendix J: Japanese Character Sets
Appendix K: Korean Character Sets
Appendix L: Vietnamese Character Sets
Appendix M: Miscellaneous Character Sets
Bibliography
Index

Content preview from CJKV Information Processing, 2nd Edition

Chapter 9: Information Processing Techniques

hangul instead of hanja, and the intervening spaces help in the effort to parse them into the constituent parts.

Fujitsu Laboratories in Japan had developed a Japanese morphological analyzer called Breakfast that had the ability to parse Japanese text into morphemes, and had a customizable POS (part-of-speech) system. This customizable feature enabled Breakfast to use the dictionaries of JUMAN (juman)† and ChaSen (chasen),‡ both of which still seem to be available in some form, though Breakfast seems to have gone away. Another called Sumomo (sumomo), developed by NTT, also seems to have gone away. Chances are, Breakfast and Sumomo were either sold and renamed, or simply renamed. In addition to JUMAN and ChaSen, other more current Japanese morphological analyzers include MeCab (mekabu) and KAKASI (Kanji Kana Simple Inverter).
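
To make "parsing text into morphemes" concrete, here is a minimal sketch of the simplest dictionary-driven approach, greedy forward longest-match segmentation. It is an illustration only, not how Breakfast, JUMAN, ChaSen, MeCab, or KAKASI work; real analyzers are considerably more sophisticated. The function name `segment` and the tiny `TOY_LEXICON` are assumptions made up for this example.

```python
def segment(text, lexicon):
    """Split text into tokens by greedy forward longest match.

    At each position, take the longest lexicon entry that matches;
    if nothing matches, emit a single character and move on.
    """
    # Longest entry in the lexicon bounds how far ahead we need to look.
    max_len = max(map(len, lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon; a real system would load a large dictionary.
TOY_LEXICON = {"東京", "東京都", "京都", "に", "住む"}

print(segment("東京都に住む", TOY_LEXICON))
```

The greedy choice is what lets "東京都" win over the shorter "東京" here. It is also the main weakness of the approach: it never reconsiders an early match, which is one reason production analyzers score competing segmentations rather than taking the first long match.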

Without a doubt, Basis Technology’s Rosette Base Linguistics for Asian Languages, which provides a morphological analyzer that supports Chinese, Japanese, and Korean, is one of the top-performing libraries of its kind. The fact that Google and Amazon use it states something about its effectiveness. Its abilities include segmentation, tokenization, noun decompounding, part-of-speech tagging, sentence boundary detection, and other analyzing functions. They also provide related software, ...

Publisher Resources

ISBN: 9780596156114
