Chapter 20. Internationalization and Localization

As mentioned in “Text”, strings in PHP are sequences of bytes. A byte can hold any one of 256 possible values. Each English character (the US-ASCII character set) fits in a single byte, so representing text that uses only those characters is straightforward in PHP. You must take extra steps, however, to ensure that processing text that contains other kinds of characters works properly.

The Unicode standard assigns a unique number, called a code point, to each of the many thousands of characters that computers can represent. In addition to letters such as ä, ñ, ž, λ, ד, د, and ド, the standard also includes a variety of symbols and icons. The UTF-8 encoding defines which bytes represent each character. Each US-ASCII character is represented by a single byte, but other characters require two, three, or four bytes.
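
To see the varying byte lengths for yourself, here is a minimal sketch that is not part of the original text. It assumes PHP 7.0 or later, because it uses the \u{...} escape syntax to build each character from its code point, and the labels in the $samples array are made up for illustration. The built-in strlen() function counts bytes, not characters, so it reports how many bytes UTF-8 uses for each one.

```php
<?php
// Build each sample from its Unicode code point. Inside double quotes,
// PHP 7.0+ expands \u{...} to the UTF-8 bytes for that code point.
$samples = [
    'Latin a'       => "a",          // US-ASCII letter
    'n with tilde'  => "\u{00F1}",   // U+00F1
    'Katakana do'   => "\u{30C9}",   // U+30C9
    'grinning face' => "\u{1F600}",  // U+1F600
];

$byteCounts = [];
foreach ($samples as $label => $char) {
    // strlen() counts bytes, so it shows the size of the UTF-8 encoding.
    $byteCounts[$label] = strlen($char);
    // bin2hex() displays the raw bytes as hexadecimal digits.
    printf("%-14s %d byte(s): %s\n", $label, $byteCounts[$label], bin2hex($char));
}
```

Each sample is a single character, but only the first one has a byte length of 1. This is why byte-oriented functions such as strlen() can give surprising answers for text outside US-ASCII.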

You probably don’t have to do anything special to ensure your PHP installation uses UTF-8 for text processing. The default_charset configuration variable controls what encoding is used, and its default value is UTF-8. If you are having problems, make sure default_charset is set to UTF-8.
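
If you want to check the setting from inside a program, one possible approach is sketched below. It is an illustration, not a required step: it reads default_charset with ini_get() and, only if the value is something other than UTF-8, overrides it for the current request with ini_set(). Changing the value in php.ini is the more permanent fix.

```php
<?php
// Read the current setting. ini_get() returns the value as a string.
$charset = ini_get('default_charset');

// Compare case-insensitively, since "utf-8" and "UTF-8" name the same encoding.
if (strcasecmp((string) $charset, 'UTF-8') !== 0) {
    // Override the setting for this request only; php.ini is not modified.
    ini_set('default_charset', 'UTF-8');
}

echo ini_get('default_charset'), "\n";
```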

This chapter tours the basics of successfully working with multibyte UTF-8 characters in your PHP programs. The next section, “Manipulating Text”, explains basic text manipulations, such as calculating length and extracting substrings. “Sorting and Comparing” shows how to sort and compare strings in ways that respect different languages’ rules for the proper order of characters. ...
