4.2. Displaying Foreign and Special Characters


You need to display words, or entire pages of text, in a language other than the primary one used by your site and audience.


Use a <meta> tag immediately after the opening <head> tag to declare a character set on every page of your site:

	<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Alternatively, override your server's default character set with an .htaccess file that specifies a different character set for a particular file:

	<Files "russian.html">
	 AddCharset windows-1251 .html
	</Files>


It's safe to assume that your web site and its audience have a primary language, whether it's English, Greek, Japanese, or something else. You create the pages in that language, and web surfers view those pages without giving much thought to how the site appears in their language.

Behind the scenes, though, the character sets enabled on your web server and in your visitors' browsers make sure everything looks as it should. Offering basic multilingual site information (for example, "About Us" pages in Russian, Japanese, and Arabic) presents a problem for sites that are otherwise served and viewed with a character set that lacks the characters those languages require. Only one character set can be used on a web page, and it can't be changed mid-page. Before getting into the problem of mixing words and pages from other languages and alphabets into your web site, though, let me give you a quick overview of how character sets work on the Web.

The first widely used character set for electronic documents was the American Standard Code for Information Interchange (ASCII), developed in the early 1960s and first published as a standard in 1963. It assigns machine-readable codes to the upper- and lowercase Roman alphabet, digits, punctuation marks, and control characters such as line feeds and tabs. There are 128 characters in what is now called the US-ASCII character set.

In the mid-1980s, the European Computer Manufacturers Association (ECMA) expanded on ASCII by introducing a family of 256-character sets that cover the languages and alphabets of Europe and the Middle East, from Iceland to Yemen and everything in between. Endorsed by the International Organization for Standardization (ISO), each character set in what has come to be known as the ISO-8859 family retains the first 128 characters of ASCII and adds, in the second half of the set, the characters needed for a particular script or group of languages, such as Arabic or Cyrillic. Many, if not most, English-language web sites use the ISO-8859-1 character set, also known as Latin 1. Characters from the second half of any ISO-8859 set, which include symbols and accented characters, should be encoded as named or numerical entities to ensure their proper display. For example, an "é" would be represented in HTML code as &eacute; or &#233;.


The ampersand character (&) marks the beginning of special named or numerical entity codes for special characters in HTML. The number sign (#) precedes the numerical code for a character entity (and follows the &). Both named and numerical entities end with a semicolon. To show a literal ampersand character on a web page, convert it to its numerical (&#38;) or named (&amp;) entity to prevent browsers from misinterpreting it as the start of a character entity.
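The entity syntax described above can be checked outside a browser. The following is a minimal sketch that assumes Python's standard html module as a stand-in for a browser's entity handling; the sample string "Fish & Chips" is made up for illustration.

```python
import html

# Named and numerical entities for the same character decode identically.
named = html.unescape("&eacute;")
numeric = html.unescape("&#233;")

# A literal ampersand has to be written as an entity itself.
escaped = html.escape("Fish & Chips")

print(named, numeric, escaped)
```

Running the escaped text back through html.unescape recovers the original string, which is roughly what a browser does when it renders the page.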

Unicode represents a great leap forward in the internationalization of electronic communication. As the Unicode web site (http://www.unicode.org) puts it, "When the world wants to talk, it speaks Unicode." At nearly 100,000 characters in the recently released Version 4, Unicode incorporates all the characters from the various ISO-8859 sets, and then some. And, conveniently, the first 256 characters are a one-to-one match with the Latin 1 character set.
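The overlap between Latin 1 and the start of Unicode is easy to confirm. This short sketch, which assumes Python's built-in codecs, decodes every possible Latin 1 byte and compares the result with Unicode code points 0 through 255; the variable names are illustrative only.

```python
# Decode every possible Latin 1 (ISO-8859-1) byte value and compare the
# resulting characters with Unicode code points 0 through 255.
latin1_chars = bytes(range(256)).decode("iso-8859-1")
unicode_chars = "".join(chr(code_point) for code_point in range(256))
overlap_holds = latin1_chars == unicode_chars

# The first 128 of those characters are the US-ASCII set.
ascii_chars = bytes(range(128)).decode("ascii")
ascii_is_prefix = latin1_chars[:128] == ascii_chars

print(overlap_holds, ascii_is_prefix)
```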

All of these character sets and several others are available to web browsers and web servers. Your Apache web server may have one or more character sets enabled in its configuration file. Web browsers use the default character set defined in their preferences settings, although most can switch to another available character set when instructed to do so by the Content-Type HTTP header sent to the web browser by the web server before the rest of the page.

That's how it's supposed to work—ideally. But in shared hosting environments from which sites in a variety of languages may be served, the web server may not send the correct Content-Type header. Or it may not send one at all. You can play it safe, though, by specifying the character set with a <meta> tag on every web page.

Here is the structure of a <meta> tag for displaying a page's contents using the Latin 1 (ISO-8859-1) character set:

	<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

If you need to mix languages on every page, say, by displaying "About Us" links translated to the various languages for which you offer content in the site's main navigation, then you should use the Unicode character set.

Given Unicode's vast repertoire of characters from nearly all the world's languages, plus its overlap with Latin 1, your English-language content will not need any special treatment when you specify Unicode for your pages. Special characters from other languages can be encoded as Unicode decimal entities for proper display. Your web page editor may offer a function for encoding characters as Unicode entities. If not, the online resources listed in the "See Also" section of this recipe can help. The <meta> tag for Unicode names UTF-8, the encoding of Unicode most widely used on the Web, and looks like this:

	<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
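If your editor has no function for encoding characters as Unicode decimal entities, a few lines of script can do the job. The sketch below assumes Python and its standard "xmlcharrefreplace" codec error handler; the function name to_decimal_entities and the sample link text are hypothetical.

```python
import html


def to_decimal_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal character entity."""
    # The "xmlcharrefreplace" error handler emits &#NNNN; for each character
    # the ASCII codec cannot represent and leaves ASCII text untouched.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# Hypothetical navigation labels: plain English passes through unchanged,
# while the Russian translation becomes a run of decimal entities.
about_us_english = to_decimal_entities("About Us")
about_us_russian = to_decimal_entities("О нас")

print(about_us_english)
print(about_us_russian)
```

The output is pure ASCII, so it can be pasted into a page regardless of the encoding your editor saves in.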

However, two problems—one big, one small—can occur using the <meta> tag method for specifying Unicode content. First, the web server character set configuration (if the web server actually does send one in the HTTP header) trumps a document's setting, so a browser might not shift its character set to display the page properly, even when told to do so by a <meta> tag.
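The precedence just described can be modeled in a few lines. This is a simplified sketch in Python, not a description of how any particular browser is implemented: it assumes only the two sources discussed here (the HTTP header and the <meta> tag), and the names charset_from_content_type, MetaCharsetParser, and effective_charset are invented for the example.

```python
from email.message import Message
from html.parser import HTMLParser
from typing import Optional


def charset_from_content_type(value: Optional[str]) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, if any."""
    if not value:
        return None
    message = Message()
    message["Content-Type"] = value
    return message.get_content_charset()


class MetaCharsetParser(HTMLParser):
    """Record the charset from the first <meta http-equiv="Content-Type"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() == "content-type":
            self.charset = charset_from_content_type(attributes.get("content"))


def effective_charset(header_value: Optional[str], page_source: str) -> Optional[str]:
    """Apply the precedence described above: HTTP header first, then <meta>."""
    header_charset = charset_from_content_type(header_value)
    if header_charset:
        return header_charset
    parser = MetaCharsetParser()
    parser.feed(page_source)
    return parser.charset


page = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head></html>'
print(effective_charset("text/html; charset=ISO-8859-1", page))
print(effective_charset("text/html", page))
```

In the first call the server's header wins even though the page asks for UTF-8; in the second, the header carries no charset, so the <meta> tag decides.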

If your <meta>-tagged pages don't look right, you might need to override your server's default character set for a directory or specific file in a directory with an .htaccess file. Add this line to the .htaccess file you create or modify in your site's root directory:

	AddType 'text/html; charset=utf-8' .html

Alternatively, you also can modify the server character set for specific files in a directory. For example, your directory of "About Us" pages might have an index.html file in English, and translated versions named russian.html, japanese.html, and arabic.html, along with an .htaccess file that instructs the server to change the character set for a given file based on its name, like this:

	<Files "russian.html">
	 AddCharset windows-1251 .html
	</Files>
	<Files "japanese.html">
	 AddCharset Shift_JIS .html
	</Files>
	<Files "arabic.html">
	 AddCharset iso-8859-6 .html
	</Files>

The <meta> tag method also can cause a small problem in older browsers when they've already received a character set from the web server for a page. The second character set setting (in the <meta> tag) can cause older browsers to draw the page twice, which appears to visitors as an annoying screen flicker.

To minimize this glitch, the <meta> tag declaration of a web page's character set should always be on the first line following the <head> tag:

	<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Declaring a document's character set with a <meta> tag also can be useful when the web server is out of the picture, such as when pages are to be viewed offline, either from a CD or locally from a user's hard drive.

Finally, bear in mind that pages that mix languages and alphabets may require some effort on the part of your visitors to display correctly. Even if your pages include properly encoded entities for special characters (for example, &#1488; for the Hebrew letter aleph, which a numerical entity identifies by its Unicode code point rather than by its position in ISO-8859-8), a web browser may not be able to display those characters if a font containing Hebrew characters is not installed on the user's system. Similarly, site visitors may need to manually override their browser's default character set to see the content as you intend it to be viewed. If access to the multilingual content is critical to your web site or its audience, consider creating a help page showing a screenshot of the properly rendered page with instructions for users on how to configure their browsers and systems to achieve the same result.

See Also

The Unicode organization has several FAQs about using its vast character repertoire on web pages at http://www.unicode.org/faq/unicode_web.html. At FileFormat.info you can search for character entities in a number of different sets: http://www.fileformat.info/info/charset.
