Dealing with Non-Western European Input

An HTML form can be used for input in languages other than Western European, but the charset discussed earlier comes into play here as well. First of all, when you create a page with a form for entering non-Western European characters, you must tell the browser which charset should be used for the user input. One way to give the browser this information is to hardcode a charset name as part of the contentType attribute of the page directive, as in Figure 14-4:

<%@ page pageEncoding="Shift_JIS" 
  contentType="text/html;charset=UTF-8" %>

The user can then enter values with the characters of the corresponding language (e.g., Japanese symbols).

But there’s something else to be aware of here. When the user submits the form, the browser first converts the form-field values to the corresponding byte values for the specified charset. It then encodes the resulting bytes according to the HTTP standard URL encoding scheme, the same way special characters such as space and semicolon are converted when an ISO-8859-1 encoding is used. The bytes for all characters other than a-z, A-Z, and 0-9 (and a few punctuation characters that are sent as is) are encoded as the byte value in hexadecimal format, preceded by a percent sign. For instance, the symbols for “Hello World” in Japanese are sent like the following if the charset for the form is set to UTF-8:

%E4%BB%8A%E6%97%A5%E3%81%AF%E4%B8%96%E7%95%8C

This code represents the URL-encoded UTF-8 byte codes for the five Japanese symbols (three bytes for each ...
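The two-step conversion described above (characters to bytes in the form's charset, then bytes to percent escapes) can be reproduced with the standard java.net.URLEncoder and java.net.URLDecoder classes. The following is only a minimal sketch: the class name FormEncodingDemo and the helpers encodeField and decodeField are hypothetical names made up for illustration, and the Japanese text is written with Unicode escapes so the source file's own encoding does not matter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class; the name and helpers are illustrative, not from the book.
final class FormEncodingDemo {

    // Mimic what a browser does with a form-field value:
    // characters -> bytes in the given charset -> %XX escapes.
    static String encodeField(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    // Reverse the process: %XX escapes -> bytes -> characters in the given charset.
    static String decodeField(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // The five Japanese symbols from the example, as Unicode escapes.
        String hello = "\u4ECA\u65E5\u306F\u4E16\u754C";

        String encoded = encodeField(hello, "UTF-8");
        System.out.println(encoded);
        // prints %E4%BB%8A%E6%97%A5%E3%81%AF%E4%B8%96%E7%95%8C

        // Decoding with the same charset restores the original text.
        System.out.println(decodeField(encoded, "UTF-8").equals(hello));      // prints true

        // Decoding with a different charset does not: each byte becomes its own character.
        System.out.println(decodeField(encoded, "ISO-8859-1").equals(hello)); // prints false
    }
}
```

The last line of main hints at why the charset matters on the receiving side too: the percent escapes carry only bytes, so whoever decodes them has to know which charset the browser used to produce them.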
