Dealing with Non-Western European Input
An HTML form can be used for input in languages other than Western European, but the charset discussed earlier comes into play here as well. First of all, when you create a page with a form for entering non-Western European characters, you must tell the browser which charset should be used for the user input. One way to give the browser this information is to hardcode a charset name as part of the contentType attribute of the page directive, as in Figure 14-4:
<%@ page pageEncoding="Shift_JIS" contentType="text/html;charset=UTF-8" %>
The user can then enter values with the characters of the corresponding language (e.g., Japanese symbols).
But there’s something else to be aware of here. When the user submits the form, the browser first converts the form-field values to the corresponding byte values for the specified charset. It then encodes the resulting bytes according to the HTTP standard URL encoding scheme, the same way special characters such as space and semicolon are converted when an ISO-8859-1 encoding is used. The letters a-z and A-Z, the digits 0-9, and a few punctuation characters are sent as is; every other byte is encoded as its value in hexadecimal format, preceded by a percent sign. For instance, the symbols for “Hello World” in Japanese are sent like the following if the charset for the form is set to UTF-8:
This code represents the URL-encoded UTF-8 byte codes for the five Japanese symbols (three bytes for each ...
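The two conversion steps described above (characters to bytes in the form's charset, then bytes to percent-encoded text) can be reproduced with the standard java.net.URLEncoder class. The following is a minimal sketch, not the browser's actual implementation. The class name FormEncodingSketch and the sample greeting (five hiragana characters, written as Unicode escapes) are assumptions chosen for illustration and are not necessarily the string used in the example above. It relies on the Charset overloads of URLEncoder.encode and URLDecoder.decode, which require Java 10 or later.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, used only to illustrate the encoding steps.
class FormEncodingSketch {

    // Convert the value to bytes in the given charset, then URL-encode
    // those bytes, roughly as a browser does when it submits a form.
    public static String encode(String value, Charset charset) {
        return URLEncoder.encode(value, charset);
    }

    // Reverse the process: percent-decode to bytes, then interpret the
    // bytes in the given charset. The charset must match the one the
    // form was submitted with, or the result is garbled.
    public static String decode(String encoded, Charset charset) {
        return URLDecoder.decode(encoded, charset);
    }

    public static void main(String[] args) {
        // Assumed sample input: a five-character Japanese greeting.
        String greeting = "\u3053\u3093\u306B\u3061\u306F";

        String encoded = encode(greeting, StandardCharsets.UTF_8);
        // Each character becomes three UTF-8 bytes, each written as %XX.
        System.out.println(encoded);
        // prints %E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF

        // Decoding with the same charset restores the original text.
        System.out.println(decode(encoded, StandardCharsets.UTF_8).equals(greeting));
        // prints true
    }
}
```

The encoded form depends entirely on the charset. Under Shift_JIS, for example, the same five characters occupy two bytes each instead of three, so the receiving side has to know which charset the form was submitted with before it can decode the values correctly.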