MySQL Localisation and International Usage

The Character Set Used for Data and Sorting

By default, MySQL uses the ISO-8859-1 (Latin1) character set with sorting according to Swedish/Finnish. This is the character set suitable in the USA and Western Europe.

All standard MySQL binaries are compiled with --with-extra-charsets=complex. This will add code to all standard programs to be able to handle latin1 and all multi-byte character sets within the binary. Other character sets will be loaded from a character-set definition file when needed.

The character set determines what characters are allowed in names and how things are sorted by the ORDER BY and GROUP BY clauses of the SELECT statement.

You can change the character set with the --default-character-set option when you start the server. The character sets available depend on the --with-charset=charset and --with-extra-charset= list-of-charset | complex | all options to configure, and the character set configuration files listed in SHAREDIR/charsets/Index. See Section 2.3.3.

If you change the character set when running MySQL (which may also change the sort order), you must run myisamchk -r -q on all tables. Otherwise, your indexes may not be ordered correctly.

When a client connects to a MySQL server, the server sends the default character set in use to the client. The client will switch to use this character set for this connection.

One should use mysql_real_escape_string( ) when escaping strings for a SQL query. mysql_real_escape_string( ) is identical to the old mysql_escape_string( ) function, except that it takes the MySQL connection handle as the first parameter.

If the client is compiled with different paths than where the server is installed and the user who configured MySQL didn’t include all character sets in the MySQL binary, one must specify for the client where it can find the additional character sets it will need if the server runs with a different character set than the client.

One can specify this by putting in a MySQL option file:

[client]
character-sets-dir=/usr/local/mysql/share/mysql/charsets

where the path points to the directory in which the dynamic MySQL character sets are stored.

One can force the client to use specific character set by specifying:

[client]
default-character-set=character-set-name

but normally this is never needed.

German character set

To get German sorting order, you should start mysqld with --default-character-set=latin_de. This will give you the following characteristics.

When sorting and comparing strings the following mapping is done on the strings before doing the comparison:

ä  ->  ae
ö  ->  oe
ü  ->  ue
ß  ->  ss

All accented characters are converted to their unaccented uppercase counterparts. All letters are converted to uppercase.

When comparing strings with LIKE the one -> two character mapping is not done. All letters are converted to uppercase. Accents are removed from all letters except: Ü, ü, Ö, ö, Ä, and ä.

Non-English Error Messages

mysqld can issue error messages in the following languages: Czech, Danish, Dutch, English (the default), Estonian, French, German, Greek, Hungarian, Italian, Japanese, Korean, Norwegian, Norwegian-ny, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, and Swedish.

To start mysqld with a particular language, use either the --language=lang or -L lang options. For example:

shell> mysqld --language=swedish

or:

shell> mysqld --language=/usr/local/share/swedish

Note that all language names are specified in lowercase.

The language files are located (by default) in mysql_base_dir/share/LANGUAGE/.

To update the error message file, you should edit the errmsg.txt file and execute the following command to generate the errmsg.sys file:

shell> comp_err errmsg.txt errmsg.sys

If you upgrade to a newer version of MySQL, remember to repeat your changes with the new errmsg.txt file.

Adding a New Character Set

To add another character set to MySQL, use the following procedure.

Decide if the set is simple or complex. If the character set does not need to use special string collating routines for sorting and does not need multi-byte character support, it is simple. If it needs either of those features, it is complex.

For example, latin1 and danish are simple character sets while big5 or czech are complex character sets.

In the following section, we have assumed that you named your character set MYSET.

For a simple character set do the following:

  1. Add MYSET to the end of the sql/share/charsets/Index file. Assign a unique number to it.

  2. Create the file sql/share/charsets/MYSET.conf. (You can use sql/share/charsets/latin1.conf as a base for this.)

    The syntax for the file is very simple:

    • Comments start with a '#' character and proceed to the end of the line.

    • Words are separated by arbitrary amounts of whitespace.

    • When defining the character set, every word must be a number in hexadecimal format.

    • The ctype array takes up the first 257 words. The to_lower, to_upper, and sort_order arrays take up 256 words each after that.

    Section 4.6.4.

  3. Add the character set name to the CHARSETS_AVAILABLE and COMPILED_CHARSETS lists in configure.in.

  4. Reconfigure, recompile, and test.

For a complex character set do the following:

  1. Create the file strings/ctype-MYSET.c in the MySQL source distribution.

  2. Add MYSET to the end of the sql/share/charsets/Index file. Assign a unique number to it.

  3. Look at one of the existing ctype-*.c files to see what needs to be defined—for example, strings/ctype-big5.c. Note that the arrays in your file must have names like ctype_MYSET, to_lower_MYSET, and so on. This corresponds to the arrays in the simple character set. See Section 4.6.4.

  4. Near the top of the file, place a special comment like this:

    /*
     * This comment is parsed by configure to create ctype.c,
     * so don't change it unless you know what you are doing.
     *
     * .configure. number_MYSET=MYNUMBER
     * .configure. strxfrm_multiply_MYSET=N
     * .configure. mbmaxlen_MYSET=N
     */

    The configure program uses this comment to include the character set into the MySQL library automatically.

    The strxfrm_multiply and mbmaxlen lines will be explained in the following sections. Only include these if you need the string collating functions or the multi-byte character set functions, respectively.

  5. You should then create some of the following functions:

    • my_strncoll_MYSET( )

    • my_strcoll_MYSET( )

    • my_strxfrm_MYSET( )

    • my_like_range_MYSET( )

    Section 4.6.5.

  6. Add the character set name to the CHARSETS_AVAILABLE and COMPILED_CHARSETS lists in configure.in.

  7. Reconfigure, recompile, and test.

The file sql/share/charsets/README includes some more instructions.

If you want to have the character set included in the MySQL distribution, mail a patch to .

The Character Definition Arrays

to_lower[] and to_upper[] are simple arrays that hold the lowercase and uppercase characters corresponding to each member of the character set. For example:

to_lower['A'] should contain 'a'
to_upper['a'] should contain 'A'

sort_order[] is a map indicating how characters should be ordered for comparison and sorting purposes. For many character sets, this is the same as to_upper[] (which means sorting will be case-insensitive). MySQL will sort characters based on the value of sort_order[character]. For more complicated sorting rules, see Section 4.6.5.

ctype[] is an array of bit values, with one element for one character. (Note that to_lower[], to_upper[], and sort_order[] are indexed by character value, but ctype[] is indexed by character value + 1. This is an old legacy to be able to handle EOF.)

You can find the following bitmask definitions in m_ctype.h:

#define _U      01      /* Uppercase */
#define _L      02      /* Lowercase */
#define _N      04      /* Numeral (digit) */
#define _S      010     /* Spacing character */
#define _P      020     /* Punctuation */
#define _C      040     /* Control character */
#define _B      0100    /* Blank */
#define _X      0200    /* heXadecimal digit */

The ctype[] entry for each character should be the union of the applicable bitmask values that describe the character. For example, 'A' is an uppercase character (_U) as well as a hexadecimal digit (_X), so ctype['A'+1] should contain the value:

_U + _X = 01 + 0200 = 0201

String Collating Support

If the sorting rules for your language are too complex to be handled with the simple sort_order[] table, you need to use the string collating functions.

Right now the best documentation on this is the character sets that are already implemented. Look at the big5, czech, gbk, sjis, and tis160 character sets for examples.

You must specify the strxfrm_multiply_MYSET=N value in the special comment at the top of the file. N should be set to the maximum ratio to which the strings may grow during my_strxfrm_MYSET (it must be a positive integer).

Multi-Byte Character Support

If you want to add support for a new character set that includes multi-byte characters, you need to use the multi-byte character functions.

Right now the best documentation on this is the character sets that are already implemented. Look at the euc_kr, gb2312, gbk, sjis, and ujis character sets for examples. These are implemented in the ctype-'charset'.c files in the strings directory.

You must specify the mbmaxlen_MYSET=N value in the special comment at the top of the source file. N should be set to the size in bytes of the largest character in the set.

Problems with Character Sets

If you try to use a character set that is not compiled into your binary, you can run into a couple of different problems:

  • Your program has a wrong path to where the character sets are stored. (The default is /usr/local/mysql/share/mysql/charsets.) This can be fixed by using the --character-sets-dir option to the program in question.

  • The character set is a multi-byte-character set that can’t be loaded dynamically. In this case you have to recompile the program with the support for the character set.

  • The character set is a dynamic character set, but you don’t have a configure file for it. In this case you should install the configure file for the character set from a new MySQL distribution.

  • Your Index file doesn’t contain the name for the character set.

    ERROR 1105: File '/usr/local/share/mysql/charsets/?.conf' not found
    (Errcode: 2)

    In this case you should either get a new Index file or add by hand the name of any missing character sets.

For MyISAM tables, you can check the character set name and number for a table with myisamchk -dvv table_name.

Get MySQL Reference Manual now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.