Character Set Handling
Document type detection is one of the more important pieces of the content-processing puzzle, but it is certainly not the only one. For all types of text-based files rendered in the browser, one more determination needs to be made: The appropriate character set transformation must be identified and applied to the input stream. The output encoding sought by the browser is typically UTF-8 or UTF-16; the input, on the other hand, is up to the author of the page.
In the simplest scenario, the appropriate encoding method will be provided by the server in a charset parameter of the Content-Type header. In the case of HTML documents, the same information may also be conveyed to some extent through the <meta> directive. (The browser ...
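As a rough illustration of the two declaration sources named above, the following Python sketch reads the charset parameter from a Content-Type header value and, failing that, looks for a charset declaration in a <meta> tag near the start of the document. The function names, the 1024-byte scan limit, and the regular expression are all hypothetical choices made for this example, and the header-before-<meta> ordering is an assumption of the sketch; real browsers follow a considerably more involved detection procedure.

```python
import re
from email.message import Message

# Hypothetical, deliberately loose pattern: finds "charset=..." inside a <meta> tag.
# It covers both <meta charset="..."> and the http-equiv/content form.
_META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def charset_from_header(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def charset_from_meta(body, scan_limit=1024):
    """Return a charset declared in a <meta> tag within the first bytes, or None.

    scan_limit is an assumption of this sketch, not a value taken from the text.
    """
    match = _META_CHARSET.search(body[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None


def declared_charset(content_type, body):
    """Prefer the server-supplied header; fall back to the <meta> declaration."""
    return charset_from_header(content_type) or charset_from_meta(body)
```

For example, declared_charset("text/html", b'<meta charset="ISO-8859-2">') falls back to the document's own declaration, while a header that carries a charset parameter is used without inspecting the body at all.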