Book HomeXML in a Nutshell

5.4. XML-Defined Character Sets

An XML parser is required to handle the UTF-16 and UTF-8 encodings or Unicode (about which more follows). However, XML parsers are allowed to understand and process many other character sets. In particular, the specification recommends that processors recognize and be able to read these encodings:

UTF-8

UTF-16

ISO-10646-UCS-2

ISO-10646-UCS-4

ISO-8859-1

ISO-8859-2

ISO-8859-3

ISO-8859-4

ISO-8859-5

ISO-8859-6

ISO-8859-7

ISO-8859-8

ISO-8859-9

ISO-2022-JP

Shift_JIS

EUC-JP

Many XML processors understand other legacy encodings. For instance, processors written in Java often understand all character sets available in a typical Java virtual machine. For a list, see http://java.sun.com/products/jdk/1.3/docs/guide/intl/encoding.doc.html. Furthermore, some processors may recognize aliases for these encodings; both Latin-1 and 8859_1 are sometimes used as synonyms for ISO-8859-1. However, using these names limits your document's portability. We recommend that you use standard names for standard encodings. For encodings whose standard name isn't given by the XML 1.0 specification, use one of the names registered with the Internet Assigned Numbers Authority (IANA) and listed at ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets. However, knowing the name of a character set and saving a file in that set does not mean that your XML parser can read such a file. XML parsers are only required to support UTF-8 and UTF-16. They are not required to support the hundreds of different legacy encodings used around the world.



Library Navigation Links

Copyright © 2002 O'Reilly & Associates. All rights reserved.