Book HomeXML in a Nutshell

5.9. The Default Character Set for XML Documents

Before an XML parser can read a document, it must know which character set and encoding the document uses. In some cases, external metainformation tells the parser what encoding the document uses. For instance, an HTTP header may include a Content-type header like this:

Content-type: text/html; charset=ISO-8859-1

However, XML parsers generally can't count on the availability of such information. Even if they can, they can't necessarily assume that it's accurate. Therefore, an XML parser will attempt to guess the character set based on the first several bytes of the document. The main checks the parser makes include the following:

Parsers that understand EBCDIC or UCS-4 may also apply similar heuristics to detect those encodings. However, UCS-4 isn't really used yet and is mostly of theoretical interest, and EBCDIC is a legacy family of character sets that shouldn't be used in new documents. Neither of these sets are important in practice.



Library Navigation Links

Copyright © 2002 O'Reilly & Associates. All rights reserved.