Book HomePerl & XML

2.6. Unicode, Character Sets, and Encodings

At low levels, computers see text as a series of positive integer numbers mapped onto character sets, which are collections of numbered characters (and sometimes control codes) that some standards body created. A very common collection is the venerable US-ASCII character set, which contains 128 characters, including upper- and lowercase letters of the Latin alphabet, numerals, various symbols and space characters, and a few special print codes inherited from the old days of teletype terminals. By adding on the eighth bit, this 7-bit system is extended into a larger set with twice as many characters, such as ISO-Latin1, used in many Unix systems. These characters include other European characters, such as Latin letters with accents, Icelandic characters, ligatures, footnote marks, and legal symbols. Alas, humanity, a species bursting with both creativity and pride, has invented many more linguistic symbols than can be mapped onto an 8-bit number.

For this reason, a new character encoding architecture called Unicode has gained acceptance as the standard way to represent every written script in which people might want to store data (or write computer code). Depending on the flavor used, it uses up to 32 bits to describe a character, giving the standard room for millions of individual glyphs. For over a decade, the Unicode Consortium has been filling up this space with characters ranging from the entire Han Chinese character set to various mathematical, notational, and signage symbols, and still leaves the encoding space with enough room to grow for the coming millennium or two.

Given all this effort we're putting into hyping it, it shouldn't surprise you to learn that, while an XML document can use any type of encoding, it will by default assume the Unicode-flavored, variable-length encoding known as UTF-8. This encoding uses between one and six bytes to encode the number that represents the character's Unicode address and the character's length in bytes, if that address is greater than 255. It's possible to write an entire document in 1-byte characters and have it be indistinguishable from ISO Latin-1 (a humble address block with addresses ranging from 0 to 255), but if you need the occasional high character, or if you need a lot of them (as you would when storing Asian-language data, for example), it's easy to encode in UTF-8. Unicode-aware processors handle the encoding correctly and display the right glyphs, while older applications simply ignore the multibyte characters and pass them through unharmed. Since Version 5.6, Perl has handled UTF-8 characters with increasing finesse. We'll discuss Perl's handling of Unicode in more depth in Chapter 3, "XML Basics: Reading and Writing".



Library Navigation Links

Copyright © 2002 O'Reilly & Associates. All rights reserved.