XML in a Nutshell

5.6. ISO Character Sets

Unicode has only recently become popular. Previously, the space and processing costs associated with Unicode files prompted vendors to prefer smaller, single-byte character sets that could handle English and a few other languages of interest, but not the full panoply of human language. The International Organization for Standardization (ISO) has standardized 15 of these character sets as parts of ISO standard 8859. For all of these single-byte character sets, characters 0 through 127 are identical to the ASCII character set; characters 128 through 159 are the C1 controls; and characters 160 through 255 are the additional characters needed for scripts such as Greek, Cyrillic, and Turkish.
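The layout described above can be checked with a short Python sketch. It assumes only the codec names that Python's standard library registers for these sets; the helper name decode_byte is made up for illustration. A byte in the ASCII range decodes identically under every part, while a byte in the 160 through 255 range means something different in each.

```python
# Decode the same byte values under several ISO-8859 parts.
# The codec names are the ones registered by Python's standard library.
ENCODINGS = ["iso-8859-1", "iso-8859-5", "iso-8859-7", "iso-8859-9"]


def decode_byte(value, encoding):
    """Return the single character that byte `value` (0-255) maps to."""
    return bytes([value]).decode(encoding)


# 0x41 lies in the ASCII range (0-127), so every part agrees on it.
ascii_results = {enc: decode_byte(0x41, enc) for enc in ENCODINGS}

# 0xE4 lies in the 160-255 range, so each part assigns its own character.
high_results = {enc: decode_byte(0xE4, enc) for enc in ENCODINGS}

for enc in ENCODINGS:
    print(enc, repr(ascii_results[enc]), repr(high_results[enc]))
```

The practical consequence is that a file of single-byte text cannot be read correctly unless the reader knows which part of ISO 8859 was used to write it.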

ISO-8859-1 (Latin-1)
ASCII plus the accented letters and other characters needed for most Latin-alphabet Western European languages, including Danish, Dutch, Finnish, French, German, Icelandic, Italian, Norwegian, Portuguese, Spanish, and Swedish.

ISO-8859-2 (Latin-2)
ASCII plus the accented letters and other characters needed to write most Latin-alphabet Central and Eastern European languages, including Czech, English, German, Hungarian, Polish, Romanian, Croatian, Slovak, Slovenian, and Sorbian.

ISO-8859-3 (Latin-3)
ASCII plus the accented letters and other characters needed to write Esperanto, Maltese, and Turkish.

ISO-8859-4 (Latin-4)
ASCII plus the accented letters and other characters needed to write the languages of the Baltic and Nordic regions, including Estonian, Latvian, Lithuanian, Greenlandic, and Lappish (Sami). Now deprecated; new applications should use 8859-10 (Latin-6) or 8859-13 (Latin-7) instead.

ISO-8859-5
ASCII plus the Cyrillic alphabet used for Russian and many other languages of the former Soviet Union and other Slavic countries, including Bulgarian, Byelorussian, Macedonian, Serbian, and Ukrainian.

ISO-8859-6
ASCII plus basic Arabic. However, the character set doesn't have the extra letters needed for non-Arabic languages written in the Arabic script, such as Farsi and Urdu.

ISO-8859-7
ASCII plus modern Greek. This set does not have the extra letters and accents necessary for ancient and Byzantine Greek.

ISO-8859-8
ASCII plus the Hebrew script used for Hebrew and Yiddish.

ISO-8859-9 (Latin-5)
Essentially the same as Latin-1, except six letters used in Icelandic have been replaced with six letters used in Turkish.

ISO-8859-10 (Latin-6)
ASCII plus the accented letters and other characters needed to write the languages of the Baltic and Nordic regions, including Estonian, Icelandic, Latvian, Lithuanian, Greenlandic, and Lappish (Sami).

ISO-8859-11
ASCII plus Thai.

ISO-8859-13 (Latin-7)
Yet another attempt to cover the Baltic region properly. Very similar to Latin-6, except for some quotation marks.

ISO-8859-14 (Latin-8)
ASCII plus the accented letters needed to write the Celtic languages, including Gaelic and Welsh.

ISO-8859-15 (Latin-9, Latin-0)
A revised version of Latin-1 that replaces some unnecessary symbols, such as ¼, with extra French and Finnish letters. Instead of the international currency sign (¤), this set includes the euro sign (€).

ISO-8859-16 (Latin-10)
A revised version of Latin-2 that works better for Romanian. Other languages supported by this character set include Albanian, Croatian, English, Finnish, French, German, Hungarian, Italian, Polish, and Slovenian.
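Several of the sets above are described as small revisions of Latin-1. As a rough illustration, the following Python sketch lists the byte values at which two sets disagree. The function name differing_bytes is hypothetical, and the sketch relies on Python's built-in codecs for these three sets defining all 256 byte values.

```python
def differing_bytes(enc_a, enc_b):
    """Return {byte: (char_a, char_b)} for bytes 0-255 that decode differently."""
    diffs = {}
    for value in range(256):
        raw = bytes([value])
        a = raw.decode(enc_a)
        b = raw.decode(enc_b)
        if a != b:
            diffs[value] = (a, b)
    return diffs


# Latin-1 versus Latin-5 (Turkish) and Latin-1 versus Latin-9 (euro sign).
latin5 = differing_bytes("iso-8859-1", "iso-8859-9")
latin9 = differing_bytes("iso-8859-1", "iso-8859-15")

for name, diffs in (("Latin-5", latin5), ("Latin-9", latin9)):
    print(name, "differs from Latin-1 at", len(diffs), "positions")
    for value, (old, new) in sorted(diffs.items()):
        print(f"  0x{value:02X}: {old} -> {new}")
```

Because the differences are confined to a handful of positions, text mislabeled as Latin-1 often looks almost right, which makes this kind of encoding error easy to overlook.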

Various national standards bodies have produced other character sets to cover scripts and languages of interest within their geographic and political boundaries. For example, the Korea Industrial Standards Association developed the KS C 5601-1992 standard for encoding Korean. These national standard character sets can be used in XML documents as well, provided that you include the proper encoding declaration in the document and your parser knows how to translate these character sets into Unicode.
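The role of the encoding declaration can be sketched in Python with the standard library's xml.etree.ElementTree module. The example uses ISO-8859-7 rather than a national standard, on the assumption that the underlying parser can map this single-byte set to Unicode (multi-byte sets may not be supported by every parser). The element name greeting and its content are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A short Greek word; each letter occupies one byte in ISO-8859-7.
greeting = "\u0393\u03b5\u03b9\u03ac"

# With the encoding declaration, the parser knows how to read the bytes.
declared = (
    '<?xml version="1.0" encoding="ISO-8859-7"?>\n'
    "<greeting>" + greeting + "</greeting>"
).encode("iso-8859-7")

root = ET.fromstring(declared)
print(root.text == greeting)

# Without a declaration, the parser assumes UTF-8, and these bytes
# are not valid UTF-8, so the document is rejected.
undeclared = ("<greeting>" + greeting + "</greeting>").encode("iso-8859-7")
try:
    ET.fromstring(undeclared)
    undeclared_failed = False
except ET.ParseError:
    undeclared_failed = True
print(undeclared_failed)
```

Once parsed, the text is ordinary Unicode; the original byte encoding matters only at the boundary where the document is read or written.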




Copyright © 2002 O'Reilly & Associates. All rights reserved.