Coordinating character sets is only the first part of the challenge. Even languages that share a character set may have different rules for hyphenation, spacing, quotation marks, punctuation, and so on. In addition to character shapes (glyphs), issues such as directionality (whether the text reads left-to-right or right-to-left) and cursive joining behavior have to be taken into account as well.
This prompted a need for a system of language identification. The W3C responded by incorporating into HTML the language tags put forth in the RFC 2070 standard on internationalization.
The lang attribute can be added to most tags to specify the language of that element's content. Added to the <html> tag, it specifies the language of the entire document. The following example specifies the document's language as French:
<HTML LANG="fr">
It can also be used within text elements to switch to other languages within a document; for example, you can "turn on" Norwegian for just one element:
<BLOCKQUOTE LANG="no">...</BLOCKQUOTE>
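The two uses can be combined in one document. The following is a minimal sketch, not a definitive template: the sample sentences are invented for illustration, and the accented letters are written as character entities so the example does not depend on the file's encoding. Elements without their own lang attribute inherit the language of their parent:

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<HTML LANG="fr">
<HEAD>
<TITLE>Exemple</TITLE>
</HEAD>
<BODY>
<!-- Inherits LANG="fr" from the HTML element -->
<P>Ce paragraphe h&eacute;rite de la langue du document.</P>
<!-- Switches to Norwegian for this element and its contents only -->
<BLOCKQUOTE LANG="no">
<P>Dette avsnittet er p&aring; norsk.</P>
</BLOCKQUOTE>
<!-- Back to the document language -->
<P>Retour au fran&ccedil;ais.</P>
</BODY>
</HTML>
```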
The value for the lang attribute is a language code (not the same as a country code). The current HTML and XML specifications support the two-letter language codes established in RFC 1766, which are listed in Table 7-1. Language identification has since advanced to include three-letter codes, two-letter codes with a country subcode (for example, fr-CA for French as used in Canada), and other descriptive subcodes, as proposed in RFC 3066. This revised system is expected to be supported in future updates of the HTML and XML specifications.
Code | Language | Code | Language | Code | Language |
---|---|---|---|---|---|
aa | Afar | fy | Frisian | lv | Latvian |
ab | Abkhazian | ga | Irish | mg | Malagasy |
af | Afrikaans | gd | Scots Gaelic | mi | Maori |
am | Amharic | gl | Galician | mk | Macedonian |
ar | Arabic | gn | Guarani | ml | Malayalam |
as | Assamese | gu | Gujarati | mn | Mongolian |
ay | Aymara | ha | Hausa | mo | Moldavian |
az | Azerbaijani | he | Hebrew (formerly iw) | mr | Marathi |
ba | Bashkir | hi | Hindi | ms | Malay |
be | Byelorussian | hr | Croatian | mt | Maltese |
bg | Bulgarian | hu | Hungarian | my | Burmese |
bh | Bihari | hy | Armenian | na | Nauru |
bi | Bislama | ia | Interlingua | ne | Nepali |
bn | Bengali; Bangla | id | Indonesian (formerly in) | nl | Dutch |
bo | Tibetan | ie | Interlingue | no | Norwegian |
br | Breton | ik | Inupiak | oc | Occitan |
ca | Catalan | is | Icelandic | om | (Afan) Oromo |
co | Corsican | it | Italian | or | Oriya |
cs | Czech | iu | Inuktitut | pa | Punjabi |
cy | Welsh | ja | Japanese | pl | Polish |
da | Danish | jw | Javanese | ps | Pashto, Pushto |
de | German | ka | Georgian | pt | Portuguese |
dz | Bhutani | kk | Kazakh | qu | Quechua |
el | Greek | kl | Greenlandic | rm | Rhaeto-Romance |
en | English | km | Cambodian | rn | Kirundi |
eo | Esperanto | kn | Kannada | ro | Romanian |
es | Spanish | ko | Korean | ru | Russian |
et | Estonian | ks | Kashmiri | rw | Kinyarwanda |
eu | Basque | ku | Kurdish | sa | Sanskrit |
fa | Persian | ky | Kirghiz | sd | Sindhi |
fi | Finnish | la | Latin | sg | Sangho |
fj | Fiji | ln | Lingala | sh | Serbo-Croatian |
fo | Faroese | lo | Laothian | si | Sinhalese |
fr | French | lt | Lithuanian | sk | Slovak |
sl | Slovenian | tg | Tajik | uk | Ukrainian |
sm | Samoan | th | Thai | ur | Urdu |
sn | Shona | ti | Tigrinya | uz | Uzbek |
so | Somali | tk | Turkmen | vi | Vietnamese |
sq | Albanian | tl | Tagalog | vo | Volapuk |
sr | Serbian | tn | Setswana | wo | Wolof |
ss | Siswati | to | Tonga | xh | Xhosa |
st | Sesotho | tr | Turkish | yi | Yiddish (formerly ji) |
su | Sundanese | ts | Tsonga | yo | Yoruba |
sv | Swedish | tt | Tatar | za | Zhuang |
sw | Swahili | tw | Twi | zh | Chinese |
ta | Tamil | ug | Uighur | zu | Zulu |
te | Telugu | | | | |
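A code with a country subcode is written the same way as a plain two-letter code. The fragment below is only a sketch of how such values might look in markup, assuming a user agent that recognizes subcodes. The fr-CA value comes from the discussion of RFC 3066 above; en-US is an additional illustrative value. In both cases the first part of the value is still the two-letter language code from the table:

```html
<!-- French as used in Canada: language code "fr" plus country subcode "CA" -->
<P LANG="fr-CA">...</P>
<!-- English as used in the United States (illustrative value) -->
<P LANG="en-US">...</P>
```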
An internationalized HTML standard needs to take into account that many languages read from right to left. In Unicode, each character carries an intrinsic directionality as one of its properties.
The HTML 4.01 specification provides the new dir attribute for specifying the direction in which the text should be interpreted. It can be used in conjunction with the lang attribute and may be added within the tags of most elements. The accepted value for direction is either ltr for left-to-right or rtl for right-to-left. For example, the following code indicates that the paragraph is intended to be displayed in Arabic, reading from right to left:
<P LANG="ar" DIR="rtl">...</P>
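Like lang, the dir attribute is inherited, so it can be set once for a whole document and changed only where needed. The following sketch assumes a mostly Arabic page that contains a single English paragraph; the English sentence is invented for illustration:

```html
<HTML LANG="ar" DIR="rtl">
<HEAD>
<TITLE>...</TITLE>
</HEAD>
<BODY>
<!-- Inherits LANG="ar" and DIR="rtl" from the HTML element -->
<P>...</P>
<!-- Switches both language and direction for this paragraph only -->
<P LANG="en" DIR="ltr">This paragraph is in English and reads left to right.</P>
</BODY>
</HTML>
```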
There is also a new tag introduced in HTML 4.01 that deals specifically with documents that contain combinations of left- and right-reading text (bidirectional text, or Bidi for short). The <bdo> tag is used for "bidirectional override," in other words, to specify a span of text that should override the intrinsic direction (as inherited from Unicode) of the text it contains. The <bdo> tag takes the dir attribute as follows:
<BDO DIR="ltr">English phrase in an otherwise Hebrew text</BDO>...
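The override is easiest to see on Latin letters, whose intrinsic direction is left-to-right. The following is a small sketch with made-up sample text. A user agent that implements the element should display the letters of the second paragraph in reverse order (CBA), because the override forces them to run right to left:

```html
<!-- Intrinsic direction of the letters applies: displayed as ABC -->
<P>Normal order: ABC</P>
<!-- Direction forced to right-to-left: displayed as CBA -->
<P>Overridden order: <BDO DIR="rtl">ABC</BDO></P>
```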
The <bdo> element and dir attribute are currently not supported by browsers.
In some writing systems, the shape of a character varies depending on its position in the word. In Arabic, for instance, a character at the beginning of a word can look completely different from the same character at the end of a word. Generally, this joining behavior is handled by the software, but two Unicode characters give precise control over it. They have zero width and are placed between characters purely to specify whether the neighboring characters should join.
HTML 4.01 provides mnemonic character entities for both these characters, as shown in Table 7-2.
Mnemonic | Numeric | Name | Description |
---|---|---|---|
&zwnj; | &#8204; | zero-width non-joiner | Prevents joining of characters that would otherwise be joined |
&zwj; | &#8205; | zero-width joiner | Joins characters that would otherwise not be joined |
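In markup, the entity is typed between the two characters it affects. The following sketch uses a Persian word as an illustration (the word is an assumption chosen for the example, not taken from the specification) and assumes the file is saved in an encoding that can represent Arabic-script letters, such as UTF-8. The two lines are equivalent: the first uses the mnemonic entity and the second the numeric one, and in both the non-joiner keeps the letter before it from connecting to the letter after it:

```html
<!-- Mnemonic entity for the zero-width non-joiner -->
<P LANG="fa" DIR="rtl">می&zwnj;خواهم</P>
<!-- Same word using the numeric character reference -->
<P LANG="fa" DIR="rtl">می&#8204;خواهم</P>
```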
Copyright © 2002 O'Reilly & Associates. All rights reserved.