HTML 4.01 Language Features (Web Design in a Nutshell, 2nd Edition)

7.2.1. The lang Attribute

The lang attribute can be added within any tag to specify the language of the contained element. It can also be added within the <html> tag to specify a language for an entire document. The following example specifies the document's language as French:

<HTML LANG="fr">

It can also be used within text elements to switch to other languages within a document; for example, you can "turn on" Norwegian for just one element:

<BLOCKQUOTE lang="no">...</BLOCKQUOTE>
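To show how the two uses of lang fit together, here is a minimal sketch of a complete document that declares French as its default language and switches to Norwegian for a single quotation. The sample sentences and the title are invented for illustration; accented letters are written as character entities so the example does not depend on a particular character encoding.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<HTML LANG="fr">
<HEAD>
<TITLE>Exemple</TITLE>
</HEAD>
<BODY>
<!-- No LANG here: the paragraph inherits "fr" from the HTML element -->
<P>Ce paragraphe h&eacute;rite de la langue du document.</P>

<!-- LANG on the element overrides the inherited value for its contents -->
<BLOCKQUOTE LANG="no">
<P>Dette sitatet er p&aring; norsk.</P>
</BLOCKQUOTE>

<!-- Back to the document default -->
<P>Retour au fran&ccedil;ais.</P>
</BODY>
</HTML>
```

Elements that do not set lang themselves take the value of their nearest ancestor that does, so the attribute only needs to appear where the language changes.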

The value of the lang attribute is a language code (not the same as a country code). The current HTML and XML specifications support the two-letter language codes established in RFC 1766, which are listed in Table 7-1. Language identification has since advanced to include three-letter codes, two-letter codes with a country subcode (for example, fr-CA for French as used in Canada), and other descriptive subcodes, as proposed in RFC 3066. Future updates of the HTML and XML specifications are expected to support this revised system.
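As a rough sketch of what the subcode form looks like in markup, the fragment below uses the fr-CA value mentioned above alongside a plain two-letter code. The sample text is invented for illustration, and whether a given browser or tool acts on the subcode is an assumption you should test rather than rely on.

```html
<!-- Primary language code only -->
<P LANG="fr">Texte en fran&ccedil;ais.</P>

<!-- Primary code plus a country subcode, separated by a hyphen -->
<P LANG="fr-CA">Texte en fran&ccedil;ais canadien.</P>
```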

Table 7-1. Two-letter codes of language names

Code	Language	Code	Language	Code	Language
aa	Afar	fy	Frisian	lv	Latvian
ab	Abkhazian	ga	Irish	mg	Malagasy
af	Afrikaans	gd	Scots Gaelic	mi	Maori
am	Amharic	gl	Galician	mk	Macedonian
ar	Arabic	gn	Guarani	ml	Malayalam
as	Assamese	gu	Gujarati	mn	Mongolian
ay	Aymara	ha	Hausa	mo	Moldavian
az	Azerbaijani	he	Hebrew (formerly iw)	mr	Marathi
ba	Bashkir	hi	Hindi	ms	Malay
be	Byelorussian	hr	Croatian	mt	Maltese
bg	Bulgarian	hu	Hungarian	my	Burmese
bh	Bihari	hy	Armenian	na	Nauru
bi	Bislama	ia	Interlingua	ne	Nepali
bn	Bengali; Bangla	id	Indonesian (formerly in)	nl	Dutch
bo	Tibetan	ie	Interlingue	no	Norwegian
br	Breton	ik	Inupiak	oc	Occitan
ca	Catalan	is	Icelandic	om	(Afan) Oromo
co	Corsican	it	Italian	or	Oriya
cs	Czech	iu	Inuktitut	pa	Punjabi
cy	Welsh	ja	Japanese	pl	Polish
da	Danish	jw	Javanese	ps	Pashto, Pushto
de	German	ka	Georgian	pt	Portuguese
dz	Bhutani	kk	Kazakh	qu	Quechua
el	Greek	kl	Greenlandic	rm	Rhaeto-Romance
en	English	km	Cambodian	rn	Kirundi
eo	Esperanto	kn	Kannada	ro	Romanian
es	Spanish	ko	Korean	ru	Russian
et	Estonian	ks	Kashmiri	rw	Kinyarwanda
eu	Basque	ku	Kurdish	sa	Sanskrit
fa	Persian	ky	Kirghiz	sd	Sindhi
fi	Finnish	la	Latin	sg	Sangho
fj	Fiji	ln	Lingala	sh	Serbo-Croatian
fo	Faroese	lo	Laothian	si	Sinhalese
fr	French	lt	Lithuanian	sk	Slovak
sl	Slovenian	tg	Tajik	uk	Ukrainian
sm	Samoan	th	Thai	ur	Urdu
sn	Shona	ti	Tigrinya	uz	Uzbek
so	Somali	tk	Turkmen	vi	Vietnamese
sq	Albanian	tl	Tagalog	vo	Volapuk
sr	Serbian	tn	Setswana	wo	Wolof
ss	Siswati	to	Tonga	xh	Xhosa
st	Sesotho	tr	Turkish	yi	Yiddish (formerly ji)
su	Sundanese	ts	Tsonga	yo	Yoruba
sv	Swedish	tt	Tatar	za	Zhuang
sw	Swahili	tw	Twi	zh	Chinese
ta	Tamil	ug	Uighur	zu	Zulu
te	Telugu

7.2.2. Directionality

An internationalized HTML standard needs to take into account that many languages read from right to left. In Unicode, each character carries an inherent directionality as one of its properties.

The HTML 4.01 specification provides the new dir attribute for specifying the direction in which the text should be interpreted. It can be used in conjunction with the lang attribute and may be added within the tags of most elements. The accepted value for direction is either ltr for left-to-right or rtl for right-to-left. For example, the following code indicates that the paragraph is intended to be displayed in Arabic, reading from right to left:

<P LANG="ar" DIR="rtl">...</P>
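Because dir can be set on most elements, a base direction can be declared once on a container and overridden on a nested element. The fragment below is a sketch of that pattern; the DIV wrapper, the SPAN, and the placeholder text are illustrative assumptions rather than markup taken from the specification.

```html
<!-- DIR on a block element sets the base direction for its contents -->
<DIV LANG="ar" DIR="rtl">
  <P>...</P>

  <!-- A nested element can declare its own language and direction -->
  <P>... <SPAN LANG="en" DIR="ltr">left-to-right phrase</SPAN> ...</P>
</DIV>
```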

There is also a new tag introduced in HTML 4.01 that deals specifically with documents that contain combinations of left- and right-reading text (bidirectional text, or Bidi for short). The <bdo> tag is used for "bidirectional override," in other words, to specify a span of text that should override the intrinsic direction (as inherited from Unicode) of the text it contains. The <bdo> tag takes the dir attribute as follows:

<BDO DIR="ltr">English phrase in an otherwise Hebrew text</BDO>...
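One way to check whether a browser honors the override is to apply it to text whose natural direction is obvious. In this sketch (the sample string is arbitrary), a browser that implements <bdo> should display the second line with its characters in reverse visual order, while a browser that ignores the element shows both lines identically.

```html
<!-- Latin letters and digits, displayed in their intrinsic left-to-right order -->
<P>Normal: ABC 123</P>

<!-- The same characters forced right-to-left; DIR is required on BDO -->
<P>Overridden: <BDO DIR="rtl">ABC 123</BDO></P>
```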

The <bdo> element and dir attribute are currently not supported by browsers.

7.2.3. Cursive Joining Behavior

In some writing systems, the shape of a character varies depending on its position in the word. For instance, in Arabic, a character used at the beginning of a word looks completely different when it is used as the last character of a word. Generally, this joining behavior is handled within the software, but there are Unicode characters that give precise control over joining behavior. They have zero width and are placed between characters purely to specify whether the neighboring characters should join.

HTML 4.01 provides mnemonic character entities for the two joining-control characters, as shown in Table 7-2.

Table 7-2. Unicode characters for joining behavior

Mnemonic	Numeric	Name	Description
`&zwnj;`	`&#8204;`	zero-width non-joiner	Prevents joining of characters that would otherwise be joined
`&zwj;`	`&#8205;`	zero-width joiner	Joins characters that would otherwise not be joined
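The following fragment is a small sketch of how the two entities might be used. It relies on the Arabic letter beh, written as the numeric reference &#1576; so the example does not depend on the file's encoding; the choice of letter is arbitrary, and the exact shapes displayed depend on the browser and font.

```html
<P LANG="ar" DIR="rtl">
  <!-- Two instances of beh with nothing between them: normally drawn joined -->
  &#1576;&#1576;

  <!-- The same letters separated by a zero-width non-joiner: drawn unjoined -->
  &#1576;&zwnj;&#1576;

  <!-- A single beh followed by a zero-width joiner: requests a joining form -->
  &#1576;&zwj;
</P>
```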

