7.2. HTML 4.01 Language Features

Coordinating character sets is only the first part of the challenge. Even languages that share a character set may have different rules for hyphenation, spacing, quotation marks, punctuation, and so on. In addition to character shapes (glyphs), issues such as directionality (whether the text reads left-to-right or right-to-left) and cursive joining behavior have to be taken into account as well.

These differences created a need for a system of language identification. The W3C responded by incorporating into HTML the language tags proposed in RFC 2070, the standard on HTML internationalization.

7.2.1. The lang Attribute

The lang attribute can be added to most elements to specify the language of the element's content. It can also be added to the <html> tag to specify a language for an entire document. The following example specifies the document's language as French:

<HTML LANG="fr">

It can also be used within text elements to switch to other languages within a document; for example, you can "turn on" Norwegian for just one element:

<BLOCKQUOTE lang="no">...</BLOCKQUOTE>
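Putting the two uses together, a minimal sketch of a complete document might look like the following. The title and the sample sentences are placeholders chosen for illustration, and the doctype assumes HTML 4.01 Strict.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<!-- lang on the root element sets French as the default for the whole page -->
<html lang="fr">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <title>Exemple</title>
</head>
<body>
  <p>Bonjour tout le monde.</p>
  <!-- lang on a nested element overrides the inherited value for its content only -->
  <blockquote lang="no">
    <p>God morgen.</p>
  </blockquote>
  <!-- After the blockquote closes, the language reverts to French -->
  <p>Au revoir.</p>
</body>
</html>
```

Because the value is inherited, only the elements that depart from the document's main language need their own lang attribute.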

The value for the lang attribute is a language code (not the same as a country code). The current HTML and XML specifications support the two-letter language codes, drawn from ISO 639, that were established in RFC 1766. These are listed in Table 7-1. RFC 1766 also allows a country subcode to follow the language code (for example, fr-CA for French as used in Canada). Its successor, RFC 3066, extends language identification further, most notably with three-letter codes. This revised system is expected to be supported in future updates of the HTML and XML specifications.
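As an illustration of the subcode form mentioned above, the fragment below compares a bare language code with a language code followed by a country subcode. The en-US value is an added example that does not come from the text, and the paragraph content is left elided.

```html
<!-- Primary language code only: French, with no regional distinction -->
<p lang="fr">...</p>

<!-- Language code plus country subcode: French as used in Canada -->
<p lang="fr-CA">...</p>

<!-- The same pattern applied to English as used in the United States -->
<p lang="en-US">...</p>
```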

Table 7-1. Two-letter codes of language names

| Code | Language | Code | Language | Code | Language |
| --- | --- | --- | --- | --- | --- |
| aa | Afar | fy | Frisian | lv | Latvian |
| ab | Abkhazian | ga | Irish | mg | Malagasy |
| af | Afrikaans | gd | Scots Gaelic | mi | Maori |
| am | Amharic | gl | Galician | mk | Macedonian |
| ar | Arabic | gn | Guarani | ml | Malayalam |
| as | Assamese | gu | Gujarati | mn | Mongolian |
| ay | Aymara | ha | Hausa | mo | Moldavian |
| az | Azerbaijani | he | Hebrew (formerly iw) | mr | Marathi |
| ba | Bashkir | hi | Hindi | ms | Malay |
| be | Byelorussian | hr | Croatian | mt | Maltese |
| bg | Bulgarian | hu | Hungarian | my | Burmese |
| bh | Bihari | hy | Armenian | na | Nauru |
| bi | Bislama | ia | Interlingua | ne | Nepali |
| bn | Bengali; Bangla | id | Indonesian (formerly in) | nl | Dutch |
| bo | Tibetan | ie | Interlingue | no | Norwegian |
| br | Breton | ik | Inupiak | oc | Occitan |
| ca | Catalan | is | Icelandic | om | (Afan) Oromo |
| co | Corsican | it | Italian | or | Oriya |
| cs | Czech | iu | Inuktitut | pa | Punjabi |
| cy | Welsh | ja | Japanese | pl | Polish |
| da | Danish | jw | Javanese | ps | Pashto, Pushto |
| de | German | ka | Georgian | pt | Portuguese |
| dz | Bhutani | kk | Kazakh | qu | Quechua |
| el | Greek | kl | Greenlandic | rm | Rhaeto-Romance |
| en | English | km | Cambodian | rn | Kirundi |
| eo | Esperanto | kn | Kannada | ro | Romanian |
| es | Spanish | ko | Korean | ru | Russian |
| et | Estonian | ks | Kashmiri | rw | Kinyarwanda |
| eu | Basque | ku | Kurdish | sa | Sanskrit |
| fa | Persian | ky | Kirghiz | sd | Sindhi |
| fi | Finnish | la | Latin | sg | Sangho |
| fj | Fiji | ln | Lingala | sh | Serbo-Croatian |
| fo | Faroese | lo | Laothian | si | Sinhalese |
| fr | French | lt | Lithuanian | sk | Slovak |
| sl | Slovenian | tg | Tajik | uk | Ukrainian |
| sm | Samoan | th | Thai | ur | Urdu |
| sn | Shona | ti | Tigrinya | uz | Uzbek |
| so | Somali | tk | Turkmen | vi | Vietnamese |
| sq | Albanian | tl | Tagalog | vo | Volapuk |
| sr | Serbian | tn | Setswana | wo | Wolof |
| ss | Siswati | to | Tonga | xh | Xhosa |
| st | Sesotho | tr | Turkish | yi | Yiddish (formerly ji) |
| su | Sundanese | ts | Tsonga | yo | Yoruba |
| sv | Swedish | tt | Tatar | za | Zhuang |
| sw | Swahili | tw | Twi | zh | Chinese |
| ta | Tamil | ug | Uighur | zu | Zulu |
| te | Telugu | | | | |

7.2.2. Directionality

An internationalized HTML standard needs to take into account that many languages read from right to left. In Unicode, directionality is a property assigned to each character.

The HTML 4.01 specification provides the new dir attribute for specifying the direction in which the text should be interpreted. It can be used in conjunction with the lang attribute and may be added within the tags of most elements. The accepted value for direction is either ltr for left-to-right or rtl for right-to-left. For example, the following code indicates that the paragraph is intended to be displayed in Arabic, reading from right to left:

<P LANG="ar" DIR="rtl">...</P>
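For comparison, here is a small sketch that places a default left-to-right paragraph next to a right-to-left one. The Arabic sample text ("hello, world"), the title, and the UTF-8 encoding declaration are assumptions added for illustration. The described rendering applies only in a browser that honors dir.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
  <!-- A Unicode encoding is assumed so the Arabic letters survive intact -->
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <title>Direction example</title>
</head>
<body>
  <!-- Default direction (ltr): the text starts at the left edge -->
  <p>Hello, world.</p>
  <!-- lang identifies the language; dir sets the base direction of the
       paragraph, so the text starts at the right edge -->
  <p lang="ar" dir="rtl">مرحبا بالعالم</p>
</body>
</html>
```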

There is also a new tag introduced in HTML 4.01 that deals specifically with documents that contain combinations of left- and right-reading text (bidirectional text, or Bidi for short). The <bdo> tag is used for "bidirectional override," in other words, to specify a span of text that should override the intrinsic direction (as inherited from Unicode) of the text it contains. The <bdo> tag takes the dir attribute as follows:

<BDO DIR="ltr">English phrase in an otherwise Hebrew text</BDO>...
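The effect of an override is easiest to see when it contradicts the intrinsic direction of the characters. The fragment below is an illustrative sketch with placeholder text, and it assumes a browser that implements <bdo>.

```html
<!-- Without an override, Latin letters keep their intrinsic
     left-to-right order and display as: abc -->
<p>abc</p>

<!-- bdo with dir="rtl" overrides that intrinsic direction, so a
     supporting browser displays the same characters as: cba -->
<p><bdo dir="rtl">abc</bdo></p>
```

Only the display order changes. The characters remain stored in the source in their original order.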

At the time of this writing, the <bdo> element and dir attribute are not consistently supported by browsers.

7.2.3. Cursive Joining Behavior

In some writing systems, the shape of a character varies depending on its position in the word. For instance, in Arabic, a character used at the beginning of a word looks completely different when it is used as the last character of a word. Generally, this joining behavior is handled within the software, but there are Unicode characters that give precise control over joining behavior. They have zero width and are placed between characters purely to specify whether the neighboring characters should join.

HTML 4.01 provides mnemonic character entities for both of these characters, as shown in Table 7-2.

Table 7-2. Unicode characters for joining behavior

| Mnemonic | Numeric | Name | Description |
| --- | --- | --- | --- |
| &zwnj; | &#8204; | zero-width non-joiner | Prevents joining of characters that would otherwise be joined |
| &zwj; | &#8205; | zero-width joiner | Joins characters that would otherwise not be joined |
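The following fragment sketches how these entities might appear in markup. The Persian word (a verb form meaning "I want," conventionally written with a non-joiner after its prefix) and the lone Arabic letter heh are sample text chosen for illustration. A UTF-8 encoded document is assumed.

```html
<!-- &zwnj; keeps the last letter of the prefix from joining to the
     letter that follows, without inserting a space -->
<p lang="fa" dir="rtl">می&zwnj;خواهم</p>

<!-- The numeric reference is equivalent to the mnemonic entity -->
<p lang="fa" dir="rtl">می&#8204;خواهم</p>

<!-- &zwj; after a lone Arabic letter heh requests its joining
     (initial) form even though no letter follows it -->
<p lang="ar" dir="rtl">ه&zwj;</p>
```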



Copyright © 2002 O'Reilly & Associates. All rights reserved.