Authoring Techniques for XHTML & HTML Internationalization: Characters and Encodings 1.0 -- (Editors' copy)

1 Introduction

1.1 Who should use this document

This document is intended for all HTML content authors working with XHTML 1.0, HTML 4.01, XHTML 1.1, CSS1, CSS2 and CSS3.

The term author is used in the sense described by the HTML 4.01 specification, i.e. as a person or program that writes or generates HTML documents.

This document provides guidance for the development of HTML so that it will support international usage. This is the responsibility of all content authors, not just the localization group, and is relevant from the very start of development. Ignoring the advice in this document, or relegating it to a later phase in the development, will only add unnecessary costs and resource issues at a later date.

It is assumed that readers of this document are proficient in developing HTML and XHTML pages; this document limits itself to providing advice related specifically to internationalization.

1.2 How to use this document

If you are new to this topic you may wish to read this document from end to end. It is, however, expected that this document will normally be used for reference purposes, with the reader dipping into a particular section to find out how to perform a specific task with internationalization in mind.

This document is one of several documents relating to the design of XHTML and HTML documents. An overview document is available that summarizes all the recommendations of this and its companion documents together, organized according to tasks that a developer of XHTML/HTML content may want to perform. When this material is used as a reference, it is recommended that the overview document is used as a starting point.

Cross references and further resources are summarized at the end of each section.

Editorial notes have been left in this version of the document. These are marked [Ed. note: like this].

For information about the applicability of recommendations to user agents, see section 1.4, User agents addressed.

1.3 Standards addressed

This document provides techniques for developing pages using HTML 4.01, XHTML 1.0 and XHTML 1.1 with CSS1, CSS2 and some parts of CSS3.

Note that XHTML source can be served as XML (using MIME types application/xhtml+xml, application/xml or text/xml) or HTML (using the MIME type text/html).
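As a rough illustration of the distinction above, the following Python sketch classifies a Content-Type header value by the MIME types listed in the previous paragraph. The function name served_as and the 'other' fallback are assumptions made for this example only; they do not come from any specification.

```python
# MIME types taken from the paragraph above. The names below are hypothetical.
XML_MIME_TYPES = {"application/xhtml+xml", "application/xml", "text/xml"}
HTML_MIME_TYPE = "text/html"


def served_as(content_type):
    """Return 'xml', 'html' or 'other' for a Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in XML_MIME_TYPES:
        return "xml"
    if mime == HTML_MIME_TYPE:
        return "html"
    return "other"
```

The same XHTML source can therefore fall into either category, depending only on the header the server sends with it.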

It is very common for XHTML to be served as HTML, following the compatibility guidelines in Appendix C of the XHTML 1.0 specification. This allows authors with the right editing tools to produce valid XML code, which therefore lends itself to processing with such things as scripting or XSLT, but is also well supported for display by most mainstream browsers. (XHTML served as application/xhtml+xml is not well supported for browser display at the moment.) In this document we wish to reflect practical reality for content authors, so we cover XHTML served as text/html in the techniques.

Indeed we encourage the use of XHTML, and all the examples (unless trying to make a specific point about HTML 4.01) are written in XHTML.

For XHTML served as XML, this document limits its advice to documents served as application/xhtml+xml. Note that user agent support for XHTML served as XML is still patchy.

1.4 User agents addressed

To make this information more useful to the reader, we try to ground each technique in information about its applicability to particular user agents.

User agents, in this current version, means a number of mainstream browsers. (The scope may grow as resources and test results become available for other user agents.)

In an attempt to make the task of tracking browser applicability manageable, we have chosen a 'base version' for each of the user agents we are tracking for applicability. This base version represents a fairly recent, standards-compliant version of the browser. Where a browser operates in both standards mode and quirks mode, standards mode is assumed (i.e. you should use a DOCTYPE declaration).

The base versions considered for this version of the document include:

Internet Explorer 6 (Windows)
Mozilla 1.4
Opera 7
Netscape Navigator 7
Safari
Internet Explorer 5 (Mac)

If the technique is applicable to a base version of a user agent, the name of that user agent will appear immediately below the summary of the technique. If the technique is not applicable, the name will appear crossed out. If the name does not appear at all, this signifies that further investigation is needed. If the technique is applicable only to a version later than the chosen base version, this will be indicated by adding the version number to the name.

Detailed information may also be provided from time to time about behavior of a user agent in an earlier version than the base version, or about some particular aspect of the behavior of a base version or later user agent. This is provided in a special boxed section within the body of the text.

1.5 Editorial notes

[Ed. note: Prereading: Draw out the distinction between the document character set (always Unicode) and the document encoding.]

[Ed. note: add normalisation info [http://lists.w3.org/Archives/Public/public-i18n-geo/2003Jan/0020.html]]

[Ed. note: incorporate guidance related to Character Model & Unicode and Markup Languages [http://lists.w3.org/Archives/Public/public-i18n-geo/2003Jan/0020.html]]

2 Choosing a page encoding

Choose UTF-8 or another Unicode encoding for all content.

IE(Win) Mozilla Opera NNav Safari IE(Mac)

When selecting a page encoding, consider both current and future localization requirements, and the benefits of using the same encoding across all pages and all languages. These considerations make the use of Unicode an attractive choice for the following reasons:

Unicode supports many languages, enabling the use of a single encoding across all pages and forms, regardless of language.
Unicode allows many more languages to be mixed on a single page than almost any other choice. If the set of languages to be represented on a single page cannot be represented directly by any single native encoding (such as ISO-8859-1, Shift-JIS, etc.), then Unicode is almost certainly the best choice. [Ed. note: How is this different from the previous point?]
For dynamically-generated pages, a single encoding for all pages eliminates the need for server-side logic to determine the character encoding for each page served.
For interactive applications using forms, a single encoding eliminates the need for server-side logic to determine the character encoding of incoming form data.
Unicode enables a form in one language (e.g. English) to accept input in a different language (e.g. Chinese).
Unicode (UTF-8) forms will be easier to migrate to XForms. [Ed. note: We should add some justification for this.]
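The points above can be sketched in Python. This is a minimal illustration, not a recommended implementation: the sample strings, the form field name comment and the helper can_encode are all assumptions made for the example. It checks whether a page mixing several languages fits in a given encoding, and then round-trips form data submitted in Chinese using UTF-8 alone.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical page text mixing English, French, Japanese, Chinese and Russian.
page_text = " / ".join(["Hello", "Café", "こんにちは", "你好", "Привет"])


def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Only the Unicode encoding covers the whole page.
results = {
    enc: can_encode(page_text, enc)
    for enc in ("utf-8", "iso-8859-1", "shift_jis")
}

# A form on an English page accepting Chinese input: with one fixed encoding,
# the server decodes incoming data without any per-page detection logic.
query = urlencode({"comment": "你好"}, encoding="utf-8")
decoded_form = parse_qs(query, encoding="utf-8")
```

With a fixed encoding on both sides, the server-side decoding step takes no language-specific branches.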

UTF-8 and UTF-16 are both Unicode encodings. Since support for Unicode is currently limited to UTF-8 in many user agents, UTF-8 is usually the appropriate Unicode encoding. However, as user agent support for UTF-16 expands, UTF-16 will become an increasingly viable alternative.
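To illustrate that UTF-8 and UTF-16 are two serializations of the same characters, the following Python sketch encodes one hypothetical sample string both ways and decodes it back. The sample text is an assumption chosen for the example.

```python
# Hypothetical sample text mixing Latin and Chinese characters.
text = "Café 你好"

utf8_bytes = text.encode("utf-8")
# Python's "utf-16" codec prepends a byte order mark (BOM) when encoding.
utf16_bytes = text.encode("utf-16")

# Both byte sequences decode back to exactly the same characters.
roundtrip_utf8 = utf8_bytes.decode("utf-8")
roundtrip_utf16 = utf16_bytes.decode("utf-16")

# The byte sequences differ in content and length, although the text is equal.
sizes = (len(text), len(utf8_bytes), len(utf16_bytes))
```

The choice between the two affects the bytes on the wire and user agent support, not the set of characters that can be represented.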

Although there are other multi-script encodings (such as ISO-2022 and GB18030), Unicode generally provides the best combination of user agent and script support.

Resources:

How to's

[Unicode] The Unicode Standard 4.0
The Unicode Standard is very readable and contains a large amount of useful information besides code point listings.

If you don't use a Unicode encoding, select an encoding that best supports the languages / characters to be included in the page text. [Ed. note: What does this mean? Does it mean, which maximizes the opportunity to directly represent characters and minimizes the need to represent characters by markup means such as character escapes? Does it include the idea that you should choose the most commonly used encoding for a region?]

IE(Win) Mozilla Opera NNav Safari IE(Mac)

There are some situations where selecting a Unicode encoding is not practical. If content is encoded in a native encoding (legacy content or content originating from an external source) and the system lacks functionality for converting content between encodings, Unicode may greatly complicate implementation. If such a site is only required to serve single-script pages (containing languages that can be represented by a single native encoding), then the cost of using a Unicode encoding may outweigh the benefits. In this case, a native encoding (such as ISO-8859-1, Shift-JIS, etc.) may be a better choice.
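Where conversion functionality is available, moving legacy content to Unicode can be sketched as a decode step followed by an encode step. The Python example below is a minimal sketch under the assumption that the source encoding of the legacy content is known in advance; the function name transcode_to_utf8 and the sample text are hypothetical.

```python
def transcode_to_utf8(raw, source_encoding):
    """Convert bytes in a known native encoding to UTF-8 bytes."""
    # decode() raises UnicodeDecodeError if the bytes are not valid in
    # source_encoding, so a wrong assumption fails loudly, not silently.
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# Hypothetical legacy content stored as Shift-JIS.
legacy_bytes = "日本語のページ".encode("shift_jis")
converted = transcode_to_utf8(legacy_bytes, "shift_jis")
```

If the source encoding is not reliably known, this approach does not apply as written, which is one of the situations in which a native encoding may remain the more practical choice.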

Be sure to select an encoding that covers most [Ed. note: all?] of the characters required for the content, and (if it is a form) all of the characters that must be accepted as input.
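One way to check coverage before committing to a native encoding is to list the characters of representative content that the candidate encoding cannot represent. The Python sketch below assumes that such sample content is available; the function name uncovered_characters and the sample strings are hypothetical.

```python
def uncovered_characters(text, encoding):
    """Return the distinct characters of text that encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # Record each unrepresentable character once, in order of appearance.
            if ch not in missing:
                missing.append(ch)
    return missing


# Hypothetical content checked against ISO-8859-1.
price_gaps = uncovered_characters("Prix: 10 €", "iso-8859-1")
cafe_gaps = uncovered_characters("Café", "iso-8859-1")
```

An empty result for all representative content and expected form input suggests the encoding is adequate; any character reported would have to be represented by other means, such as character escapes.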

Resources:

Reference links

Alan Wood’s Unicode Resources
Various resources about Unicode and multilingual support in HTML, fonts, web browsers and other applications.