Copyright © 2004 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
This document is out of date and contains incorrect information. For the latest information about character encodings in HTML and CSS, see Internationalization Techniques: Authoring HTML & CSS.
It is important to consider character encoding matters when producing internationalized content, and to understand how to choose and declare encodings, how and when to use character escapes, and so on.
This document is one of a series of documents providing HTML authors with techniques for developing internationalized HTML using XHTML 1.0 or HTML 4.01, supported by CSS1, CSS2 and some aspects of CSS3. It focuses specifically on advice about character sets, encodings, and other character-specific matters. It is produced by the Guidelines, Education & Outreach Task Force (GEO) of the W3C Internationalization Working Group (I18N WG). The GEO Task Force encourages feedback about the content of this document as well as participation in the development of the techniques by people who have experience creating Web content that conforms to internationalization needs.
This document is an editors' copy that has no official standing.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is the First Public Working Draft of a document produced by the GEO (Guidelines, Education & Outreach) Task Force of the W3C Internationalization Working Group (I18N WG). The Internationalization Working Group is part of the W3C Internationalization Activity. This is a draft document that does not fully represent the consensus of the group at this time. The Working Group expects to advance this Working Draft to Working Group Note.
The document provides practical techniques related to character sets, encodings, and other character-specific matters that HTML content authors can use to ensure that their HTML is easily adaptable for an international audience. These are techniques that are best addressed from the start of content development if unnecessary costs and resource issues are to be avoided later on.
This document was last published as part of a larger document entitled Authoring Techniques for XHTML & HTML Internationalization 1.0. The material in that document will now be published as a number of smaller independent documents to allow for easier ongoing improvements and updates. The total number of such documents is not fixed, but will grow as material and resources become available. The title of all related documents will begin with "Authoring Techniques for XHTML & HTML Internationalization:..." and they can be found in the W3C technical reports index.
The Task Force encourages feedback about the content of this document as well as participation in the development of the guidelines by people who have experience creating Web content that conforms to internationalization needs. Send comments about this document to [email protected]. The archives for this list are publicly available.
The Internationalization Working Group will not allow early implementation to constrain its ability to make changes to this specification prior to final release. Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document has been produced under the 24 January 2002 CPP as amended by the W3C Patent Policy Transition Procedure. The Working Group maintains a public list of patent disclosures relevant to this document; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) with respect to this specification should disclose the information in accordance with section 6 of the W3C Patent Policy. At the time of publication, the Working Group believed there were no patent disclosures relevant to this specification.
1 Introduction
1.1 Who should use this document
1.2 How to use this document
1.3 Standards addressed
1.4 User agents addressed
1.5 Editorial notes
2 Choosing a page encoding
3 Specifying a page encoding
3.1 Using the HTTP header
3.2 Declaring the encoding in-document
3.3 Declaring the encoding in more than one place
3.4 Choosing names for your encodings
4 Representing characters using escapes
All HTML content authors working with XHTML 1.0, HTML 4.01, XHTML 1.1, CSS1, CSS2 and CSS3.
The term author is used in the sense described by the HTML 4.01 spec, ie. as a person or program that writes or generates HTML documents.
This document provides guidance for the development of HTML so that it will support international usage. This is the responsibility of all content authors, not just the localization group, and is relevant from the very start of development. Ignoring the advice in this document, or relegating it to a later phase in the development, will only add unnecessary costs and resource issues at a later date.
It is assumed that readers of this document are proficient in developing HTML and XHTML pages - this document limits itself to providing advice related specifically to internationalization.
If you are new to this topic you may wish to read this document from end to end. It is, however, expected that this document will normally be used for reference purposes - the reader dipping in to a particular section to find out how to perform a specific task with internationalization in mind.
This document is one of several documents relating to the design of XHTML and HTML documents. An overview document is available that summarises all the recommendations of this and its companion documents together, organized according to tasks that a developer of XHTML/HTML content may want to perform. When this material is used as a reference, it is recommended that the overview document is used as a starting point.
Cross references and further resources are summarized at the end of each section.
Editorial notes have been left in this version of the document. These are marked [Ed. note: like this].
For information about the applicability of recommendations to user agents see below.
This document provides techniques for developing pages using HTML 4.01, XHTML 1.0 and XHTML 1.1 with CSS1, CSS2 and some parts of CSS3.
Note that XHTML source can be served as XML (using MIME types application/xhtml+xml, application/xml or text/xml) or HTML (using the MIME type text/html).
It is very common for XHTML to be served as HTML, following the compatibility guidelines in Appendix C of the XHTML 1.0 specification. This allows authors with the right editing tools to produce valid XML code, which therefore lends itself to processing with such things as scripting or XSLT, but is also well supported for display by most mainstream browsers. (XHTML served as application/xhtml+xml is not well supported for browser display at the moment.)
In this document we wish to reflect practical reality for content authors, so we cover XHTML served as text/html in the techniques.
Indeed we encourage the use of XHTML, and all the examples (unless trying to make a specific point about HTML 4.01) are written in XHTML.
For XHTML served as XML, this document limits its advice to documents served as application/xhtml+xml. Note that user agent support for XHTML served as XML is still patchy.
In order to improve the value of this information to the user we try to ground techniques with information about their applicability to particular user agents.
User agents, in this current version, means a number of mainstream browsers. (The scope may grow as resources and test results become available for other user agents.)
In an attempt to make the task of tracking browser applicability manageable, we have chosen a 'base version' for each of the user agents we are tracking for applicability. This base version represents a fairly recent, standards-compliant version of the browser. Where a browser operates in both standards- and quirks-mode, standards-mode is assumed (ie. you should use a DOCTYPE statement).
The base versions considered for this version of the document include:
Internet Explorer 6 (Windows)
Mozilla 1.4
Opera 7
Netscape Navigator 7
Safari
Internet Explorer 5 (Mac)
If the technique is applicable to a base version of a user agent the name of that user agent will appear immediately below the summary of the technique. If the technique is not applicable, the name will appear crossed out. If the name does not appear at all, this signifies that further investigation is needed. If the technique is applicable to a later version than the chosen base version, this will be indicated by adding the version number to the name.
Detailed information may also be provided from time to time about behavior of a user agent in an earlier version than the base version, or about some particular aspect of the behavior of a base version or later user agent. This is provided in a special boxed section within the body of the text.
[Ed. note: Prereading: Draw out the distinction between the document character set (always Unicode) and the document encoding.]
[Ed. note: add normalisation info [[ http://lists.w3.org/Archives/Public/public-i18n-geo/2003Jan/0020.html]
[Ed. note: incorporate guidance related to Character Model & Unicode and Markup Languages [[ http://lists.w3.org/Archives/Public/public-i18n-geo/2003Jan/0020.html]
When selecting a page encoding, consider both current and future localization requirements, and the benefits of using the same encoding across all pages and all languages. These considerations make the use of Unicode an attractive choice for the following reasons:
Unicode supports many languages, enabling the use of a single encoding across all pages and forms, regardless of language.
Unicode allows many more languages to be mixed on a single page than almost any other choice. If the set of languages to be represented on a single page cannot be represented directly by any single native encoding (such as ISO-8859-1, Shift-JIS, etc.), then Unicode is almost certainly the best choice.[Ed. note: How is this different from the previous point?]
For dynamically-generated pages, a single encoding for all pages eliminates the need for server-side logic to determine the character encoding for each page served.
For interactive applications using forms, a single encoding eliminates the need for server-side logic to determine the character encoding of incoming form data.
Unicode enables a form in one language (e.g. English) to accept input in a different language (e.g. Chinese).
Unicode (UTF-8) forms will be easier to migrate to XForms.[Ed. note: We should add some justification for this.]
UTF-8 and UTF-16 are both Unicode encodings. Since support for Unicode is currently limited to UTF-8 in many user agents, UTF-8 is usually the appropriate Unicode encoding. However, as user agent support for UTF-16 expands, UTF-16 will become an increasingly viable alternative.
Although there are other multi-script encodings (such as ISO-2022 and GB18030), Unicode generally provides the best combination of user agent and script support.
There are some situations where selecting a Unicode encoding is not practical. If content is encoded in a native encoding (legacy content or content originating from an external source) and the system lacks functionality for converting content between encodings, Unicode may greatly complicate implementation. If such a site is only required to serve single-script pages (containing languages that can be represented by a single native encoding), then the cost of using a Unicode encoding may outweigh the benefits. In this case, a native encoding (such as ISO-8859-1, Shift-JIS, etc.) may be a better choice.
Be sure to select an encoding that covers most [Ed. note: all?] of the characters required for the content, and (if it is a form) all of the characters that must be accepted as input.
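One way to check coverage before committing to an encoding is to test sample content against each candidate. The following Python sketch is illustrative only: the function name and the sample text are assumptions made for this example, and it relies on the encoders that ship with Python rather than on any particular user agent.

```python
def uncovered_characters(text, encoding):
    """Return the characters in `text` that `encoding` cannot represent."""
    missing = set()
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.add(ch)
    return sorted(missing)


# Hypothetical sample content mixing German and Japanese.
sample = "Grüße, 東京"
print(uncovered_characters(sample, "iso-8859-1"))  # ['京', '東']
print(uncovered_characters(sample, "utf-8"))       # []
```

An empty result means that every character in the sample can be stored directly in that encoding. A non-empty result lists the characters that would need escapes or a different encoding.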
Not all user agents support all page encodings, so it is important to understand which user agents must be able to render the page, and be sure that they have adequate support for the page encoding you have selected.
In general, user agents are most likely to support the commonly-used native character encodings for the major languages used on the web. Support for less commonly used encodings depends on the user agent. Older user agents, or user agents that operate under severe memory limitations, may not support UTF-8.
It is important to note that support for a given encoding does not necessarily imply support for all writing systems that encoding supports. For example, a user agent might support UTF-8, but not correctly display bidirectional Arabic text encoded in UTF-8. To display a page correctly, a user agent must support both the page encoding and the writing system.
[Ed. note: Point to an updated version of the table in hints & tips]
For overviews of the mechanics of specifying a page encoding and additional examples, see the tutorial Character sets & encodings.
Whether you declare the encoding by passing information alongside the document in the HTTP header, or inside the document itself, you should always ensure that the encoding is declared. If you don't do this, the chances are high that your document will be incorrectly rendered.
Note also that you should include a character encoding declaration even if your document uses a basic Latin encoding such as ISO 8859-1. For example, Japanese user agents will default to a Japanese encoding that does not include the accented letters, so their users may not see your text correctly unless you specify the encoding.
According to the HTML specification, in a case of conflict the HTTP charset declaration has the highest priority of all means of declaring the character set.
Advantages to this approach:
User agents can easily find the character encoding information when it is sent in the HTTP header.
The HTTP header information has the highest priority in case of conflict, so this approach should be used by intermediate servers that transcode the data (ie. convert to a different encoding). This is sometimes done for small devices that only recognize a small number of encodings. Because the HTTP header information has precedence over any in-document declaration, it doesn't matter that transcoders typically do not change the internal encoding declarations, just the document encoding.
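As a rough illustration of declaring the encoding in the HTTP header, the following Python sketch shows a minimal WSGI application that sends a charset parameter in its Content-Type header. The application name and the page content are assumptions made for this example. Real deployments usually set this through server or framework configuration instead.

```python
def app(environ, start_response):
    """Minimal WSGI application that declares its encoding in the HTTP header."""
    # Encode the body in the same encoding that the header announces.
    body = "<p>Grüße</p>".encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

The application can be tried locally with the standard library's wsgiref.simple_server.make_server. The important point is that the charset value in the header and the encoding used to produce the bytes are kept in step.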
There may be some disadvantages when dealing with static files or templates:
It may be difficult for content authors to change the encoding information on the server - especially when dealing with an ISP. They will need knowledge of and access to the server settings.
Server settings may get out of synchronization with the document for one reason or another. This may happen, for example, if you rely on the server default, and that default is changed. This is a very bad situation, since the higher precedence of the HTTP information versus the in-document declaration may cause the document to become unreadable.
In addition, there are potential problems for both static and dynamic documents if they are to be saved by the user or used from a location such as a CD or hard disk. In these cases encoding information from an HTTP header is not available.
Similarly, if the character encoding is only declared in the HTTP header, this information may become separated from files that are processed by such things as XSLT or scripts, or from files that are sent for translation.
For these reasons you should always ensure that encoding information is also declared inside the document.
Care should also be taken to ensure that the server-side settings are maintained if the file is moved or the server technology is changed.
Discrepancies may arise due to the document being moved, because a server administrator or other content author changes settings that cascade to your document, or because the server or server version has changed, etc. Since encoding declarations in the HTTP header have highest priority in determining the encoding of the document, it is a very bad situation if the server-side settings are inadvertently changed.
If content authors need to set server-side settings, it is important to also ensure that they have the required knowledge, access and privileges to do so. This is especially important when dealing with a third-party ISP.
This does not rule out also declaring it in the HTTP information provided by the server, but provides for use of the document when the HTTP information is not available.
This is important for both static and dynamic documents if there is a chance that your documents will be saved to or read from disk, CD, etc.
Also, if the character encoding is only declared in the HTTP header, this information may become separated from files that are sent for translation or processed by such things as XSLT or scripts.
It is also valuable for developers, testers, or translation production managers who may want to perform a visual check of a document.
The following is an example of a meta statement. For more information about usage, see the tutorial Character sets & encodings.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
This approach is not appropriate for documents served as XML, but when serving a document as HTML, there are no disadvantages and a couple of definite advantages, even if the encoding has been declared in the HTTP header:
An in-document encoding allows the document to be read correctly when not on a server. This applies not only to static documents read from disk or CD, but also dynamic documents that are saved by the reader.
An in-document declaration of this kind helps developers, testers, or translation production managers who want to perform a visual check of a document. This applies particularly to static documents or templates used to generate dynamic documents.
This maximizes the likelihood that non-ASCII characters will be correctly recognized by the user agent.
The HTML spec says "The meta declaration must only be used when the character encoding is organized such that ASCII-valued bytes stand for ASCII characters (at least until the meta element is parsed)."
[Ed. note: How true is this?]
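The restriction quoted above can be illustrated with a small Python sketch. It checks whether the ASCII byte sequence for the start of a meta element is present in a document as encoded. The function name is an assumption made for this example, and the detection logic in a real user agent is more involved than a simple byte search.

```python
META = '<meta http-equiv="Content-Type" content="text/html; charset={}"/>'


def meta_is_findable(encoding):
    """True if the ASCII bytes b'<meta' appear in the encoded document."""
    data = META.format(encoding).encode(encoding)
    return b"<meta" in data


print(meta_is_findable("utf-8"))       # True
print(meta_is_findable("iso-8859-1"))  # True
print(meta_is_findable("utf-16"))      # False
```

In UTF-16 each ASCII character occupies two bytes, so a parser that scans for ASCII bytes cannot locate the meta element until it already knows the encoding.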
When serving XHTML as application/xhtml+xml, always use an XML declaration with an encoding attribute.
The following is an example of an XML declaration. For more information about usage, see the tutorial Character sets & encodings.
<?xml version="1.0" encoding="UTF-8"?>
If you are serving XHTML as application/xhtml+xml, the encoding attribute is mandatory unless you are using UTF-8 or UTF-16 or declaring the encoding in the HTTP header.
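The effect of omitting the encoding attribute can be sketched with Python's standard XML parser. The two byte strings below are assumptions made for this example. Both contain the same ISO-8859-1 encoded content, and only the first declares its encoding.

```python
import xml.etree.ElementTree as ET

# The byte 0xE9 is "é" in ISO-8859-1, but is not valid UTF-8 on its own.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\xe9</p>'
undeclared = b"<p>caf\xe9</p>"

print(ET.fromstring(declared).text)  # café

try:
    ET.fromstring(undeclared)
except ET.ParseError as err:
    # With no declaration and no external information, the parser
    # assumes UTF-8 and rejects the document.
    print("rejected:", err)
```

The parser here receives only the bytes, with no HTTP header, which is the situation an XML processor faces when reading a file from disk.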
Even if the document is encoded in UTF-8 or UTF-16, declaring the encoding in the document is useful for the following reasons:
It is useful to have the encoding declared in the document when editing or processing the file as XML.
An in-document declaration helps developers, testers, or translation production managers who want to perform a visual check of a document. This is a good reason for including the encoding declaration even if the file is in UTF-8 or UTF-16, despite the fact that it is not strictly necessary for these encodings.
An in-document encoding allows the document to be read correctly when not read from the server.
There is likely to be no other in-document alternative to express the character encoding. (The charset meta declaration is not recognized by XML processors.)
The following is an example of an XML declaration. For more information about usage, see the tutorial Character sets & encodings.
<?xml version="1.0" encoding="UTF-8"?>
Key reasons for using XHTML are to take advantage of the benefits that XML brings for editing and processing, but when these documents are served as text/html to user agents, they are treated as HTML, not XML.
Advantages to including an XML declaration include the following:
If your document is not encoded in UTF-8 or UTF-16 and the encoding is not declared in an HTTP header, it is necessary to have this in-document encoding declaration when editing or processing the file as XML, eg. using XSLT transformations or scripting, since the XML processors do not see HTTP information, and do not recognize the meta charset statement described earlier.
In some cases, you may want to serve the same static document as either HTML or XML, depending on the capabilities of the requesting user agent. This can be achieved by server-side logic. In these cases you will want to have an XML declaration in the document when it is served as XML. (We are assuming that the appropriate declaration can be added to the file via scripting for dynamically created documents.)
On the other hand:
Because the XML declaration may cause undesirable effects in some user agents (see Serving HTML & XHTML), you may prefer to omit it.
The XML declaration is not actually needed for HTML documents (which is what we are discussing here). HTML processors do not use this information, and the encoding information should be included in the meta charset statement described above.
In summary we could say the following:
If the XML declaration will not cause your document any harm, it is best to include it. If you do use an XML declaration, you should always declare the encoding in it.
If you are worried about the undesirable effects sometimes associated with use of the XML declaration in HTML files, the best solution is to omit the declaration but serve the file as UTF-8 or UTF-16.
If you use UTF-8 or UTF-16 the file is still perfectly valid XML, but no XML declaration is required.
This is required by the XHTML specification.
If all declarations are correct, then there will be no conflicts.
If you serve encoding information in the HTTP header, it is particularly important to ensure that it is always served correctly since this declaration has the highest priority. It is also the method most open to risks of inadvertent change.
Also ensure that any editing or scripting tools you use consistently apply the correct encoding information - especially if your tools add the declarations automatically.
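A simple consistency check along these lines can be sketched in Python. The function names and the regular expression are assumptions made for this example. The pattern only handles straightforward meta statements and is not a substitute for a real HTML parser.

```python
import re
from email.message import Message

# Matches the charset value inside a meta element (simple cases only).
META_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE
)


def http_charset(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent


def meta_charset(document_bytes):
    """Extract the charset named in a meta statement near the start of the file."""
    match = META_RE.search(document_bytes[:2048])
    return match.group(1).decode("ascii").lower() if match else None


def declarations_agree(content_type, document_bytes):
    """True if the declarations that are present all name the same encoding."""
    declared = {http_charset(content_type), meta_charset(document_bytes)} - {None}
    return len(declared) <= 1
```

A check of this kind could be run as part of publishing or testing, so that a change in server settings that contradicts the in-document declaration is noticed before readers see an unreadable page.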
The IANA charset registry shows a name plus a list of aliases for each registered charset value. One of these is identified as the preferred MIME name. Wherever you declare the character encoding, use the preferred MIME name in the charset value.
This maximizes the likelihood of interoperability.
This is not usually a good idea since it limits interoperability.
For an explanation of the different types of escape available in XHTML, HTML and CSS, see What are entities and NCRs?.
Using escapes can make it difficult to read and maintain source code, and can also significantly increase file size. Many English-speaking developers have the expectation that other languages only make occasional use of non-ASCII characters, but this is wrong.
There are three characters which should always appear in content as escapes, so that they do not interact with the syntax of the markup:
< (&lt;)
> (&gt;)
& (&amp;)
You may also want to represent the double-quote (") as &quot; - particularly in attribute text when you need to use the same type of quotes as you used to surround the attribute value.
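When content is generated by a program, these escapes are normally applied by a library function rather than by hand. The following Python sketch uses html.escape from the standard library, and the sample strings are assumptions made for this example.

```python
from html import escape

# Escape only the three syntax characters (suitable for element content).
print(escape("a < b & c > d", quote=False))  # a &lt; b &amp; c &gt; d

# By default quotes are escaped as well (suitable for attribute values).
print(escape('say "hi"'))  # say &quot;hi&quot;
```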
Escapes can be useful to represent characters not supported by the encoding you chose for the document. For example, to represent Chinese characters in an ISO Latin 1 document. You should ask yourself first, however, why you have not changed the encoding of the document to something that covers all the characters you need (such as, of course, UTF-8).
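As an illustration, Python's encoders can substitute numeric character references for any characters that the target encoding does not cover. The sample text is an assumption made for this example. Note that this error handler produces decimal rather than hexadecimal references.

```python
text = "Beijing is 北京"

# Characters outside ISO-8859-1 are replaced by decimal NCRs.
encoded = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(encoded)  # b'Beijing is &#21271;&#20140;'
```

Two characters have become eighteen bytes of escapes, which illustrates why changing the document encoding is usually the better answer when such characters are common.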
If your editing tool does not allow you to easily enter needed characters you may also resort to using escapes. Note that this is not a long-term solution, nor one that works well if you have to enter a lot of such characters - it takes longer and makes maintenance more difficult. Ideally you would choose an editing tool that allowed you to enter these characters as characters.
A potentially very useful role for escapes is for characters that are invisible or ambiguous in presentation.
One example would be Unicode character 200F: RIGHT-TO-LEFT MARK. This character can be used to clarify directionality in bidirectional text (eg. when using the Arabic or Hebrew scripts). It has no graphic form, however; so it is difficult to see where these characters are in the text, and if they are lost or forgotten they could create unexpected results during later editing. Using &rlm; (or its NCR equivalent &#x200F;) instead makes it very easy to spot these characters.
An example of an ambiguous character is 00A0: NO-BREAK SPACE. This type of space prevents line breaking, but it looks just like any other space when used as a character. Using &nbsp; (or &#xA0;) makes it quite clear where such spaces appear in the text.
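A tool could apply this substitution automatically when saving source. The following Python sketch is one possible approach. The table and the function name are assumptions made for this example, and the table covers only a few characters.

```python
# Characters that are invisible or ambiguous, with their HTML entity names.
VISIBLE_FORMS = {
    "\u200f": "&rlm;",   # RIGHT-TO-LEFT MARK
    "\u200e": "&lrm;",   # LEFT-TO-RIGHT MARK
    "\u00a0": "&nbsp;",  # NO-BREAK SPACE
}


def reveal_invisibles(source):
    """Replace invisible or ambiguous characters with their entity names."""
    return "".join(VISIBLE_FORMS.get(ch, ch) for ch in source)


print(reveal_invisibles("10\u00a0km\u200f"))  # 10&nbsp;km&rlm;
```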
It is a common error for people working on a page encoded in Windows code page 1252, for example, to try to represent the euro sign using &#x80;. This is because the euro appears at position 80 (hexadecimal) on the Windows 1252 code page. Using &#x80; would actually produce a control character, since the escape would be expanded as the character at position 80 in the Unicode repertoire. What was really needed was &#x20AC;.
Typically when the Unicode Standard refers to or lists characters it does so using a hexadecimal value. For instance, the code point for the letter á may be referred to as U+00E1. Given the prevalence of this convention, it is often useful, though not required, to use hexadecimal numeric values in escapes rather than decimal values. You do not need to use leading zeros in escapes.
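The difference between a byte value in a code page and a Unicode code point can be checked with a short Python sketch. The helper name is an assumption made for this example.

```python
import unicodedata


def hex_ncr(ch):
    """Build a hexadecimal NCR from the character's Unicode code point."""
    return f"&#x{ord(ch):X};"


# Byte 0x80 in Windows code page 1252 is the euro sign ...
euro = b"\x80".decode("cp1252")
print(euro, hex_ncr(euro))  # € &#x20AC;

# ... but code point U+0080 in Unicode is a control character.
print(unicodedata.category("\u0080"))  # Cc

# No leading zeros are needed in the escape.
print(hex_ncr("á"))  # &#xE1;
```

The escape is always built from the Unicode code point, never from the byte value the character happens to have in the page encoding.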
Any XML application recognizes numeric character references such as &#xE1; as representing Unicode characters. On the other hand, an entity such as &aacute; has to be declared in the DTD or Schema to be recognized in the XML. Character entities are defined as part of the HTML / XHTML standard, but are often not incorporated in other flavours of XML.
If there is a likelihood that you will want to repurpose or process this information (including sometimes running it through localization tools), you should think carefully about which approach is most appropriate.
This is likely to be a very rare occurrence, firstly, because it is usually better to use style information in a separate stylesheet or stylesheet element; and, secondly, because there are not many situations where you are likely to need non-ASCII characters in styling that appears in an attribute.
The issue arises because a style attribute in XHTML or HTML can represent characters using NCRs, entities or CSS escapes. On the other hand, the style element in HTML can contain neither NCRs nor entities, and the same applies to an external style sheet.
Because there is a tendency to want to move styles declared in attributes to the style element or an external style sheet (for example, this might be done automatically using an application or script), it is safest to use only CSS escapes.
For example, it is better to use
<span style="font-family: L\FC beck">...</span>
than
<span style="font-family: L&#xFC;beck">...</span>
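A script that moves styles between attributes and style sheets could generate CSS escapes itself. The following Python sketch shows one way to do so. The function name is an assumption made for this example, and it escapes every non-ASCII character without attempting to handle other CSS syntax.

```python
def css_escape_non_ascii(value):
    """Replace each non-ASCII character with a CSS escape."""
    out = []
    for ch in value:
        if ord(ch) < 128:
            out.append(ch)
        else:
            # Backslash, hexadecimal code point, then a space that
            # terminates the escape so following letters are not absorbed.
            out.append(f"\\{ord(ch):X} ")
    return "".join(out)


print(css_escape_non_ascii("Lübeck"))  # L\FC beck
```

The trailing space matters here: without it, the "be" that follows would be read as further hexadecimal digits of the escape.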
tbd
Discuss
The following GEO Task Force members have contributed their time and valuable comments to shaping these guidelines:
Phil Arko, Steve Billings, Deborah Cawkwell, Wendy Chisholm, Andrew Cunningham, Martin Dürst, Lloyd Honomichl, Russ Rolfe, Peter Sigrist, Tex Texin, Najib Tounsi