There is a range of control-like Unicode characters, some of which fulfill the same role as markup. Which should I use, and which should I avoid?
The answer depends on which characters are being considered. For more detail, read the W3C Note and Unicode Technical Report Unicode in XML & Other Markup Languages. This article summarizes some of that information.
The following table lists Unicode characters that should not be used in a markup context, according to Unicode in XML & Other Markup Languages. You should use markup instead.
Names/ Description | Short Comment |
---|---|
Line and paragraph separator | use <br>, <p>, or equivalent |
BIDI embedding controls (LRE, RLE, LRO, RLO, PDF) | Strongly discouraged where markup exists. |
Activate/Inhibit Symmetric swapping | Deprecated in Unicode |
Activate/Inhibit Arabic form shaping | Deprecated in Unicode |
Activate/Inhibit National digit shapes | Deprecated in Unicode |
Interlinear annotation characters | Use ruby markup |
Byte order mark / ZWNBSP | Use only as byte order mark. Use U+2060 Word Joiner instead of using U+FEFF as ZWNBSP |
Object replacement character | Use markup, e.g. HTML <object> or HTML <img> |
Scoping for Musical Notation | Use an appropriate markup language |
Language Tag code points | Use lang and/or xml:lang |
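As a rough illustration of the table above, the following Python sketch scans a string for the listed characters and reports the suggested alternative. The names `DISCOURAGED_RANGES` and `find_discouraged` are hypothetical, and the code point ranges are one interpretation of the table rather than anything normative. Among the language tag code points, only U+E0001 is checked here.

```python
# Ranges of code points that the table above advises against in marked-up text.
# The structure and names are illustrative assumptions, not part of the W3C note.
DISCOURAGED_RANGES = [
    (0x2028, 0x2029, "line/paragraph separator: use <br>, <p>, or equivalent"),
    (0x202A, 0x202E, "bidi embedding control: use markup where it exists"),
    (0x206A, 0x206F, "deprecated in Unicode"),
    (0xFFF9, 0xFFFB, "interlinear annotation: use ruby markup"),
    (0xFFFC, 0xFFFC, "object replacement character: use markup"),
    (0x1D173, 0x1D17A, "musical notation scoping: use a markup language"),
    (0xE0001, 0xE0001, "language tag: use lang and/or xml:lang"),
]


def find_discouraged(text):
    """Return (index, 'U+XXXX', advice) for each discouraged character."""
    hits = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        # U+FEFF is acceptable only as a byte order mark at the very start.
        if cp == 0xFEFF:
            if i > 0:
                hits.append((i, "U+FEFF", "use U+2060 WORD JOINER instead"))
            continue
        for low, high, advice in DISCOURAGED_RANGES:
            if low <= cp <= high:
                hits.append((i, f"U+{cp:04X}", advice))
                break
    return hits


if __name__ == "__main__":
    for hit in find_discouraged("first line\u2028second line"):
        print(hit)
```

A check like this could run as a lint step over source text. It only reports findings, since the right replacement markup depends on the document format.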
The bidirectional text embedding controls, in particular, often cause confusion. There are some places where these have to be used to produce correctly ordered bidirectional text in languages that use right-to-left scripts, such as Arabic, Hebrew, Thaana, etc. These are places where an element doesn't allow embedded markup, such as the title element. Where markup is available, however, you should use it. For more information about this, see Unicode controls vs. markup for bidi support. For guidance on how to use the embedding controls in situations where markup cannot be used, see Using Unicode controls for bidi text.
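The choice described above can be sketched in Python as follows. This is a minimal illustration, not a recommended library: the function name `embed_rtl` and its `markup_allowed` flag are hypothetical, and the use of a `span` element with `dir="rtl"` assumes an HTML context.

```python
import html

RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def embed_rtl(text, markup_allowed):
    """Wrap right-to-left text for insertion into left-to-right content.

    Uses markup where the context permits it, and falls back to the
    Unicode embedding controls where it does not (e.g. inside <title>).
    """
    if markup_allowed:
        # Preferred: express the direction with markup.
        return f'<span dir="rtl">{html.escape(text)}</span>'
    # Fallback: paired control characters; every RLE needs a closing PDF.
    return f"{RLE}{text}{PDF}"


if __name__ == "__main__":
    phrase = "\u05e9\u05dc\u05d5\u05dd"  # a Hebrew word, used as sample text
    print(embed_rtl(phrase, markup_allowed=True))
    print(repr(embed_rtl(phrase, markup_allowed=False)))
```

The fallback branch does not escape the text because it is meant for contexts where the caller handles escaping separately, which is an assumption of this sketch.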
The following table is not exhaustive. It merely provides some examples of Unicode characters that are valid to use alongside markup to provide information about the text.
Names/ Description | Short Comment |
---|---|
Various | No-break space, Soft Hyphen, Combining Grapheme Joiner, Non breaking Hyphen, Word Joiner, etc. |
Zero-width Joiners (ZWJ and ZWNJ) | e.g. required for Persian |
Implicit directional marks (LRM and RLM) | |
Subtending marks | common feature in the Arabic and Syriac scripts |
Variation Selectors | e.g. required for Mongolian |
Ideographic Description Characters | indicate the composition of ideographs |
The following passage is taken from Unicode in XML & Other Markup Languages:
The Unicode Standard provides compatibility mappings for a number of characters. Compatibility mappings indicate a relationship to another character, but the exact nature of the relationship varies. In some cases the relationship means "is based on", in some other cases it denotes a property. When plain text is marked up, it may make sense to map some of these characters to their compatibility equivalents and suitable markup. It is important to understand the nature of the distinctions between characters and their compatibility equivalents and the context in which these distinctions matter. It is never advisable to apply compatibility mappings indiscriminately.
The following table gives a non-exhaustive list of examples.
Names/ Description | Examples | Verdict |
---|---|---|
Circled letters and digits used for list item markers | ① ② ③ Ⓐ Ⓑ Ⓒ ㊂ ㊃ ㊄ ㊓ ㊔ ㊕ ㋝ ㋞ ㋟ | OK |
Parenthesized or dotted number used as list item markers | ⑴ ⑵ ⑶ | use list item marker style |
Arabic Presentation forms | ﻉ ﻊ ﻋ ﻌ | normalize |
Half-width and full-width characters | ﾔ ﾕ ﾖ ﾗ ａ ｂ ｃ ｄ | OK |
Superscripted and subscripted characters | ¹ ² ³ ₁ ₂ ₃ | use <sup> or <sub> markup |
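To make the idea of selective compatibility mapping concrete, here is a small Python sketch using the standard `unicodedata` module. It inspects the compatibility tag of each character and converts only superscripts and subscripts to markup, leaving other compatibility characters (such as circled digits) untouched. The function names `compat_tag` and `superscripts_to_markup` are hypothetical, and the sketch assumes the input contains no characters that need HTML escaping.

```python
import unicodedata


def compat_tag(ch):
    """Return the compatibility tag of a character (e.g. '<super>'), or None."""
    decomposition = unicodedata.decomposition(ch)
    # Compatibility decompositions start with a tag in angle brackets;
    # canonical decompositions and undecomposable characters do not.
    if decomposition.startswith("<"):
        return decomposition.split(" ", 1)[0]
    return None


def superscripts_to_markup(text):
    """Replace superscript/subscript characters with <sup>/<sub> markup."""
    elements = {"<super>": "sup", "<sub>": "sub"}
    out = []
    for ch in text:
        element = elements.get(compat_tag(ch))
        if element is None:
            out.append(ch)  # leave everything else alone
        else:
            base = unicodedata.normalize("NFKC", ch)
            out.append(f"<{element}>{base}</{element}>")
    return "".join(out)


if __name__ == "__main__":
    print(superscripts_to_markup("x\u00b2 + a\u2081"))
    print(compat_tag("\u2460"))  # circled digit one
```

Each character is wrapped individually, so a run of adjacent superscripts yields several adjacent elements. Merging them is left out to keep the sketch short.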