NEW (2009-11-30): Charlint updated for Unicode 5.2.0.
Charlint is a character normalization/checking tool written in Perl. Among other things, it implements Normalization Form C of Unicode TR 15, as a test platform for Early Uniform Normalization in the W3C Character Model.
Charlint, aka 'Charlie', is written in Perl 5 (mostly independent of minor Perl versions). You can get the Charlint source from anonymous CVS. For initial checkout, use:
prompt$ cvs -d :pserver:[email protected]:/sources/public login
Logging in to :pserver:[email protected]:2401/sources/public
CVS password: anonymous
prompt$ cvs -d :pserver:[email protected]:/sources/public get charlint
cvs server: Updating charlint
U charlint/Overview.html
U charlint/README.cvs
U charlint/charlint.pl
If you don't have CVS, use the Web to download the newest version or start with the charlint CVS overview. Charlint is covered by the W3C software licence. To install charlint, make sure that you have installed Perl 5, downloaded an appropriate character data file, and downloaded the Perl source. Please send error reports or comments to [email protected]; for announcements and public discussion, please see the Winter mailing list ([email protected]).
You'll also need the Perl module Storable. The best way to install it may depend on your system.
Reading in a data file can take some time. You can use -s and -S to store a preprocessed file and load it back quickly for faster processing.
Charlint needs information on characters in order to work correctly. To indicate the file you want to use, please use the -f option. The recommended character data file is available from ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt. Charlint is updated when needed to work with new data files and to bring the composition exclusions (which are hand-coded) up to date.
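To give a feel for what the character data file contains, here is a minimal Python sketch that reads records in the UnicodeData.txt layout (semicolon-separated fields: code point, name, general category, canonical combining class, bidi class, decomposition, and so on). It is not charlint's own parser; the names `CharInfo` and `parse_unicode_data` are hypothetical, and the sketch ignores range entries such as the CJK "First"/"Last" pairs.

```python
from typing import Dict, Iterable, NamedTuple


class CharInfo(NamedTuple):
    name: str
    category: str         # general category, e.g. "Lu" or "Mn"
    combining_class: int  # canonical combining class
    decomposition: str    # raw decomposition field, empty if none


def parse_unicode_data(lines: Iterable[str]) -> Dict[int, CharInfo]:
    """Parse UnicodeData.txt-style records into a dict keyed by code point."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code = int(fields[0], 16)
        table[code] = CharInfo(fields[1], fields[2], int(fields[3]), fields[5])
    return table


# Three sample records in the UnicodeData.txt layout.
sample = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "00C5;LATIN CAPITAL LETTER A WITH RING ABOVE;Lu;0;L;0041 030A;;;;N;LATIN CAPITAL LETTER A RING;;;00E5;",
    "030A;COMBINING RING ABOVE;Mn;230;NSM;;;;;N;NON-SPACING RING ABOVE;;;;",
]
table = parse_unicode_data(sample)
print(table[0x00C5].decomposition)
```

With a real data file, the same function could be called on an open file handle instead of the sample list.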
Charlint is a Perl script that works as a simple filter. It uses UTF-8 both for input and for output. Behavior can be fine-tuned with various options. Without options, charlint converts input text to NFC (Normalization Form C, Canonical Composition). In addition, charlint checks for correct UTF-8, makes sure that there is no initial BOM, and warns about undefined and private-use area codepoints.
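The default behavior described above can be approximated in a few lines of Python. This is only an illustrative sketch, not charlint's implementation: it relies on Python's standard `unicodedata` module (which uses whatever Unicode version ships with the interpreter, not necessarily the data file charlint was given), and the function name `check_and_normalize` and the warning texts are made up for the example.

```python
import unicodedata

BOM = "\ufeff"


def check_and_normalize(raw: bytes):
    """Return (NFC text or None, list of warnings) for UTF-8 input bytes."""
    try:
        text = raw.decode("utf-8")  # strict decoding rejects malformed UTF-8
    except UnicodeDecodeError as err:
        return None, [f"invalid UTF-8 at byte {err.start}"]

    warnings = []
    if text.startswith(BOM):
        warnings.append("initial BOM")
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Cn":    # unassigned in this Unicode version
            warnings.append(f"undefined code point U+{ord(ch):04X}")
        elif category == "Co":  # private-use area
            warnings.append(f"private-use code point U+{ord(ch):04X}")
    return unicodedata.normalize("NFC", text), warnings


text, warnings = check_and_normalize("A\u030a".encode("utf-8"))
print(repr(text), warnings)
```

Here the input is the letter A followed by COMBINING RING ABOVE, which NFC composes into the single code point U+00C5.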
To preprocess the original data file, use -f originalDatafile -S storageFile -d -D. This saves all the data necessary for later processing for all normalization forms. To specify the storage file on later runs, use -s storageFile.
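The store-once, reload-quickly pattern behind -S and -s can be sketched in Python as follows. Charlint itself uses the Perl module Storable; `pickle` stands in for it here, and `parse_records`, `store_table` and `load_table` are hypothetical names chosen for this illustration.

```python
import os
import pickle
import tempfile


def parse_records(lines):
    """Stand-in for the slow preprocessing step: code point -> remaining fields."""
    table = {}
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) > 1:
            table[int(fields[0], 16)] = fields[1:]
    return table


def store_table(table, path):
    """Rough analogue of -S: write the preprocessed table to a storage file."""
    with open(path, "wb") as handle:
        pickle.dump(table, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_table(path):
    """Rough analogue of -s: read the table back without reparsing the text file."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


records = ["0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"]
table = parse_records(records)
with tempfile.TemporaryDirectory() as tmp:
    storage = os.path.join(tmp, "storageFile")
    store_table(table, storage)
    reloaded = load_table(storage)
print(reloaded == table)
```

The point of the design is that the expensive parsing and checking happens once, while later runs only pay for deserialization.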
For NFC, no special option is necessary. For NFD, use -x; for NFKC, use -K; for NFKD, use -x -K. Each option has to be given separately. To avoid any normalization, but still do the various checks, use -C. Charlint checks the input from the Unicode database file very carefully. To get the output of these operations, use -d.
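To see how the four normalization forms differ, the following Python sketch maps the option combinations described above to form names and applies them with the standard `unicodedata` module. It does not call charlint; `FORM_FOR_OPTIONS` and `normalize_like` are hypothetical helpers, and the results reflect the Unicode data bundled with Python.

```python
import unicodedata

# Option combinations from the text above, mapped to normalization form names.
FORM_FOR_OPTIONS = {
    frozenset(): "NFC",
    frozenset({"-x"}): "NFD",
    frozenset({"-K"}): "NFKC",
    frozenset({"-x", "-K"}): "NFKD",
}


def normalize_like(options, text):
    """Normalize text to the form that the given option set selects."""
    return unicodedata.normalize(FORM_FOR_OPTIONS[frozenset(options)], text)


# LATIN SMALL LIGATURE FI followed by LATIN SMALL LETTER E WITH ACUTE.
sample = "\ufb01\u00e9"

for options in ([], ["-x"], ["-K"], ["-x", "-K"]):
    result = normalize_like(options, sample)
    print(options, [f"U+{ord(ch):04X}" for ch in result])
```

The ligature only changes under the compatibility forms (NFKC, NFKD), while the accented letter is split into base letter plus combining accent under the decomposed forms (NFD, NFKD).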
A list of all options, such as the one below, can be obtained by using charlint -h.
(options prefixed by # are currently not available)
-b: Remove initial 'Byte Order Mark'
-B: Supress warning about initial 'Byte Order Mark'
-c: Detect non-normalized data (but do not normalize)
-C: Do not normalize
-d: Debug: Thoroughly check character data table input
-D: Leave after reading in character data
-e: # remove undefined codepoints
-E: Do not warn about undefined codepoints
-f file: Read data from file (no default anymore) (please use newest V3.2.0 datafiles)
-F951: Use old (wrong) mapping for U+F951 (use this option if you really need 3.1.0 behaviour)
-h: Prints out this short description
-k: # Warn about compatibility codepoints
-K: Normalize out (i.e. decompose) compatibility codepoints
-n: Accept &#ddddd; and &#xhhhh; on input (beware of <![CDATA[, <SCRIPT>, <STYLE>)
-nX: same as -n, plus &#Xhhhh; (use for HTML only!)
-N: Produce &#xhhhh; on output
-o: Print out 'unprintable' bytes as \\octal
-p: # Remove stuff in private use areas
-P: Supress checking private use areas
-q: Quiet, don't output progress messages
-s file: Read data from file produced with -S
-S file: Write data to file for fast reload with -s
-u: # Fix UTF-8 (convert or remove)
-U: Supress checking correctness of UTF-8
-v: Print version
-x: Do decomposition only
-X: Don't do decomposition (assume input is decomposed)
-YWH: Treat YOD WITH HIRIQ as precomposed (use this option if you really need 3.0.0 behaviour)
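The numeric character reference options (-n, -nX, -N) can be pictured with a small Python sketch. This is an assumption-laden illustration rather than charlint's logic: the helper names `decode_ncrs` and `encode_ncrs` are invented, and the choices to escape every non-ASCII character on output and to use uppercase hex digits are guesses, since the option list does not say which characters -N escapes. Like -n, the sketch is not aware of CDATA sections or script and style elements.

```python
import re

# -n style: decimal and lowercase-x hexadecimal references only.
NCR_LOWER = re.compile(r"&#(?:([0-9]+)|x([0-9A-Fa-f]+));")
# -nX style: additionally accept an uppercase X (HTML only).
NCR_ANY = re.compile(r"&#(?:([0-9]+)|[xX]([0-9A-Fa-f]+));")


def decode_ncrs(text, allow_upper_x=False):
    """Replace numeric character references with the characters they denote."""
    pattern = NCR_ANY if allow_upper_x else NCR_LOWER

    def replace(match):
        code = int(match.group(1)) if match.group(1) else int(match.group(2), 16)
        # Leave references beyond the Unicode range untouched.
        return chr(code) if code <= 0x10FFFF else match.group(0)

    return pattern.sub(replace, text)


def encode_ncrs(text):
    """Write every non-ASCII character as &#xhhhh; (assumed scope of -N)."""
    return "".join(ch if ord(ch) < 0x80 else f"&#x{ord(ch):X};" for ch in text)


decoded = decode_ncrs("A&#778; &#xC5; &#XC5;")
print(repr(decoded))
print(encode_ncrs("\u00c5b"))
```

Decoding references before normalization matters because a base letter followed by an escaped combining mark would otherwise escape composition.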
# 2009/11/28: 0.55, updated to Unicode Version 5.2.0 MJD
# 2002/06/24: 0.54, improving -nf16check (compiler warnings, speed) MJD
# 2002/06/08: 0.53, added -nf16check data file production MJD
# 2002/08/23: 0.52, changed default file to UnicodeData.txt MJD
# 2002/05/21: 0.51, added option -nX (use for HTML only!) MJD
# 2002/04/03: 0.50, updated for 3.2.0; added -F951; added -c MJD
# 2001/10/03: 0.49, code cleanup for use strict and -w MJD
# 2001/04/01: 0.48, updated for 3.1.0 (final) MJD
# 2001/03/07: 0.47, YOD WITH HIRIQ corrigendum MJD
# 2000/12/19: 0.46, updated for 3.1.0 (beta) MJD
# 2000/11/12: 0.45, bug fix for CJK extension A MJD
# 2000/11/09: 0.44, implemented -s/-S (Storable data) MJD
# 2000/10/05: 0.43, implemented -K (kompatibility decomposition) MJD
# 2000/10/05: 0.42, updated for 3.0.1, fixed line ends MJD
# 2000/10/05: 0.41, added 2000 to copyright, tested CVS commit MJD
# 2000/08/03: 0.40, added Hangul support and did quite some testing MJD
# 2000/08/02: 0.37, added -x and -X for decomposition MJD
# 2000/07/27: 0.36, fixed a bug for non-starter decompositions MJD
# 2000/07/24: 0.35, adapted exclusions to 3.0.0 final (+Tibetan) MJD
# 2000/07/24: 0.34, $chClass = $CombClass{ch}; should read $chClass = $CombClass{$ch}; # implemented -C MJD
# 1999/08/16: 0.33, updated for second version of 3.0.0.beta MJD
# 1999/07/01: 0.32, adapted surrogates/exclusions to 3.0.0.beta MJD
# 1999/06/25: 0.31, fixed reordering bug, going public MJD
# 1999/06/23: 0.30, preparation for W3C member test, without Hangul MJD
Charlint is still being maintained, so bug reports and patches are welcome.
There are many additional features that we have thought about adding, such as:
However, we would probably rewrite charlint in Ruby before adding major new features.