htmlchek -- Tools

From [email protected] Tue Mar 21 13:44:40 1995
Article: 24135 of comp.infosystems.www.misc
From: [email protected] (Henry Churchyard)
Subject: ANNOUNCE: htmlchek - HTML error checker and utilities - ver. 4.1
Date: 20 Mar 1995 21:50:29 -0600
Organization: The University of Texas at Austin; Austin, Texas

   A moderately significant bugfix and update to my htmlchek HTML
error checker program (and supplemental utilities) has been released,
adding several minor features for greater convenience of use
(particularly under MS-DOS), and bringing the version number to 4.1.
The new version has been posted to comp.sources.misc, and is available
from the htmlchek ftp archive site; the HTML version of the
documentation can be browsed over WWW.

   The htmlchek program checks for quite a number of possible defects
in the HTML (Hyper-Text Mark-up Language) version 2.0 SGML files used
on the World-Wide Web.  (Preliminary HTML 3.0 files for the Arena
browser, or files with Netscape extensions, can also be checked by
specifying the appropriate options -- however, this version of
htmlchek came out too soon to be able to include the new Netscape 1.1
beta tags.)  The program makes no claim to understand all of SGML, but
is easy and relatively simple to use, gives lots of information
(including about many stylistically bad practices), and can do local
cross-reference checking and generate rudimentary reference-dependency
maps.  And it can be run on any platform for which either an AWK or
Perl language interpreter is available (this includes everything from
Macintoshes to Unix to MS-DOS to VMS).

   The htmlchek package also includes a number of supplemental
utilities, including the htmlsrpl.pl HTML-aware search-and-replace
program, which uses either literal strings or regular expressions;
acts either only outside HTML/SGML tags, or only within tags; can be
restricted to operate only within and/or only outside specified
elements; and can also upper-case tag names.  Other utilities are:

    makemenu -- Makes simple menu for HTML files, based on each file's <TITLE>;
                  can also make a simple table of contents based on <H1>-<H6>
                  headings.
 xtraclnk.pl -- Extracts links/anchors from HTML files; isolates text
                  contained in <A> and <TITLE> elements.
      dehtml -- Removes all HTML markup, preliminary to spell check.
      entify -- Replaces high Latin-1 alphabetic characters with ampersand
                  entities for safe 7-bit transport.
    metachar -- Trivial program to protect HTML/SGML metacharacters "&<>" in
                  plain text that is to be included in an HTML file.
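
   To illustrate what the last two utilities are for, here is a minimal
sketch in Python.  It is not the code shipped in the package (the
utilities themselves are run under Awk or Perl); the function names
protect_metachars and entify_latin1 are hypothetical, and the entity
names are taken from Python's standard html.entities table rather than
from the package's own tables.

```python
import html.entities


def protect_metachars(text: str) -> str:
    """Escape the metacharacters & < > so plain text can be placed in HTML."""
    # The ampersand must be handled first; otherwise the "&" of the
    # entities produced for "<" and ">" would itself be escaped again.
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;"))


def entify_latin1(text: str) -> str:
    """Replace high Latin-1 alphabetic characters with named entities.

    Assumption: "high Latin-1 alphabetic" is read here as the letters in
    the range U+00C0..U+00FF, which excludes the multiplication and
    division signs.
    """
    out = []
    for ch in text:
        code = ord(ch)
        name = html.entities.codepoint2name.get(code)
        if 0xC0 <= code <= 0xFF and ch.isalpha() and name:
            out.append("&" + name + ";")
        else:
            out.append(ch)  # 7-bit characters and non-letters pass through
    return "".join(out)


if __name__ == "__main__":
    print(protect_metachars("AT&T <fast>"))   # AT&amp;T &lt;fast&gt;
    print(entify_latin1("caf\u00e9"))         # caf&eacute;
```

   The two steps are kept separate on purpose: running the metacharacter
protection after entification would turn "&eacute;" into "&amp;eacute;".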


   The documentation for htmlchek can be browsed over the Web at
the following URL:

 http://uts.cc.utexas.edu/~churchh/htmlchek.html

   The anonymous ftp site (courtesy of Davin Milun) is at the
following URL (also known as directory /pub/htmlchek at
ftp.cs.buffalo.edu):

 ftp://ftp.cs.buffalo.edu/pub/htmlchek/

   The following files are available there:

        htmlchek.tar.Z   Compressed versions of htmlchek with Unix (LF)
        htmlchek.tar.gz   line breaks.  Use "uncompress htmlchek.tar.Z"
                          or "gunzip htmlchek.tar.gz" to uncompress, then
                          "tar xvf htmlchek.tar" to extract.  (Due to
                          incompatibilities between Unix tar versions, the
                          htmlchek subdirectory may untar as a normal
                          file; to avoid this problem, just mkdir
                          htmlchek manually before untar-ing.  Sorry!)

        htmlchek.zip     Compressed version of htmlchek with MS-DOS (CR-LF)
                          line breaks.  Use "pkunzip htmlchek.zip" or
                          "unzip htmlchek.zip" to extract.

   Changes in this release include:

Don't warn about null <TEXTAREA></TEXTAREA> element; only check for
inappropriate whitespace within elements commonly rendered as
underlined (<A> and <U>); check ordering of head tags before body tags
even in absence of explicit <head>...</head>; allow comments between
list items; only output non-numeric unquoted option values in each
file; corrected processing of HTML3 <LH>; updated HTML 3 language
definition to January 19 1995 draft; tinkered with Netscape extensions
language-definition yet again; added inline=1 command-line parameter;
added listfile=/lf= command-line parameter (especially for greater
MS-DOS convenience); allow cf= as abbreviation of configfile=;
ampersands followed by non-alphabetics generate warnings rather than
errors (so the corresponding error message was removed from entify); added
"changed"/"unchanged" STDERR messages to htmlsrpl.pl output; added
.gif's to documentation; added awk-perl.html to documentation; added
index.html menu to documentation.

   New files in this release are:

     README.41    Update notes
      index.html  HTML version of README.40, README.41, and menu
   awk-perl.html  Where to obtain Awk and Perl
     geterr.sh    Trivial script to extract only ERROR! messages
                    from htmlchek output
   geterwrn.sh    Trivial script to extract only ERROR!/Warning!
                    messages from htmlchek output
                  ___
        awk.gif      |    .gif files used
      camel.gif      |     in htmlchek HTML
        ftp.gif      |     documentation  
   htmlchek.gif      |    (uuencoded as .uue
   htmlchks.gif      |     files in the
   valdhtml.gif      |     comp.sources.misc
    warning.gif   ___|     Usenet distribution)
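
   The kind of filtering that geterr.sh and geterwrn.sh are described as
doing can be sketched as follows, in Python.  This is an illustration
only, not the contents of those scripts: it assumes that each message
sits on a single line containing the literal text "ERROR!" or
"Warning!", and the function name filter_messages and the sample lines
are hypothetical.

```python
def filter_messages(lines, markers=("ERROR!",)):
    """Keep only the lines that contain at least one of the marker strings."""
    return [line for line in lines if any(m in line for m in markers)]


if __name__ == "__main__":
    # Hypothetical sample lines; real htmlchek output may be laid out differently.
    sample = [
        "ERROR! (placeholder error text)",
        "Warning! (placeholder warning text)",
        "(placeholder informational text)",
    ]
    # Errors only, in the spirit of geterr.sh
    for line in filter_messages(sample):
        print(line)
    # Errors and warnings, in the spirit of geterwrn.sh
    for line in filter_messages(sample, ("ERROR!", "Warning!")):
        print(line)
```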


   The htmlchek program performs a fairly comprehensive job of
checking for HTML errors, but does not always exactly follow the
official standard (currently this is version 1.24 of the HTML 2.0
DTD).  Bad stylistic practices are warned against, as well as actual
HTML errors, and in some cases htmlchek is stricter than the standard,
in order to accommodate the peculiarities of some browsers.  The idea
is that HTML code should be ruggedized for the real world, rather than
just being SGML-ically correct -- especially since the official
standard allows many SGML features which are hardly understood by any
HTML-specific applications; for example, according to the official
standard the following is a completely valid HTML 2.0 file (without
even any omitted tags!):

   <>&t;HEAD/<TITLE///<BODY/text<IMG TOP SRC=x.gif<![IGNORE[ </HTML>]]>/</>

--
         --Henry Churchyard     [email protected]