From [email protected] Tue Mar 21 13:44:40 1995
Article: 24135 of comp.infosystems.www.misc
From: [email protected] (Henry Churchyard)
Subject: ANNOUNCE: htmlchek - HTML error checker and utilities - ver. 4.1
Date: 20 Mar 1995 21:50:29 -0600
Organization: The University of Texas at Austin; Austin, Texas

A moderately significant bugfix and update to my htmlchek HTML error
checker program (and supplemental utilities) has been released, adding
several minor features for greater convenience of use (particularly
under MS-DOS), and bringing the version number to 4.1.  The new version
has been posted to comp.sources.misc, and is available from the
htmlchek ftp archive site; the HTML version of the documentation can be
browsed over WWW.

The htmlchek program checks for quite a number of possible defects in
the HTML (Hyper-Text Mark-up Language) version 2.0 SGML files used on
the World-Wide Web.  (Preliminary HTML 3.0 files for the Arena browser,
or files with Netscape extensions, can also be checked by specifying
the appropriate options -- however, this version of htmlchek came out
too soon to be able to include the new Netscape 1.1 beta tags.)

The program makes no claim to understand all of SGML, but is easy and
relatively simple to use, gives lots of information (including about
many stylistically bad practices), and can do local cross-reference
checking and generate rudimentary reference-dependency maps.  And it
can be run on any platform for which either an AWK or Perl language
interpreter is available (this includes everything from Macintoshes to
Unix to MS-DOS to VMS).

The htmlchek package also includes a number of supplemental utilities,
including the htmlsrpl.pl HTML-aware search-and-replace program, which
uses either literal strings or regular expressions; acts either only
outside HTML/SGML tags, or only within tags; can be restricted to
operate only within and/or only outside specified elements; and can
also upper-case tag names.
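To make the "only outside tags" and "only within tags" modes concrete,
here is a minimal sketch in Python.  It is not taken from htmlsrpl.pl
(which is a Perl script) and does not reproduce its options; the
function names and the deliberately simple tag pattern are assumptions
made for illustration only.

```python
import re

# Assumption: a "tag" is anything from '<' to the next '>'.  This ignores
# comments, marked sections, and '>' inside quoted attribute values.
TAG = re.compile(r"(<[^>]*>)")


def replace_outside_tags(html, pattern, repl):
    """Apply a regex replacement only to the text between tags."""
    parts = TAG.split(html)
    # With one capturing group, re.split puts text at even indices
    # and the captured tags at odd indices.
    for i in range(0, len(parts), 2):
        parts[i] = re.sub(pattern, repl, parts[i])
    return "".join(parts)


def replace_inside_tags(html, pattern, repl):
    """Apply a regex replacement only within <...> tags."""
    parts = TAG.split(html)
    for i in range(1, len(parts), 2):
        parts[i] = re.sub(pattern, repl, parts[i])
    return "".join(parts)


def uppercase_tag_names(html):
    """Upper-case element names in start and end tags, leaving attributes alone."""
    return re.sub(
        r"<(/?)([A-Za-z][A-Za-z0-9]*)",
        lambda m: "<" + m.group(1) + m.group(2).upper(),
        html,
    )


if __name__ == "__main__":
    sample = '<a href="img.html">img</a>'
    print(replace_outside_tags(sample, "img", "image"))
    print(replace_inside_tags(sample, "img", "pic"))
    print(uppercase_tag_names(sample))
```

Splitting on a capturing pattern keeps the tags in the result list, so
the document can be reassembled unchanged apart from the segments that
were deliberately rewritten.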

Other utilities are:

  makemenu     -- Makes simple menu for HTML files, based on each
                  file's <TITLE>; can also make a simple table of
                  contents based on <H1>-<H6> headings.
  xtraclnk.pl  -- Extracts links/anchors from HTML files; isolates text
                  contained in <A> and <TITLE> elements.
  dehtml       -- Removes all HTML markup, preliminary to spell check.
  entify       -- Replaces high Latin-1 alphabetic characters with
                  ampersand entities for safe 7-bit transport.
  metachar     -- Trivial program to protect HTML/SGML metacharacters
                  "&<>" in plain text that is to be included in an HTML
                  file.

The documentation for htmlchek can be browsed over the Web at the
following URL:

  http://uts.cc.utexas.edu/~churchh/htmlchek.html

The anonymous ftp site (courtesy of Davin Milun) is at the following
URL (also known as directory /pub/htmlchek at ftp.cs.buffalo.edu):

  ftp://ftp.cs.buffalo.edu/pub/htmlchek/

The following files are available there:

  htmlchek.tar.Z   Compressed versions of htmlchek with Unix (LF)
  htmlchek.tar.gz  line breaks.  Use "uncompress htmlchek.tar.Z" or
                   "gunzip htmlchek.tar.gz" to uncompress, then
                   "tar xvf htmlchek.tar" to extract.  (Due to
                   incompatibility between Unix tar's, the htmlchek
                   subdirectory may untar as a normal file; to avoid
                   this problem, just mkdir htmlchek manually before
                   untar-ing.  Sorry!)

  htmlchek.zip     Compressed version of htmlchek with MS-DOS (CR-LF)
                   line breaks.  Use "pkunzip htmlchek.zip" or
                   "unzip htmlchek.zip" to extract.
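
As a rough illustration of the kind of transformation performed by the
metachar and entify utilities listed earlier, here is a short Python
sketch.  It is not the code shipped in the package; the function names
simply borrow the utility names, and the exact set of characters
converted (alphabetic characters in the range 0xC0-0xFF that have a
named entity) is an assumption.

```python
from html.entities import codepoint2name


def metachar(text):
    """Protect the metacharacters & < > in plain text bound for an HTML file."""
    # '&' must be handled first, so that the ampersands introduced by the
    # later replacements are not themselves escaped again.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def entify(text):
    """Replace high Latin-1 alphabetic characters with named entities."""
    out = []
    for ch in text:
        cp = ord(ch)
        # Assumption: "high Latin-1 alphabetic" means letters in 0xC0-0xFF;
        # the multiplication and division signs in that range are skipped
        # because they are not alphabetic.
        if 0xC0 <= cp <= 0xFF and ch.isalpha() and cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(metachar("a < b & c"))
    print(entify("caf\u00e9"))
```

The output of entify contains only 7-bit characters for such input,
which is the point of the "safe 7-bit transport" remark above.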

Changes in this release include:

  Don't warn about null <TEXTAREA></TEXTAREA> element;
  only check for inappropriate whitespace within elements commonly
    rendered as underlined (<A> and <U>);
  check ordering of head tags before body tags even in absence of
    explicit <head>...</head>;
  allow comments between list items;
  only output non-numeric unquoted option values in each file;
  corrected processing of HTML3 <LH>;
  updated HTML 3 language definition to January 19 1995 draft;
  tinkered with Netscape extensions language-definition yet again;
  added inline=1 command-line parameter;
  added listfile=/lf= command-line parameter (especially for greater
    MS-DOS convenience);
  allow cf= as abbreviation of configfile=;
  ampersands followed by non-alphabetics generate warnings rather than
    errors (so corresponding error message was removed from entify);
  added "changed"/"unchanged" STDERR messages to htmlsrpl.pl output;
  added .gif's to documentation;
  added awk-perl.html to documentation;
  added index.html menu to documentation.

New files in this release are:

  README.41      Update notes
  index.html     HTML version of README.40, README.41, and menu
  awk-perl.html  Where to obtain Awk and Perl
  geterr.sh      Trivial script to extract only ERROR! messages from
                 htmlchek output
  geterwrn.sh    Trivial script to extract only ERROR!/Warning!
                 messages from htmlchek output
                ___
  awk.gif          |  .gif files used
  camel.gif        |  in htmlchek HTML
  ftp.gif          |  documentation
  htmlchek.gif     |  (uuencoded as .uue
  htmlchks.gif     |  files in the
  valdhtml.gif     |  comp.sources.misc
  warning.gif   ___|  Usenet distribution)

The htmlchek program performs a fairly comprehensive job of checking
for HTML errors, but does not always exactly follow the official
standard (currently this is version 1.24 of the HTML 2.0 DTD).  Bad
stylistic practices are warned against, as well as actual HTML errors,
and in some cases htmlchek is stricter than the standard, in order to
accommodate the peculiarities of some browsers.
The idea is that HTML code should be ruggedized for the real world,
rather than just being SGML-ically correct -- especially since the
official standard allows many SGML features which are hardly understood
by any HTML-specific applications; for example, according to the
official standard the following is a completely valid HTML 2.0 file
(without even any omitted tags!):

  <><HEAD/<TITLE///<BODY/text<IMG TOP SRC=x.gif<![IGNORE[ </HTML>]]>/</>

--
--Henry Churchyard   [email protected]