This page is part of Neatening HTML, a dissertation on the subject of the automated processing and correction of poor HTML code. It should be noted that the dissertation was written in early 2004, and so some of its content may no longer be relevant.
2. Preparation
2.1. Starting point
Before starting the project I was already familiar with the various HTML specifications produced by the W3C, as well as the proposals of the Web Accessibility Initiative, and I was also aware of what the 'advanced' (for want of a less elitist word) community of HTML writers considered good and bad practice (through the comp.infosystems.www.authoring.* newsgroups). At the time I had not produced any code that was later used in the project, except for a generic tree class.
To further clarify the project proposal, an analysis of the different kinds of 'bad' or 'messy' HTML was necessary. This was done through a combination of reading documents on this topic, and looking at the HTML generated both by people (on the personal home pages of staff on the Cambridge Computer Laboratory's website) and by a selection of HTML editors (including FrontPage, Netscape/Mozilla Composer and Microsoft Word). The problems fall into three categories: syntactical errors, semantic errors, and accessibility problems.
2.2. Syntactical errors
The most frequently occurring problem in HTML documents is invalid syntax. Some of these errors occur mainly in hand-written HTML, including:
- Missing closing tags – although some closing tags are optional, many authors routinely leave out required ones as well. Many do this because they think of tags in terms of their effects or as instructions, so that, for example, </i> makes sense to them as 'turn off italics', but </p> seems unnecessary (they think <p> means 'leave a blank line').
- Incorrect nesting of elements – this can also occur as a result of authors thinking of tags as instructions (in which case <b><i></b></i> would mean the same as <b><i></i></b>), although sometimes it happens when authors forget in what order they opened tags.
Other syntactical errors occur both in hand-written HTML and HTML generated by authoring tools (usually as a result of the programmers of the authoring tools not understanding the specifications), including:
- Certain unquoted attributes – in most cases it is safe not to quote attributes, but because HTML takes advantage of SGML's 'shorttag' property this causes problems if the attribute value contains the stroke character (/), which is common in web addresses [1].
- Invalid characters – editors on the Windows platform routinely use characters not permitted in HTML, including Microsoft's 'smart quotes'.
- Invalid elements and attributes – although mistyping causes these errors occasionally, frequently they are the result of using elements or attributes invented by browser manufacturers.
The parser used in the project must be able to cope with these errors, and as a result a normal SGML or HTML parser cannot be used (an SGML parser would be confused by the unquoted attributes described above, whereas an HTML parser would, for example, interpret <span><p></p></span> as <span></span><p></p>, when the author almost certainly meant <div><p></p></div>).
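The repairs for missing closing tags and incorrect nesting can be illustrated with a stack of open elements. The following is only a minimal sketch in Python, not the parser written for the project: it relies on the standard library's lenient html.parser module, and the names NestingRepairer, VOID and repair are invented for this example. It does not model optional closing tags (a second <p> does not close the first), and it discards comments and the document type declaration.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list, for illustration only).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class NestingRepairer(HTMLParser):
    """Re-emit a document so that every open element is closed in order."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.open_tags = []  # stack of currently open element names
        self.out = []        # pieces of the repaired document

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # closing tag without a matching opening tag: drop it
        # Close anything opened after `tag` first, then `tag` itself.
        while True:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def result(self):
        # Close whatever is still open when the input ends.
        closers = "".join(f"</{tag}>" for tag in reversed(self.open_tags))
        return "".join(self.out) + closers


def repair(source):
    parser = NestingRepairer()
    parser.feed(source)
    parser.close()
    return parser.result()
```

Under these assumptions, repair("<b><i>x</b></i>") closes the <i> element before the <b> element and drops the stray </i> that follows.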
References
- [1] The Saga Of The Slashed Validators - http://www.cs.tut.fi/~jkorpela/qattr.html
2.3. Semantic errors
When writing about semantic errors, there are two overlapping viewpoints to consider: the purist perspective, described in the following section, and the usability perspective, described in the section after. The project aims to produce code that would satisfy both groups.
The purist perspective
Purists believe that HTML documents should not only validate (that is, contain no syntax errors), but should also follow the specification in terms of semantics. Although the specifications are frequently vague, they do make many statements with regard to the meaning of the elements, including:
- HTML elements (with the exception of some deprecated elements) are logical, not presentational, and cascading stylesheets (CSS) should be used to suggest presentation.
- Tables should be used only for tabular data, and never for layout purposes.
The purist perspective has the problem that HTML does not contain enough logical mark-up to always be used purely logically (for example, the mark-up for definitions is inadequate), and few browsers fully support cascading stylesheets.
The usability perspective
Usability experts take a different, more practical view. Although compliance with the standards is still important, it is more important to make HTML documents and websites usable. The Web Content Accessibility Guidelines are written from this perspective, and say, for example:
- If tables are used for layout then they must still make sense when linearised (as this is how a blind person would experience them).
- Avoid deprecated features of the W3C specifications (as their behaviour in some browsers is unpredictable).
In some areas, usability experts are stricter than purists – for example, although frames are part of the HTML specification, usability experts advise against using them due to well-documented problems (including issues with bookmarking and printing framesets).
2.4. Code layout
It is easier to maintain HTML documents when the source is written neatly. Although whitespace does not affect the presentation of the document in a web browser, this issue is relevant to the project's aim of 'neatening HTML'. Consider the following two pieces of code – both are equivalent, differing only in whitespace, but the second version is far easier to read, as it shows the structure of the document:
<dl><dt>HTML</dt><dd>abbr. <img src="technical.png">Hypertext Mark-up Language</dd><dt>HW</dt><dd>abbr. Heavy-weight</dd></dl>
<dl>
  <dt>
    HTML
  </dt>
  <dd>
    abbr.
    <img src="technical.png">
    Hypertext Mark-up Language
  </dd>
  <dt>
    HW
  </dt>
  <dd>
    abbr. Heavy-weight
  </dd>
</dl>
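A layout of this kind can be produced mechanically. The sketch below, in Python, is an illustration rather than the project's 'pretty printer': it uses the standard library's html.parser module, and the names PrettyPrinter, VOID and pretty are invented for the example. It assumes, as the paragraph above does, that whitespace is not significant, so it would not be safe for <pre> elements, and it collapses runs of whitespace in text.

```python
import html
from html.parser import HTMLParser

# Elements with no closing tag (partial list, for illustration only).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class PrettyPrinter(HTMLParser):
    """Emit one tag or text run per line, indented by nesting depth."""

    def __init__(self, indent="  "):
        super().__init__()
        self.indent = indent
        self.depth = 0
        self.lines = []

    def _emit(self, text):
        self.lines.append(self.indent * self.depth + text)

    def handle_starttag(self, tag, attrs):
        # Always quote attribute values on output.
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{html.escape(value)}"'
            for name, value in attrs
        )
        self._emit(f"<{tag}{rendered}>")
        if tag not in VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        self.depth = max(self.depth - 1, 0)
        self._emit(f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self._emit(html.escape(text, quote=False))


def pretty(source):
    printer = PrettyPrinter()
    printer.feed(source)
    printer.close()
    return "\n".join(printer.lines)
```

Passing the single-line definition list above to pretty() yields the indented form, one tag or text run per line.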
2.5. Requirements analysis
After the analysis above, it is possible to refine the requirements given in the Introduction section.
The program should:
- take an HTML file as input
- correct common syntax errors (including all those corrected by HTMLTidy) by:
  - inserting missing closing tags
  - removing closing tags without matching opening tags
  - correcting nesting of elements
  - quoting attributes
  - replacing invalid characters with numeric character references
  - removing non-existent elements and attributes
  - replacing non-standard elements (for example, <layer>) with the standards-compliant equivalent (for example, <div>)
- replace frames with code using <div>s and cascading stylesheets where practical
- replace presentational mark-up with the equivalent CSS
- identify tables used for layout purposes and replace them with code using <div>s and cascading stylesheets
- correct those accessibility problems described in the Web Content Accessibility Guidelines that can be corrected automatically by software:
  - 3.2 Create documents that validate to published formal grammars
  - 3.3 Use style sheets to control layout and presentation
  - 5.3 Do not use tables for layout unless the table makes sense when linearized.
  - 7.1 Avoid causing the screen to flicker.
  - 7.4 Do not create periodically auto-refreshing pages.
  - 11.2 Avoid deprecated features of W3C technologies.
- output valid HTML in a readable form
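As an illustration of one of these requirements, replacing invalid characters with numeric references, the following Python sketch may help. It assumes the input bytes are in the Windows-1252 encoding (where Microsoft's 'smart quotes' live); that assumption and the function name to_numeric_references are mine, not part of the requirements.

```python
def to_numeric_references(raw: bytes) -> str:
    """Decode Windows-1252 bytes, then replace every non-ASCII character
    with a numeric character reference such as &#8220;."""
    # Assumption: the source really is Windows-1252. Bytes with no
    # meaning in that encoding become U+FFFD, the replacement character.
    text = raw.decode("cp1252", errors="replace")
    # The "xmlcharrefreplace" error handler writes &#N; for anything
    # that cannot be represented in ASCII.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
```

For example, the bytes 0x93 and 0x94 (the opening and closing 'smart' double quotes) come out as &#8220; and &#8221;, which are valid in any HTML document regardless of its declared encoding.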
2.6. Test data
For most parts of the program specially constructed test documents can be used for initial testing, but testing must then be performed using real-world data. Fortunately there is no shortage of bad HTML – almost all of the World Wide Web's estimated six billion pages are invalid. To get a relatively representative sample, the following procedure will be used:
- A list of about a hundred common (that is, non-technical) but relatively long words will be constructed. This can be created quickly by hand – looking at the above paragraph, for example, gives 'specially', 'performed', 'fortunately', 'shortage', 'estimated', 'relatively' and 'following'.
- A program searches for the words using a search engine such as Google or AllTheWeb.
- The program takes pages randomly from the entire lists of results, and uses those pages as test data.
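The sampling step of this procedure might look something like the Python sketch below. It is an outline under assumptions: the search engine is represented by a caller-supplied function, search, that maps a word to a list of result URLs (no real search engine interface is implied), and the names sample_test_pages and per_word are invented for the example.

```python
import random
from typing import Callable, Iterable, List


def sample_test_pages(
    words: Iterable[str],
    search: Callable[[str], List[str]],  # hypothetical: word -> result URLs
    per_word: int,
    rng: random.Random,
) -> List[str]:
    """Pick up to `per_word` result URLs at random for each word,
    never choosing the same URL twice."""
    chosen: List[str] = []
    seen = set()
    for word in words:
        # Drop URLs already chosen, and duplicates within this result list.
        results = list(dict.fromkeys(u for u in search(word) if u not in seen))
        picked = rng.sample(results, min(per_word, len(results)))
        chosen.extend(picked)
        seen.update(picked)
    return chosen
```

Passing in a seeded random.Random makes the selection repeatable, which is useful if a test run needs to be reproduced later.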
2.7. Top-level design
The following top-level design seems most logical:

2.8. Software engineering model
The spiral model of software engineering is most appropriate for this project, as the project can clearly be divided into a number of sections, each of which consists of stages of analysing alternatives, implementing the code, and then testing it (the structure means each section can be tested as it is written). In more detail, these are the stages for each of the five sections:
- The HTML parser (inputs HTML) and 'pretty printer' (outputs HTML):
  - Create data structures for HTML document trees and decide on form of parser
  - Implement data structures, HTML parser and 'pretty printer'
  - Test that syntax errors described in the requirements are corrected
- Code to remove or replace invalid elements:
  - Determine the most appropriate replacements (this will usually involve use of either cascading stylesheets or the <object> tag)
  - Implement code to crawl the document tree removing or replacing invalid elements
  - Test that documents produced are valid HTML (transitional, as deprecated elements are not removed at this stage)
- Code to replace presentational elements with cascading stylesheets:
  - Determine the most appropriate replacements, recalling the Web Content Accessibility Guidelines (for example, the Cascading Stylesheets Level 2 specification allows the property text-decoration:blink, but this should not be used as a replacement for the <blink> tag as it causes the screen to flicker)
  - Extend tree crawling code described above to also perform these replacements
  - Test that (subject to changes required by the Web Content Accessibility Guidelines) the appearance of pages matches their appearance before the replacements
- Code to replace layout tables with code using <div>s and cascading stylesheets:
  - Create heuristics that can reliably detect tables used for layout purposes, and create algorithms to determine the combination of nested <div>s that will give the same appearance
  - Implement the heuristics and replacement algorithms
  - Test that the heuristics work reliably (false negatives can be tolerated, but not false positives), and that the replacement for the table has an appearance as similar as possible to the original
- Code to replace frames with code using <div>s and cascading stylesheets where practical:
  - Create algorithms that can efficiently determine the possible combinations of documents that could be displayed in the frames (each combination will require a separate output file), and algorithms to convert the frameset (frames are specified in a different way from table cells, so that previous algorithms for tables cannot be used)
  - Implement the replacement algorithm
  - Test that the documents produced behave as the frameset did (links must load the appropriate document)
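The tree crawling used in the second and third sections can be pictured with a small example. The Python sketch below is an illustration, not the project's code: the Element class stands in for whatever document tree structure is eventually chosen, and the REPLACEMENTS table is a hypothetical mapping with only a handful of entries. It follows the constraint noted above that <blink> must not be replaced by text-decoration:blink.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Element:
    """A stand-in document tree node (an assumption for this sketch)."""
    tag: str
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Element", str]] = field(default_factory=list)


# Hypothetical mapping: presentational tag -> (replacement tag, CSS or None).
REPLACEMENTS = {
    "b": ("span", "font-weight: bold"),
    "i": ("span", "font-style: italic"),
    "u": ("span", "text-decoration: underline"),
    "center": ("div", "text-align: center"),
    # No CSS for <blink>: text-decoration: blink would still flicker.
    "blink": ("span", None),
}


def replace_presentational(node: Element) -> None:
    """Walk the tree depth-first, rewriting presentational elements in place."""
    if node.tag in REPLACEMENTS:
        new_tag, css = REPLACEMENTS[node.tag]
        node.tag = new_tag
        if css:
            existing = node.attrs.get("style")
            # Append to any style the element already carries.
            node.attrs["style"] = f"{existing}; {css}" if existing else css
    for child in node.children:
        if isinstance(child, Element):  # text children are plain strings
            replace_presentational(child)
```

Writing the CSS into a style attribute keeps the sketch short; collecting the declarations into a single stylesheet would be an equally valid design.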