Neatening HTML - Evaluation

This page is part of Neatening HTML, a dissertation on the subject of the automated processing and correction of poor HTML code. It should be noted that the dissertation was written in early 2004, and so some of its content may no longer be relevant.

4. Evaluation

Testing was performed in two stages. The first stage was testing of the individual parts, which was mentioned briefly in each subsection of the Implementation and will be described in detail here. The graphical user interface was used in this stage of testing. The second stage was a mass test in which a modified version of the program downloaded several thousand pages from the internet and noted how often it could successfully replace tables used for layout purposes. The final section of the Evaluation includes screenshots to demonstrate how closely output matches the original appearance of the documents.

4.1. The HTML Parser and Pretty Printer

Testing on constructed documents

The first stage of testing the HTML Parser and Pretty Printer involved testing on specially constructed documents - both valid documents and those demonstrating the various problems the parser should have been able to fix.

The first file used was a simple file, already valid and neat, demonstrating the three data types - HTML elements, CData and special sections:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
  <head>
    <title>
      The page title
    </title>
  </head>
  <body>
    <!-- a comment -->
  </body>
</html>

The JTree showed the expected document tree and the pretty printer's output exactly matched the input.

The next files contained the common errors of missing opening tags, missing closing tags, incorrect nesting of tags, and unquoted attributes, both individually and in combinations. In all cases the parser and pretty printer behaved as expected.
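The kind of correction exercised by these test files can be illustrated with a minimal sketch. It is not the project's parser, which was written in Java; the function name `fix_nesting` and the pre-tokenised input format are assumptions made for illustration. It uses a simple stack of open elements: closing tags without a matching opening tag are dropped, elements left open inside a closed element are closed first, and anything still open at the end of the input is closed.

```python
def fix_nesting(tokens):
    """Repair a flat token list using a stack of open elements.

    tokens: list of ("open", name), ("close", name) or ("text", data) tuples.
    Returns a new token list in which every element is closed and nested.
    """
    stack = []    # names of the elements that are currently open
    output = []
    for kind, value in tokens:
        if kind == "open":
            stack.append(value)
            output.append((kind, value))
        elif kind == "close":
            if value not in stack:
                # Closing tag without a matching opening tag: drop it.
                continue
            # Close any elements left open inside the one being closed.
            while stack:
                name = stack.pop()
                output.append(("close", name))
                if name == value:
                    break
        else:
            output.append((kind, value))
    # Insert closing tags for anything still open at the end of the input.
    while stack:
        output.append(("close", stack.pop()))
    return output
```

This is only one possible strategy. A fuller implementation might, for example, reopen an element that was closed early rather than simply discarding its later closing tag.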

Testing on 'real' documents

It is one thing to test a program on data it was designed for; it is quite another to test it on real data. The second stage involved testing on a selection of documents produced by HTML editors and by people. This uncovered a number of problems, mentioned briefly in the Implementation section and corrected before proceeding to the next part of the implementation.

Conditional comments and server-side scripts were misinterpreted as HTML elements, so a document containing:

<?php a="b" ?>

would have in the neatened version:

<?php a="b" ?="">

as the whole construct was interpreted as a <?php> element with two attributes (a and ?). After making the changes described in the Implementation, the neatened version would contain the same code as the original.

Some confused authors used XHTML-style syntax for empty elements, which also confused the parser, so a document containing:

<img src="image.jpg" />

would have in the neatened version:

<img src="image.jpg" /="">

as the slash was interpreted as an attribute. After making the changes described in the Implementation, the neatened version no longer contained the slash.

When a script contained an unescaped occurrence of />, the rest of the script after that point was parsed as normal HTML. After a change to the parser to treat all data up to the closing </script> tag as CData, this problem no longer occurred.
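The change described above can be sketched as follows. This is a minimal illustration rather than the project's code: the function name `read_script_body` and the regular expression are assumptions. Once the opening tag of a script has been read, everything up to the closing </script> tag is returned untouched, so markup-like text inside the script is never parsed as HTML.

```python
import re

# Matches the closing tag of a script element, ignoring case.
SCRIPT_END = re.compile(r"</script\s*>", re.IGNORECASE)

def read_script_body(source, start):
    """Return (body, next_index) for a script whose content begins at start.

    The body is returned verbatim; next_index is the position in source
    just after the closing </script> tag, where normal parsing resumes.
    """
    match = SCRIPT_END.search(source, start)
    if match is None:
        # No closing tag: treat the rest of the document as script data.
        return source[start:], len(source)
    return source[start:match.start()], match.end()
```

For example, calling it on a document beginning `<script>document.write("<br />");</script>` with `start` just after the opening tag returns the whole `document.write` call as the body, including the `<br />` inside the string.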

After this stage of testing, the following criteria of the requirements analysis were fulfilled:

  • correct common syntax errors (including all those corrected by HTMLTidy) by:
    • inserting missing closing tags
    • removing closing tags without matching opening tags
    • correcting nesting of elements
    • quoting attributes
    • replacing invalid characters with numeric character entity references
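As an illustration of the last item in this list, the following minimal sketch replaces characters with numeric character entity references. It assumes, purely for illustration, that "invalid" means any character outside the ASCII range; the project's actual rule is not stated here, and the function name `escape_non_ascii` is hypothetical.

```python
def escape_non_ascii(text):
    """Replace every non-ASCII character with a numeric character reference."""
    return "".join(
        ch if ord(ch) < 128 else "&#%d;" % ord(ch)   # e.g. U+00E9 becomes &#233;
        for ch in text
    )
```

Using decimal references keeps the output readable in any encoding, which avoids depending on the character set declared by the document.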

4.2. Code to remove or replace elements and attributes

This section of the project consisted of a very large number of simple transformations - deciding how elements should be transformed was what made this section complicated. The first stage of testing consisted of checking that elements were replaced in the way described in the Implementation. In all cases the program performed as expected.

The second stage involved validating the documents produced. The W3C HTML Validator was used for this purpose. After the code to remove or replace invalid elements and attributes was implemented, the documents validated as HTML 4.01 Transitional. This was expected, as the program removed all elements it didn't recognise. After the code to remove or replace presentational elements and attributes was implemented, the documents validated as HTML 4.01 Strict. Approximately 50 'real' (that is, not purposefully constructed) documents were used at this stage.
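The style of transformation being tested can be sketched as a lookup table. In this minimal sketch only the <layer> to <div> rule is taken from this dissertation; the names `REPLACEMENTS`, `RECOGNISED` and `transform_element`, and the tiny set of recognised elements, are assumptions made for illustration.

```python
# Replacement rules: only the layer -> div rule comes from the text.
REPLACEMENTS = {"layer": "div"}

# A small illustrative subset of element names. A real table would list
# every element allowed by the target document type.
RECOGNISED = {"html", "head", "title", "body", "div", "p", "img"}

def transform_element(name):
    """Return the element name to output, or None if it should be removed."""
    name = name.lower()
    name = REPLACEMENTS.get(name, name)   # swap non-standard elements first
    return name if name in RECOGNISED else None
```

Because unrecognised names map to None, every element that survives is in the recognised set, which is why validation against the corresponding document type was expected to succeed.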

After this stage of testing, the following criteria of the requirements analysis were fulfilled:

  • correct common syntax errors (including all those corrected by HTMLTidy) by:
    • removing non-existent elements and attributes
    • replacing non-standard elements (for example, <layer>) with the standards-compliant equivalent (for example, <div>)
  • correct those accessibility problems described in the Web Content Accessibility Guidelines that can be corrected automatically by software:
    • 3.2 Create documents that validate to published formal grammars
    • 3.3 Use style sheets to control layout and presentation
    • 7.4 Do not create periodically auto-refreshing pages
    • 11.2 Avoid deprecated features of W3C technologies
  • output valid HTML in a readable form

4.3. Code to replace layout tables

Testing of this part of the project was done in two stages. The first stage involved testing on constructed and real-world documents. After this it emerged that some tables cannot be replaced, as described in detail in the Implementation. The second stage was the mass test, which will be described in detail later (in section 4.5).

4.4. Code to replace frames

Documents that use both cascading stylesheets and framesets are very rare, as most authors who know how to use cascading stylesheets will use these instead of frames. Because of this, testing of this section of the program relied mainly on constructed documents designed to be very difficult to convert.

The conversion to <div>s was always successful as a result of the similar structures of framesets and <div>s (that is, the division into horizontal and vertical strips). Some framesets (including almost half of the real-world framesets tested) could not be converted because untargeted links to external sites would result in potentially millions of documents. The Frame Corrector successfully detected these cases, so it meets the criteria of the requirements analysis.
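The detection of such cases can be illustrated with a minimal sketch. It is not the Frame Corrector itself: the class and function names are hypothetical, and it assumes that a link counts as external when its URL names a host other than the site's own. It uses Python's standard `html.parser` module to collect links to other hosts that have no target attribute.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class UntargetedExternalLinkFinder(HTMLParser):
    """Collects links to other hosts that have no target attribute."""

    def __init__(self, own_host):
        super().__init__()
        self.own_host = own_host
        self.problem_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        host = urlparse(href).netloc
        # Relative links have no host and stay within the site.
        is_external = host != "" and host != self.own_host
        if is_external and "target" not in attributes:
            self.problem_links.append(href)

def find_untargeted_external_links(html, own_host):
    """Return the href of every untargeted external link in html."""
    finder = UntargetedExternalLinkFinder(own_host)
    finder.feed(html)
    finder.close()
    return finder.problem_links
```

A non-empty result would mark the frameset as one that cannot be converted. The sketch ignores a default target set with a <base> element, which a complete check would also need to consider.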

This screenshot demonstrates the results of applying the Frame Corrector to a real-world page (note that the pages had to be edited slightly to remove links to external sites without the target attribute set):

The left-hand section of the page (including a menu unfortunately made up of images) was originally a frame, but in <div> form is indistinguishable. Due to Internet Explorer's incorrect implementation of cascading styles, it scrolls with the page, but in Mozilla it is fixed as the frame was.

4.5. The mass test

The mass test was the most ambitious test of the program. A modified version of the program downloaded approximately 4000 random web pages (selected using a method described in Preparation), and attempted to neaten them, recording its success rates.

The program successfully parsed and output all except two of the documents. It appears that Java misinterpreted the character sets used in those two documents, leading to the parser being unable to parse them correctly.

The Table Corrector performed well, and successfully replaced 94% of the tables it encountered. Approximately a fifth of the tables it could not replace were well-defined but not dividable, and the other four fifths were not well-defined.
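As a worked example of what these proportions imply, the short calculation below expresses the two failure categories as shares of all tables encountered. It treats the approximate figures quoted above as exact, which is an assumption made only for the arithmetic.

```python
from fractions import Fraction

replaced = Fraction(94, 100)                  # tables successfully replaced
not_replaced = 1 - replaced                   # 6% of all tables
not_dividable = not_replaced / 5              # well-defined but not dividable
not_well_defined = not_replaced * 4 / 5       # not well-defined

print(f"well-defined but not dividable: {float(not_dividable):.1%}")
print(f"not well-defined: {float(not_well_defined):.1%}")
```

On these figures, roughly 1.2% of all tables were well-defined but not dividable and roughly 4.8% were not well-defined.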

This meant that the following criterion from the requirements analysis, while not fully fulfilled, was fulfilled to the greatest extent possible in retrospect:

  • identify tables used for layout purposes and replace them with code using <div>s and cascading stylesheets

4.6. Screenshots

Neatening HTML code is pointless if the resultant documents look awful in web browsers. The following pages contain screenshots that show the effect of the neatening on a selection of famous websites, with a brief explanation of the differences in appearance.

Google

The Table Corrector successfully replaced the table Google uses for the <form> element. As a consequence of using floated elements, the table is no longer centred. The text size has also changed, because the document now triggers Internet Explorer's 'standards-compliance' mode (it is now valid HTML).

Netcraft

The Netcraft website demonstrates one problem with replacing layout tables with <div>s - browsers' algorithms for laying out tables are more advanced than those for <div>s. Here the display is good, except that the menu appears above the contents because some of the menu items are very long - browsers would have wrapped the lines earlier if they had been in a table. Note that the change in elements means that the DHTML no longer works, resulting in 'Done, but with errors on page.' being displayed in the message bar. Correcting DHTML is beyond the scope of this project.

Cambridge website

The website of the university demonstrates a slightly different problem. The two menus that should appear side-by-side appear one below the other. In this case it is not because of lines that are too long, but because the site uses nested tables and the use of floated elements by the Table Corrector means that what was the right-hand menu is moved to be below what was the left-hand menu.

This article was last edited on 16th October 2007.