Neatening HTML - Implementation

This page is part of Neatening HTML, a dissertation on the subject of the automated processing and correction of poor HTML code. It should be noted that the dissertation was written in early 2004, and so some of its content may no longer be relevant.

3. Implementation

The spiral model of software engineering was chosen for the project, and the structure of this chapter will reflect that - it will be divided into five subsections, each detailing the analysis and choice of solution, and then the changes made after testing. An additional subsection briefly describes the interface added to aid in testing. Testing will be described in detail in the Evaluation chapter.

3.1. The HTML parser and 'pretty printer'

The first thing to consider was the data structure to be used to represent the HTML internally. Written HTML is the linearization of a document tree, and so a tree data structure provides the greatest flexibility when performing modifications. I had already written a class called NaryTreeNode prior to the project, which was suitable for this task. It represents a tree node with an arbitrary number of children. Each node can contain a Java Object. Its method for adding children ensures that the tree structure is maintained, so any tree-walking algorithm is guaranteed to terminate. Three types of data feature in HTML documents:

  • HTML elements (for example <a href="http://www.safalra.com/">)
  • Special sections (for example <!-- comment --> or <!DOCTYPE...)
  • CData (any text that is not an HTML element or a special section)

Classes would be needed to represent these three types of data.

HTML elements

The parser would create the HTML element objects as it reads the document. As a result, the constructor does not initialise the values of the fields - separate methods are used to set the name of the element (for example, 'a'), add new attribute names (for example, 'href'), and add new attribute values (for example 'http://www.safalra.com/') - and to allow easy modification, attribute values can be changed once they are initialised. The Java Vector class (part of the java.util package) is suitable for storing attribute names and values - two Vectors are used, one for the names and one for the values, with the ith value corresponding to the ith name - the indexOf() method makes it easy to find the appropriate position for a value given the name. Vectors also expand automatically as more attributes are added.
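As an illustration only (this is not the project's code, and the class and method names are hypothetical), the two parallel Vectors described above might be sketched as follows, with indexOf() locating the position shared by a name and its value:

```java
import java.util.Vector;

// Hypothetical sketch of an element class backed by two parallel Vectors.
class HtmlElementSketch {
    private String name;
    private final Vector<String> attributeNames = new Vector<>();
    private final Vector<String> attributeValues = new Vector<>();

    void setName(String name) { this.name = name; }
    String getName() { return name; }

    // Adds a name with an empty value; the value can be filled in later.
    void addAttributeName(String attributeName) {
        attributeNames.add(attributeName);
        attributeValues.add("");
    }

    // Sets (or changes) the value stored at the position of the matching name.
    void setAttributeValue(String attributeName, String value) {
        int i = attributeNames.indexOf(attributeName);
        if (i < 0) {
            addAttributeName(attributeName);
            i = attributeNames.size() - 1;
        }
        attributeValues.set(i, value);
    }

    // Returns the value for the name, or null if the attribute is absent.
    String getAttributeValue(String attributeName) {
        int i = attributeNames.indexOf(attributeName);
        return i < 0 ? null : attributeValues.get(i);
    }
}
```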

Special sections

A special section consists of a start marker (for example <!--), some text, and an end marker (for example -->). Having read the start marker of a special section, the parser would continue in the same state (see the description of the parser state machine below) until reading the end marker. This means that the markers and text can be specified in the constructor for special sections. By using the CData class to store the text, errors such as invalid characters can be corrected without duplicating the correction code in the special section class.

CData

The CData class represents CData in a canonical form. It compresses consecutive whitespace characters into a single space, and also corrects invalid characters by replacing them with the appropriate numeric character entity reference.
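A minimal sketch of such a canonical form is given below. The text does not define which characters count as invalid, so the sketch assumes, purely for illustration, that anything outside printable ASCII is replaced; the class and method names are hypothetical.

```java
// Hypothetical sketch of CData canonicalisation.
class CDataSketch {
    // Collapses each run of whitespace to a single space and replaces
    // characters above code point 126 (an assumed definition of "invalid")
    // with numeric character references.
    static String canonicalise(String text) {
        StringBuilder out = new StringBuilder();
        boolean previousWasSpace = false;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            if (Character.isWhitespace(codePoint)) {
                if (!previousWasSpace) out.append(' ');
                previousWasSpace = true;
            } else {
                previousWasSpace = false;
                if (codePoint > 126) {
                    out.append("&#").append(codePoint).append(';');
                } else {
                    out.append((char) codePoint);
                }
            }
        }
        return out.toString();
    }
}
```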

The parser

The HTML parser was implemented as a state machine with seven states, using the Java switch/case statements. The states and their transitions are (if there are no matching transitions, the state does not change):

  • Reading text
    • Read < change to Reading tag
  • Reading tag
    • Read > change to Reading text
    • Read ! change to Reading text
    • Read / change to Reading end tag
    • Read whitespace, change to Reading attribute name
  • Reading end tag
    • Read > change to Reading text
  • Reading attribute name
    • Read > change to Reading text
    • Read whitespace, change to Reading attribute name
    • Read = change to Reading attribute value
  • Reading attribute value
    • Read > change to Reading text
    • Read whitespace, change to Reading attribute name
    • Read " change to Reading quoted attribute value
  • Reading quoted attribute value
    • Read " change to Reading attribute name
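The transitions can be sketched as a switch statement over an enumeration of states. This is a sketch of the transition function only, not the project's parser: it covers the six states listed, performs none of the tree-building actions, and all names are hypothetical. It assumes that '=' leads to Reading attribute value and that whitespace after an attribute name begins another attribute name, since that is what allows a quoted value to be reached.

```java
// Hypothetical sketch of the parser's state transitions.
class ParserStateSketch {
    enum State { TEXT, TAG, END_TAG, ATTRIBUTE_NAME, ATTRIBUTE_VALUE, QUOTED_ATTRIBUTE_VALUE }

    // Returns the next state for one input character; when no transition
    // matches, the state is unchanged.
    static State next(State state, char c) {
        boolean space = Character.isWhitespace(c);
        switch (state) {
            case TEXT:
                return c == '<' ? State.TAG : state;
            case TAG:
                if (c == '>' || c == '!') return State.TEXT;
                if (c == '/') return State.END_TAG;
                return space ? State.ATTRIBUTE_NAME : state;
            case END_TAG:
                return c == '>' ? State.TEXT : state;
            case ATTRIBUTE_NAME:
                if (c == '>') return State.TEXT;
                return c == '=' ? State.ATTRIBUTE_VALUE : state;
            case ATTRIBUTE_VALUE:
                if (c == '>') return State.TEXT;
                if (c == '"') return State.QUOTED_ATTRIBUTE_VALUE;
                return space ? State.ATTRIBUTE_NAME : state;
            case QUOTED_ATTRIBUTE_VALUE:
                return c == '"' ? State.ATTRIBUTE_NAME : state;
            default:
                return state;
        }
    }

    // Feeds a whole string through the machine, starting in the TEXT state.
    static State run(String input) {
        State state = State.TEXT;
        for (char c : input.toCharArray()) state = next(state, c);
        return state;
    }
}
```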

The HTML tree builder

Many of the transitions involve making changes to the document tree. To keep the HTML parser's class readable and manageable, and to avoid it getting too large, a second class called the HTML Tree Builder was used. This not only builds the tree, but also corrects many syntactic errors as it builds it. Furthermore, its methods all take Strings as parameters, so that the parser is independent of the data structures used in the document tree. Two of its methods deserve special attention:

  • addElement adds an element to the tree. It converts the name of the element to lower case, leading to faster matching (the Java String class equals method can be used to compare element names) and more readable output (studies have shown that lower case letters are easier to read). It also checks if an element is empty (has no end tag), using the Java Arrays class binarySearch method on a static array in the HTML element class, so that children are added to its parent and not to the empty element.
  • closeElement is called by the parser when it has read an end tag. It simultaneously deals with the problems of missing closing tags, missing opening tags and incorrectly nested elements by seeking a matching element higher up the document tree, and closing all elements up to that point if a matching element is found, but having no effect if no element matches. This means that:
    • <b><i>bold and italic</b> (missing closing tag) is treated as <b><i>bold and italic</i></b> as the method closes the <i> element when it closes the <b> element.
    • <b>bold</i></b> (missing opening tag) is treated as <b>bold</b> as the </i> has no effect.
    • <b>bold <i>and italic</b></i> (incorrect nesting of tags) is treated as <b>bold <i>and italic</i></b> (this is actually a combination of the two cases above - there is a missing closing tag before the </b> and a missing opening tag for the </i>).
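The behaviour of closeElement in these three cases can be reproduced with a small stand-in for the tree builder. The sketch below is an assumption-laden illustration rather than the project's class: the Node type, the addText and serialise methods and the "#root" placeholder are all hypothetical, and the handling of empty elements is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of a tree builder with a forgiving closeElement.
class TreeBuilderSketch {
    // Minimal node: an element has a name, a text node has only text.
    static class Node {
        final String name;
        final String text;
        final Node parent;
        final List<Node> children = new ArrayList<>();
        Node(String name, String text, Node parent) {
            this.name = name;
            this.text = text;
            this.parent = parent;
        }
    }

    private final Node root = new Node("#root", null, null);
    private Node current = root;   // innermost element that is still open

    void addElement(String name) {
        Node node = new Node(name.toLowerCase(Locale.ROOT), null, current);
        current.children.add(node);
        current = node;
    }

    void addText(String text) {
        current.children.add(new Node(null, text, current));
    }

    // Seeks a matching open element up the tree. If one is found, it and
    // every element opened inside it are closed; otherwise nothing happens.
    void closeElement(String name) {
        String wanted = name.toLowerCase(Locale.ROOT);
        for (Node node = current; node != root; node = node.parent) {
            if (node.name.equals(wanted)) {
                current = node.parent;
                return;
            }
        }
    }

    String serialise() {
        StringBuilder out = new StringBuilder();
        write(root, out);
        return out.toString();
    }

    private static void write(Node node, StringBuilder out) {
        for (Node child : node.children) {
            if (child.name == null) {
                out.append(child.text);
            } else {
                out.append('<').append(child.name).append('>');
                write(child, out);
                out.append("</").append(child.name).append('>');
            }
        }
    }
}
```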

The 'pretty printer'

The pretty printer outputs the document to file by recursively calling a function on the document tree. First the <!DOCTYPE... line is output - the original implementation output a transitional DOCTYPE, and this was later changed to a strict DOCTYPE once code to remove or replace invalid elements and attributes was produced - this meant an HTML validator could be used in testing, by validating the output of the program. After outputting the DOCTYPE, it calls a recursive function, which can be summarised in pseudocode as:

recursiveOutput(spaces,node)
  if the node is CData
    output spaces followed by the CData
  else if the node is a special section
    output spaces, the start marker, CData and end marker
  else (the node is an HTML element)
    output spaces followed by < and the element name
    for each attribute
      output a space, attribute name, = and attribute value
    output >
    for each child of the node
      recursiveOutput(spaces+"  ",child)
    output spaces followed by </, the element name and >

The purpose of the 'spaces' variable is to indent the output in order to make it more readable - indentation makes it easier to find matching opening and closing tags, so documents are more likely to remain correct after they have been edited. As an example, the following output from the pretty printer...

<html>
  <head>
    <title>Page title</title>
  </head>
  <body>
    <p>
      Some text
    </p>
  </body>
</html>

...is easier to read than the non-indented version:

<html>
<head>
<title>Page title</title>
</head>
<body>
<p>
Some text
</p>
</body>
</html>
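The recursiveOutput pseudocode translates fairly directly into Java. The following is a sketch under stated assumptions, not the project's pretty printer: the Node type and its factory methods are hypothetical, output goes to a StringBuilder rather than a file, every output ends its line, and attribute values are always quoted.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the recursive pretty printer.
class PrettyPrinterSketch {
    // CData has only text; a special section also has markers;
    // an element has a name, attributes and children.
    static class Node {
        String name, text, startMarker, endMarker;
        final Map<String, String> attributes = new LinkedHashMap<>();
        final List<Node> children = new ArrayList<>();
    }

    static Node cdata(String text) {
        Node node = new Node();
        node.text = text;
        return node;
    }

    static Node element(String name, Node... children) {
        Node node = new Node();
        node.name = name;
        for (Node child : children) node.children.add(child);
        return node;
    }

    static void recursiveOutput(String spaces, Node node, StringBuilder out) {
        if (node.name != null) {
            // HTML element: start tag, indented children, end tag.
            out.append(spaces).append('<').append(node.name);
            for (Map.Entry<String, String> a : node.attributes.entrySet()) {
                out.append(' ').append(a.getKey()).append("=\"").append(a.getValue()).append('"');
            }
            out.append(">\n");
            for (Node child : node.children) {
                recursiveOutput(spaces + "  ", child, out);
            }
            out.append(spaces).append("</").append(node.name).append(">\n");
        } else if (node.startMarker != null) {
            // Special section: start marker, CData and end marker on one line.
            out.append(spaces).append(node.startMarker).append(node.text)
               .append(node.endMarker).append('\n');
        } else {
            // Plain CData.
            out.append(spaces).append(node.text).append('\n');
        }
    }
}
```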

Changes after testing

Testing (described in detail in the Evaluation chapter) revealed four cases where the parser misinterpreted the document. It should be noted that all of these are invalid HTML, and the parser behaved correctly on valid HTML, as well as on HTML with syntax errors of the type described above:

  • Conditional comments - a Microsoft construct that is intended to be compatible with traditional HTML comments (so that Microsoft could not be accused of encouraging non-standard code), but that many Microsoft programs (including some versions of Microsoft Word) output incorrectly - loading such a page in Opera and Mozilla showed that these cause problems for parsers in many web browsers. After a slight change to the section of the parser corresponding to the Reading tag state, the conditional comments were treated as special sections and no longer caused problems.
  • XHTML confusion - many page authors seem confused about XHTML, and frequently HTML pages include XHTML-style empty elements (these end with />) - the interpretation according to the specification differs from that of browsers, which ignore the mistake. The section of the parser corresponding to the Reading attribute name state was altered so that the / was ignored and no longer interpreted as an attribute name.
  • Buggy scripts - many inline scripts contain unescaped occurrences of </, which would be interpreted according to the specification as the end of the <script> element. A safe assumption to make is that all data up to the closing </script> should be considered literally. Small alterations to the sections of the parser corresponding to the Reading tag, Reading attribute name and Reading attribute value states called the function readUntil (already used by special sections) to do this.
  • Server-side scripts - usually these would not be seen by HTML parsers so could be ignored, but the program may be used to neaten unprocessed server-side documents. As such, these should be treated much like normal scripts, where all data up to the closing 'tag' is considered literally. This required two new transitions to be added to the Reading tag state, one for server-side PHP, which uses <? and ?> to delimit its code, and one for ASP and JSP, which use <% and %> to delimit their code:
    • Read ? change to Reading text
    • Read % change to Reading text

3.2. Code to remove or replace invalid elements

Two classes were used to remove or replace invalid elements. The HTML Element Corrector class replaces a selection of invalid elements and some non-presentational deprecated elements with standards-compliant equivalents (the next section, 'Code to replace presentational elements with CSS', describes how this was extended to presentational elements).

The HTML Tree Corrector class removes any remaining invalid elements and corrects some other nesting problems (some of which may have been caused by the HTML Element Corrector). These nesting problems could have been corrected by the HTML Tree Builder, but that would involve duplicating some features of the HTML Element Corrector class in the HTML Tree Builder class, which is inefficient and could lead to errors if one of the two was updated and the other wasn't.

The HTML Element Corrector

The HTML Element Corrector uses a pre-order traversal of the document tree, as some elements to be replaced may contain other elements as children (for example, <applet> may contain <param>), and these should be replaced in the context of the parent elements. The traversal algorithm tests each node to see if it is an HTML Element, and then uses the Java Arrays class binarySearch method to see if the element is in an array of elements to be replaced (the array is a static variable within the class). If it is, the replaceElement method is called, which then replaces the element with the standards-compliant equivalent.
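The lookup step can be sketched as below. Arrays.binarySearch only gives correct results on a sorted array, so the static array must be kept in order. The class name and the exact contents of the array are assumptions; the element names are those discussed in this section.

```java
import java.util.Arrays;

// Hypothetical sketch of the "is this element to be replaced?" test.
class ReplaceableElementsSketch {
    // Kept in sorted order, as Arrays.binarySearch requires.
    private static final String[] REPLACED = {
        "applet", "dir", "embed", "isindex", "layer",
        "listing", "menu", "nextid", "plaintext", "xmp"
    };

    static boolean needsReplacing(String elementName) {
        // binarySearch returns a non-negative index only when the key is found.
        return Arrays.binarySearch(REPLACED, elementName) >= 0;
    }
}
```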

Deprecated elements from HTML 4.01

The deprecated <dir> and <menu> elements were the easiest to decide the replacement for, as the HTML 4.01 specification itself recommends one:

The DIR element was designed to be used for creating multicolumn directory lists. The MENU element was designed to be used for single column menu lists. Both elements have the same structure as UL, just different rendering. In practice, a user agent will render a DIR or MENU list exactly as a UL list.

We strongly recommend using UL instead of these elements.

So the HTML corrector replaces an occurrence of <dir> or <menu> with <ul>, and copies any attributes (<dir> and <menu> only supported the 'core', 'internationalisation' and 'events' attributes supported by every element).

<isindex> is a curiously named deprecated element that creates a single line text input control with a prompt specified by the 'prompt' attribute. This should be replaced by an <input> element of type 'text' (with all attributes except the 'prompt' copied over), with the prompt inside another containing element (the HTML Element Corrector uses a <div> - the other choice was <p> but the prompt would not usually constitute a paragraph). Thus a document containing:

<isindex prompt="Your name: ">

Would have in the neatened version:

<div>Your name: <input type="text"></div>

The deprecated <applet> element is the most complicated element to replace, as it is replaced by the <object> element, which does not have equivalents for all the attributes. Instead, some attributes become <param> elements that are children of the <object> element. In particular:

  • The 'codebase' attribute is replaced by a <param> element with its 'name' attribute set to 'codebase' and its 'value' attribute set to the 'codebase' attribute's value.
  • The 'code' attribute is replaced by a <param> element with its 'name' attribute set to 'code' and its 'value' attribute set to the 'code' attribute's value.
  • 'width', 'height' and the 'core', 'internationalisation' and 'events' attributes are copied over to the <object> element.
  • The <object> element has its 'classid' attribute set to 'clsid:CAFEEFAC-0014-0001-0001-ABCDEFFEDCBA' (the appropriate value for Java).

This requires more complex manipulation by the HTML Element Corrector, as the document tree's structure is changed. First it creates the <object> element, sets its attributes, and creates a tree node containing it. Then the <param> elements for the 'codebase' and 'code' attributes are created and their nodes added to the <object> node as children. After that, the <applet> node's children are added to the <object> node as children. Finally, the <applet> node is removed from the document tree and replaced by the <object> node. As an example, a document containing:

<applet codebase="apps" code="applet.java">
  <param name="anonymous" value="something">
  <p>Alternative text</p>
</applet>

Would have in the neatened version:

<object classid="clsid:CAFEEFAC-0014-0001-0001-ABCDEFFEDCBA">
  <param name="codebase" value="apps">
  <param name="code" value="applet.java">
  <param name="anonymous" value="something">
  <p>Alternative text</p>
</object>
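The transformation can be sketched with a simplified element type. This is an illustration, not the HTML Element Corrector itself: the Element class is hypothetical, the list of copied attributes is an assumed spelling-out of the 'core' and 'internationalisation' groups (event attributes are recognised by their "on" prefix), and swapping the returned node into the document tree is left to the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of replacing <applet> with <object>.
class AppletReplacementSketch {
    static class Element {
        final String name;
        final Map<String, String> attributes = new LinkedHashMap<>();
        final List<Object> children = new ArrayList<>();   // Elements or Strings (CData)
        Element(String name) { this.name = name; }
    }

    static final String JAVA_CLASSID = "clsid:CAFEEFAC-0014-0001-0001-ABCDEFFEDCBA";

    // Assumed members of the attribute groups that are copied across.
    private static final List<String> COPIED =
        Arrays.asList("width", "height", "id", "class", "style", "title", "lang", "dir");

    static Element replaceApplet(Element applet) {
        Element object = new Element("object");
        object.attributes.put("classid", JAVA_CLASSID);
        for (Map.Entry<String, String> a : applet.attributes.entrySet()) {
            if (COPIED.contains(a.getKey()) || a.getKey().startsWith("on")) {
                object.attributes.put(a.getKey(), a.getValue());
            }
        }
        // 'codebase' and 'code' become <param> children, in that order.
        for (String key : new String[] {"codebase", "code"}) {
            if (applet.attributes.containsKey(key)) {
                Element param = new Element("param");
                param.attributes.put("name", key);
                param.attributes.put("value", applet.attributes.get(key));
                object.children.add(param);
            }
        }
        // The applet's own children (existing <param>s and alternative content) follow.
        object.children.addAll(applet.children);
        return object;
    }
}
```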

Deprecated elements from HTML 3.2

The <xmp> and <listing> elements were deprecated in HTML 3.2 because of the new <pre> element. It should be noted that <xmp> (example output) and <listing> (program source) were logical elements, and that their replacement by the presentational <pre> was a misjudgement. However, if the program is to produce valid HTML they must be replaced in the document tree by <pre>.

The <plaintext> element was one of the worst features of HTML until it was deprecated in HTML 3.2. In 1991 Tim Berners-Lee described it in an e-mail:

<PLAINTEXT> is used to indicate that the rest of the file is
in fact just ASCII. It turns off SGML parsing completely.
It's a fudge for the moment, until we have the document
format negociation [sic].

Unfortunately it remained until 1997. There is no exact replacement, but in almost all real-world uses it can be safely replaced by <pre>, and this is how the HTML Element Corrector behaves.

Deprecated elements from HTML 2.0

HTML 2.0 contained an element called <nextid>, for historical purposes only. It has no effect whatsoever in any HTML browser (it was inherited from the SGMLguid language on which HTML was based, in which it is used to suggest the next name for an anchor), so the HTML Element Corrector removes it.

Deprecated attributes

Numerous attributes have been deprecated over the HTML specifications. The following list gives a brief summary of what the HTML Element Corrector does in response to each of them:

  • The 'compact' attribute suggested browsers should render a list in a more compact way. This is purely presentational, and it was never clearly defined how the list should be compacted, so it was deprecated. The HTML Element Corrector removes this attribute.
  • The 'language' attribute specified the language of a script from a predefined list. The list was never standardised (only 'javascript' is widely recognised) so this attribute was replaced with a type attribute specifying a MIME-type. The HTML Element Corrector performs this conversion, replacing, for example, language="javascript" with type="text/javascript".
  • The 'noshade' attribute suggested browsers should render a horizontal rule in one colour. This was deprecated as no rendering for a normal horizontal rule was specified. The HTML Element Corrector removes this attribute.
  • The 'nowrap' attribute instructed browsers not to wrap text in a table cell. This led to excessively wide table cells so was deprecated. The HTML Element Corrector removes this attribute.
  • The 'version' attribute specified which version of HTML the document uses. This is redundant now that the DOCTYPE statement exists. The HTML Element Corrector removes this attribute (note that it would be removed even if it had not been deprecated, as the project will output valid HTML 4.01 using the appropriate DOCTYPE).
  • The 'methods' attribute was introduced in HTML 2.0 to specify action to be taken on the target of a link, but was never properly defined semantically and disappeared from the language without ever being deprecated. It has no effect whatsoever in any HTML browser, and the HTML Element Corrector removes it.

Non-standard elements

As well as the elements and attributes that the standards have deprecated, there are a number of elements and attributes that were never included in any HTML standard.

The <embed> element was introduced by Netscape (in Navigator 2.0) to include in a page objects that use plug-ins. This has been superseded by the <object> element. The <embed> 'src' attribute is equivalent to the <object> 'data' attribute, and the 'width' and 'height' attributes are the same in both.

The <layer> element, introduced by Netscape, is the most complicated to replace of the non-standard elements. The standards-compliant equivalent is <div>, which doesn't support any of the attributes of <layer>. Instead, cascading stylesheets are used, in this case through the 'style' attribute, to position the <div>. If the <layer> had the 'width' or 'height' attributes set, these become 'width' or 'height' properties in CSS. If it had the 'top' or 'left' attributes set, the CSS 'position' property must be set to 'absolute', and the CSS 'top' or 'left' property used. The 'bgcolor' attribute is replaced by the CSS 'background-color' property, and the 'background' attribute is replaced by the CSS 'background-image' property (in both these cases, the CSS shorthand 'background' property can be used). As an example, a document containing:

<layer left="25%" width="50%" bgcolor="red">
  I'm red and take up half the page.
</layer>

Would have in the neatened version:

<div style="position:absolute;left:25%;width:50%;background:red;">
  I'm red and take up half the page.
</div>
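The attribute-to-property mapping can be sketched as a function from the <layer> attributes to a 'style' value. The class and method names are hypothetical, the attribute values are assumed to be usable as CSS values without further conversion (as the percentages in the example are), and the shorthand 'background' property is used for both 'bgcolor' and 'background'.

```java
import java.util.Map;

// Hypothetical sketch of building a 'style' value from <layer> attributes.
class LayerStyleSketch {
    static String styleFor(Map<String, String> layer) {
        StringBuilder style = new StringBuilder();
        // 'top' or 'left' requires absolute positioning.
        if (layer.containsKey("top") || layer.containsKey("left")) {
            style.append("position:absolute;");
        }
        // These attributes share their names with the CSS properties.
        for (String property : new String[] {"top", "left", "width", "height"}) {
            if (layer.containsKey(property)) {
                style.append(property).append(':').append(layer.get(property)).append(';');
            }
        }
        // 'bgcolor' and 'background' are combined into the shorthand property.
        String color = layer.get("bgcolor");
        String image = layer.get("background");
        if (color != null || image != null) {
            style.append("background:");
            if (color != null) style.append(color);
            if (color != null && image != null) style.append(' ');
            if (image != null) style.append("url(").append(image).append(')');
            style.append(';');
        }
        return style.toString();
    }
}
```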

The HTML Tree Corrector

The HTML Tree Corrector uses a post-order traversal of the document tree. If an HTML element's parent is invalid, or cannot contain the HTML element as a child, the child is moved to be a child of its grandparent, immediately following its parent. To facilitate this, static arrays of valid HTML elements and of the elements they can contain were added to the HTML element class and searched with the Java Arrays class binarySearch method. If the HTML element's parent was invalid, the parent is then deleted. The post-order traversal allows elements to be propagated up the tree - if an element cannot be a valid child of its parent, but it has no grandparent (that is, its parent is the <html> element), it is deleted - so, for example, a paragraph of text in the <head> section of the document is removed from the document tree entirely.

Changes after testing

Testing (described in detail in the Evaluation section) on examples of code including the elements and attributes described above showed the HTML Element Corrector worked as expected. However, when testing on a wider range of documents, one further example of a non-standard element was found, described in the next paragraph.

The <comment> element was invented by Microsoft for no apparent reason; syntax for comments already exists, and browsers display the contents of the <comment> element if they don't recognise it, making it less reliable than normal comments. This element can only have CData as a child, so the HTML Element Corrector replaces <comment> with a special section representing a comment with this element's CData child as the special section's CData.

With the HTML Element Corrector replacing elements correctly, and the HTML Tree Corrector removing any remaining invalid elements, the program could successfully output valid HTML Transitional.

3.3. Code to replace presentational elements with CSS

This section of the project involved extending the HTML Element Corrector to replace presentational elements with an equivalent involving cascading stylesheets - this took the form of a <div> or <span> element with its 'style' attribute set appropriately - and replacing presentational attributes by extending the value of the 'style' attribute with the appropriate properties. When deciding replacements, recommendations of the Web Accessibility Initiative were considered.

Presentational elements

Although <blink> can be replaced by the 'text-decoration:blink' property of cascading stylesheets, the Web Content Accessibility Guidelines advise: '7.1 Avoid causing the screen to flicker'. Because of this, <blink> elements are removed from the document tree and their children moved to be children of the <blink> elements' parents. For similar reasons, the <marquee> element is also removed.

The <b>, <center>, <i>, <nobr>, <s>, <strike> and <u> elements all have simple CSS equivalents:

  • <b> is equivalent to <span style="font-weight:bold;">
  • <center> is equivalent to <div style="text-align:center;">
  • <i> is equivalent to <span style="font-style:italic;">
  • <nobr> is equivalent to <div style="white-space:nowrap;">
  • <s> is equivalent to <span style="text-decoration:line-through;">
  • <strike> is equivalent to <span style="text-decoration:line-through;">
  • <u> is equivalent to <span style="text-decoration:underline;">

The <font> element has a number of attributes, all of which have direct equivalents in cascading stylesheets:

  • The 'size' attribute has a value from 1 to 7. In cascading stylesheets these correspond to the values for the 'font-size' property of 'xx-small', 'x-small', 'small', 'medium', 'large', 'x-large' and 'xx-large' - using a static array of these names makes replacement easy.
  • The 'face' attribute is equivalent to the cascading stylesheets 'font-family' property.
  • The 'color' attribute is equivalent to the cascading stylesheets 'color' property.
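The static array of size names makes the <font> conversion a simple lookup, as the sketch below suggests. The names are hypothetical, and relative sizes such as '+1' are not covered by the mapping above, so the sketch simply ignores any 'size' value other than a single digit from 1 to 7.

```java
// Hypothetical sketch of converting <font> attributes to a 'style' value.
class FontStyleSketch {
    // Index 0 corresponds to size="1", index 6 to size="7".
    private static final String[] SIZES = {
        "xx-small", "x-small", "small", "medium", "large", "x-large", "xx-large"
    };

    // Any argument may be null when the attribute is absent.
    static String styleFor(String size, String face, String color) {
        StringBuilder style = new StringBuilder();
        if (size != null && size.matches("[1-7]")) {
            style.append("font-size:").append(SIZES[Integer.parseInt(size) - 1]).append(';');
        }
        if (face != null) style.append("font-family:").append(face).append(';');
        if (color != null) style.append("color:").append(color).append(';');
        return style.toString();
    }
}
```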

Presentational attributes

The deprecated presentational attributes all have simple equivalents in cascading stylesheets.

  • The 'background' attribute is equivalent to the cascading stylesheets 'background-image' property.
  • The 'bgcolor' attribute is equivalent to the cascading stylesheets 'background-color' property.
  • The 'border' attribute of the <img> element is equivalent to the cascading stylesheets 'border-width' property.
  • The 'clear' attribute is equivalent to the cascading stylesheets 'clear' property.
  • The 'color' attribute is equivalent to the cascading stylesheets 'color' property.
  • The 'face' attribute is equivalent to the cascading stylesheets 'font-family' property.

3.4. Code to replace layout tables

A brief comparison of tables and <div>s

Before explaining the algorithms, it is necessary to briefly compare the structuring of data using tables and using <div>s.

A table consists of rows, indicated by <tr> elements, each of which contains cells, indicated by <td> elements. Tables can also contain elements such as <thead>, <tbody>, <tfoot>, and <th>; however, these are only used in true tables (those used for tabular data), and not the layout tables the project aims to replace.

Tables are made more complicated by the fact that cells can have a 'rowspan' (and are more than one cell high) and a 'colspan' (and are more than one cell wide). As a result, the first <td> element inside a <tr> element may not represent the first cell in a row, as a cell from a preceding row may extend into this row.

The <div> element is much simpler. A column of <div>s is made by placing the <div>s one after another, with their 'style' attributes set to 'width:100%' (making them the same width as their parent element):

<div>
  <div style="width:100%">Top div</div>
  <div style="width:100%">Middle div</div>
  <div style="width:100%">Bottom div</div>
</div>

A row of <div>s is created by placing the <div>s one after another, with their 'style' attributes setting their 'height' property to '100%' (making them the same height as their parent element) and setting their 'float' property to 'left' (making each <div> appear to the right of the previous one):

<div>
  <div style="height:100%;float:left">Left div</div>
  <div style="height:100%;float:left">Middle div</div>
  <div style="height:100%;float:left">Right div</div>
</div>

The divide and conquer algorithm

The HTML Table Corrector class uses a divide and conquer algorithm to convert tables to <div>s and cascading stylesheets. It divides the table into horizontal or vertical strips, which can be converted to <div>s and cascading stylesheets as described above. The algorithm is then called recursively on the remaining sections of the table, until only a single cell is left, which is replaced with a <div> whose children are the children of the original table cell. As an illustration, consider the following table:

It can be divided into two vertical strips:

The left strip can be divided into two horizontal strips, and the right strip can be divided into three horizontal strips:

The strip in the bottom right can be divided once more, at which point all strips represent single cells.

The table data structure

For the divide and conquer algorithm to work efficiently, an appropriate data structure representing tables is needed. The data structure chosen consists of a two-dimensional array whose width and height are equal to the width and height of the table (in terms of cells). Each entry in the array contains a reference to the document tree node containing the cell covering that position (cells whose 'rowspan' or 'colspan' are greater than 1 may cover multiple positions) and the cell's right and bottom boundaries. For example, consider the following table:

The data structure would represent it as:

The divide and conquer algorithm can use these values to determine where to split the table - for example, if all cells in a column have a 'right' value equal to the column's number, a split can be made after that column.
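The column test described above might be sketched as follows. To keep the example small it stores only the 'right' boundary of each position in a plain int array, which is a simplification of the data structure described, and it numbers columns from 0; both choices, and the names, are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of finding where a table can be split vertically.
class TableSplitSketch {
    // right[row][col] is the index of the last column covered by the cell
    // occupying position (row, col). A split is possible after a column when
    // every cell in that column ends there.
    static List<Integer> verticalSplits(int[][] right) {
        List<Integer> splits = new ArrayList<>();
        int width = right[0].length;
        for (int col = 0; col < width - 1; col++) {
            boolean allEndHere = true;
            for (int[] row : right) {
                if (row[col] != col) {
                    allEndHere = false;
                    break;
                }
            }
            if (allEndHere) splits.add(col);
        }
        return splits;
    }
}
```

For a table whose first row has a cell spanning columns 0 and 1 followed by a single cell, and whose second row has three single cells, the only possible vertical split is after column 1.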

Creating the data structure

It is easy for a human, given the diagram above, to create the structure that represents it. The HTML Table Corrector, however, needs to create the appropriate data structure directly from the document tree. This is done in two stages:

  1. Determine the width and height of the table in cells. The height of the table is equal to the number of <tr> elements it has as children. An array of integers representing row lengths is then created. For each <td> element its 'colspan' value (or 1 if the colspan isn't specified) is added to the length of appropriate rows (determined by the 'rowspan' attribute's value). The width of the table is equal to the length of the longest row.
  2. Determine the values to place in the data structure. The two-dimensional array is created with the width and height previously determined. References to each <td>, and their right and bottom boundaries can be placed in the array, making sure that a cell is shifted right if it would otherwise overlap a cell already entered into the array.

The appendix to this dissertation contains a listing of the HTML Table Corrector class demonstrating how the above stages are programmed.
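Independently of that listing, the two stages can be sketched in a reduced form. The sketch below is not the HTML Table Corrector: it takes each <td> as a {colspan, rowspan} pair instead of a tree node, records a cell number in each position rather than a node reference and boundaries, and returns null when a position is covered twice or not at all. All names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of building the table grid in two stages.
class TableGridSketch {
    // rows[r][i] = {colspan, rowspan} of the i-th <td> in the r-th <tr>.
    // Returns a grid of cell numbers (document order, from 0), or null if
    // the table is not well-formed.
    static int[][] build(int[][][] rows) {
        int height = rows.length;

        // Stage 1: add each colspan to the length of every row the cell covers.
        int[] lengths = new int[height];
        for (int r = 0; r < height; r++) {
            for (int[] cell : rows[r]) {
                for (int k = r; k < Math.min(height, r + cell[1]); k++) {
                    lengths[k] += cell[0];
                }
            }
        }
        int width = 0;
        for (int length : lengths) width = Math.max(width, length);

        // Stage 2: place cells, shifting right past positions already taken.
        int[][] grid = new int[height][width];
        for (int[] row : grid) Arrays.fill(row, -1);
        int id = 0;
        for (int r = 0; r < height; r++) {
            int col = 0;
            for (int[] cell : rows[r]) {
                while (col < width && grid[r][col] != -1) col++;
                for (int y = r; y < Math.min(height, r + cell[1]); y++) {
                    for (int x = col; x < col + cell[0]; x++) {
                        if (x >= width || grid[y][x] != -1) return null;   // overlap or overflow
                        grid[y][x] = id;
                    }
                }
                id++;
                col += cell[0];
            }
        }
        for (int[] row : grid) {
            for (int value : row) {
                if (value == -1) return null;   // undefined cell
            }
        }
        return grid;
    }
}
```

In a two-row table whose first cell has a 'rowspan' of 2, the only <td> of the second row is shifted right into the second column, which is the situation described earlier where the first <td> in a <tr> does not represent the first cell of the row.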

Changes after testing

Testing (described in detail in the Evaluation section) revealed two problems. Firstly, some well-formed tables cannot be replaced. The simplest example is the table at the top of the opposite page, which cannot be divided into horizontal or vertical strips.

This caused the original replacement algorithm to not terminate.

The second problem was that some tables are not well-formed, and have cells that are either undefined or defined multiple times (overlapping). For example, the following code suggests a table with no cell in the upper right, and two cells ('B' and 'C') overlapping at the intersection of the second row and column:

<table>
  <tr><td>A</td><td rowspan="2">B</td></tr>
  <tr><td colspan="2">C</td><td>D</td></tr>
</table>

Tables like this would cause the table replacement algorithm to throw an ArrayIndexOutOfBoundsException.

The algorithm cannot be altered to convert these two cases - in one case there is no equivalent using <div>s and cascading stylesheets, and in the other it is not clear what the author meant by their mark-up. The algorithm was altered to detect these situations and not replace the tables if they arose.

3.5. Code to replace frames

I knew from the start that some framesets couldn't be converted to <div>s and cascading stylesheets, as I had once created an example - a recursive frameset that resulted in interesting browser behaviour. Furthermore, some framesets can theoretically be converted, but actual conversion is totally impractical - for example, a frameset containing a link to an external site, which could lead to millions of possible page combinations.

The requirements analysis requires that frames be replaced with <div>s and cascading stylesheets where practical. The conversion is only practical where a very limited number of page combinations can occur. The algorithm used replaces frame sets where only one frame changes, and the other frames are either 'menu frames' (that change the contents of the changing frame) or 'title frames' (that do not change the contents of other frames).

The <div> structure

Deciding on the <div> structure to use is easier with framesets than with tables, as framesets are defined in terms of vertical and horizontal strips. The only complication comes from the fact that framesets allow some of the widths or heights to be set to an asterisk. This instructs the browser to share out remaining space equally between those frames. To replicate this with <div>s and cascading stylesheets, it is necessary to calculate the remaining width or height and how much each frame would get. For example, the value '*,50%,*' should result in frames of widths or heights of 25%, 50% and 25% respectively.
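The calculation for the asterisk case can be sketched as below. The sketch assumes every entry is either '*' or a percentage; pixel values and weighted entries such as '2*' are not handled, and the names are hypothetical.

```java
// Hypothetical sketch of resolving '*' entries in a frameset size list.
class FrameSizeSketch {
    // Converts a value such as "*,50%,*" into one percentage per frame.
    static double[] percentages(String spec) {
        String[] parts = spec.split(",");
        double used = 0;
        int stars = 0;
        for (String part : parts) {
            String p = part.trim();
            if (p.equals("*")) {
                stars++;
            } else {
                used += Double.parseDouble(p.substring(0, p.length() - 1));   // strip '%'
            }
        }
        // The remaining space is shared equally between the '*' frames.
        double share = stars == 0 ? 0 : Math.max(0, 100 - used) / stars;
        double[] result = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            String p = parts[i].trim();
            result[i] = p.equals("*")
                ? share
                : Double.parseDouble(p.substring(0, p.length() - 1));
        }
        return result;
    }
}
```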

Finding document combinations

The Frame Corrector uses the HTML parser to parse the contents of the initial frames. All the <a> elements are then extracted from these frames. The frames are not replaced if the 'target' attributes refer to more than one of the frames, or if any non-relative (that is, external) link has the 'target' attribute set to one of the frames. For all the documents linked to with the 'target' set to one of the frames, the same process is applied recursively. At the end of the process (which must terminate as there are only a finite number of possible internal links), if all the links meet the conditions above the frames can be replaced. A list of all possible documents that can occupy the frame that changes is produced by the recursive algorithm - all that is left is to alter the links and output the set of documents required. The documents are named based on the path to the current document in the changing frame, so that altering the links is easy - for example, the link:

<a href="philosophy/plato/republic.html" target="content">

Would be replaced by:

<a href="philosophy_plato_republic.html">
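Both the external-link test and the renaming can be sketched briefly. The names are hypothetical, and the sketch assumes that a link counts as external when it carries a URI scheme and that replacing path separators with underscores is all the renaming requires, as in the example above.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch of classifying and renaming frame links.
class FrameLinkSketch {
    // Treats a link as external when it has a scheme (for example "http:").
    static boolean isExternal(String href) {
        try {
            return new URI(href).isAbsolute();
        } catch (URISyntaxException e) {
            return true;   // unparseable links are treated as not convertible
        }
    }

    // Derives the flat document name from the relative path.
    static String flattenedName(String href) {
        return href.replace('/', '_');
    }
}
```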

Altering the cascading stylesheets

The use of cascading stylesheets complicates the replacement of frames, as each frame may use a different stylesheet. The Frame Corrector gives each <div> a class attribute, and then alters the stylesheets so that the rules only apply within the appropriate <div>. This does not require a full cascading stylesheets parser - much simpler regular expressions can be used to alter the rules appropriately to use descendant selectors.
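One way such a regular expression might work is sketched below. It is an assumption about the approach, not the Frame Corrector's code: it handles only simple 'selectors { declarations }' rules, ignores at-rules and comments, and does not treat selectors such as 'body' specially. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of scoping stylesheet rules to one <div> class.
class CssScopeSketch {
    // Group 1 is the selector list, group 2 the declarations.
    private static final Pattern RULE = Pattern.compile("([^{}]+)\\{([^{}]*)\\}");

    // Prefixes every selector with ".className " so that the rule applies
    // only to descendants of the <div> carrying that class.
    static String scope(String css, String className) {
        Matcher matcher = RULE.matcher(css);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            StringBuilder selectors = new StringBuilder();
            for (String selector : matcher.group(1).split(",")) {
                if (selectors.length() > 0) selectors.append(", ");
                selectors.append('.').append(className).append(' ').append(selector.trim());
            }
            String rule = selectors + " {" + matcher.group(2) + "}";
            matcher.appendReplacement(out, Matcher.quoteReplacement(rule));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```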

3.6. The interface

The requirements analysis does not require the project to have a Graphical User Interface, and a command-line interface would have been sufficient. However, a Graphical User Interface not only makes the program easier to use, but also aids in testing, for reasons detailed below. The interface was created immediately after the HTML Parser and Pretty Printer were implemented, and was used in their testing as well as in the testing of subsequent sections of the program.

The left-hand panel of the interface allows the user to select an HTML file to neaten. When the user clicks the 'Load and parse source' button, the file is parsed and its document tree displayed in a JTree below. The middle panel allows the user to choose which transformations will be performed. The right-hand panel allows the user to specify to which file to output the neatened version. Clicking on the 'Correct and save document' button performs the transformations and outputs the file, showing the new document tree in the JTree below.

This interface aids in testing - the first JTree can be used to test the HTML Parser independently from the Pretty Printer, and the middle panel options mean components of the project can be tested independently.

This article was last edited on 16th October 2007.