This page is part of Neatening HTML, a dissertation on the subject of the automated processing and correction of poor HTML code. It should be noted that the dissertation was written in early 2004, and so some of its content may no longer be relevant.
5. Conclusions
The project achieved its aims - it performs a variety of transformations on HTML to make it neater. The code it outputs is not perfect, however, and from the imperfections three conclusions can be drawn:
- The idea of automated HTML neatening can be taken further, with benefits for both cross-browser compatibility and accessibility.
- Automated HTML neatening, no matter how advanced, will never be the ideal solution; authors must be educated in how to produce good HTML.
- HTML itself is inadequate as a mark-up language, even with the advances in Cascading Stylesheets - a true logical mark-up language is needed, and only then can we hope for authors to produce true logical mark-up.
The project performs some advanced transformations that have not been attempted before, such as the replacement of tables used for layout purposes. This is, however, only the beginning of what may be possible. A document using cascading stylesheets for layout is better than one using tables, but many of the accessibility problems still remain - it may still not make sense linearised, as a user of a screen reader would perceive it. The project makes no attempt to deduce the logical ordering of elements, but this is a possible future direction - it does not necessarily require the comprehension of natural language, as much can be deduced from the mark-up alone. The project does not attempt to adjust colour schemes to make pages more readable to people with colour blindness, but with knowledge of colour perception it is possible to automatically adjust colours in this way while keeping them aesthetically pleasing.
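
As a rough illustration of the colour adjustment mentioned above, the sketch below darkens a foreground colour, keeping the proportions of its channels, until it contrasts sufficiently with the background. It is not something the project implements. The luminance and contrast formulas are borrowed from the later WCAG 2 guidelines, and the function names, the 4.5 threshold and the darkening strategy are all assumptions made for the example. It addresses luminance contrast only and does not model any particular form of colour blindness.

```python
def _linear(channel):
    # Convert an 8-bit sRGB channel to linear light (WCAG 2 definition).
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    r, g, b = (_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(a, b):
    # Ratio runs from 1 (identical luminance) to 21 (black against white).
    la, lb = relative_luminance(a), relative_luminance(b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


def darken_until_readable(fg, bg, target=4.5):
    # Hypothetical helper: scale the foreground towards black in 1% steps,
    # preserving the proportions of its channels, until the target is met.
    # Returns None when darkening alone cannot reach the target.
    for step in range(101):
        scale = 1 - step / 100
        candidate = tuple(round(v * scale) for v in fg)
        if contrast_ratio(candidate, bg) >= target:
            return candidate
    return None
```

For example, `darken_until_readable((200, 200, 200), (255, 255, 255))` returns a darker grey, while a colour that is already readable is returned unchanged. A fuller tool would also try lightening, and would need a real model of colour perception to judge whether the result is still aesthetically pleasing.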
A very different future direction would be to extend the project to cope with
DHTML. DHTML (Dynamic HTML) uses Javascript to alter documents after they have
been loaded. Some of the transformations performed by the project (for
example, replacing occurrences of the deprecated <layer> element with the
<div> element) can prevent old DHTML scripts from working.
In most cases, automated correction of these scripts is theoretically
possible. This would require a program to build a representation of the
Javascript code and perform transformations on it. Like HTML, Javascript has
problems with the common use of non-standard code and many usability
issues - a project to solve all of these problems would be as complex as this
project.
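
The <layer> example can be made concrete with a small sketch. It is not the project's own code: it uses a regular expression where the project works on a parsed document, and the function names are invented for the illustration. The first function renames the tags, leaving attributes untouched. The second only detects scripts that use the Netscape 4 `document.layers` collection, which is the kind of script the renaming is likely to break. Actually correcting such scripts would need the representation of the Javascript code described above.

```python
import re

# Matches the start of an opening or closing <layer> tag, in any case.
LAYER_TAG = re.compile(r"<(/?)layer\b", re.IGNORECASE)

# Old DHTML scripts typically reach layers through document.layers.
LAYERS_API = re.compile(r"\bdocument\.layers\b")


def replace_layers(html):
    # Rename <layer ...> and </layer> to <div ...> and </div>.
    # Layer-specific attributes such as left and top are NOT translated.
    return LAYER_TAG.sub(r"<\1div", html)


def scripts_at_risk(html):
    # True if the page appears to contain script that relies on <layer>.
    return bool(LAYERS_API.search(html))


page = '<LAYER id="menu">Menu</layer><script>document.layers["menu"].left = 10;</script>'
neatened = replace_layers(page)
```

Here `neatened` contains a <div> in place of the <layer>, but `scripts_at_risk(neatened)` is still true: the mark-up has been corrected and the script has not.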
The project has come across problems it could not solve. In some cases, a
solution, though not obvious, is still possible - for example, there have been
tables the project could not replace that I was able to replace by hand,
using a variety of tricks involving floated elements. In other cases
correction is not possible as it is unclear what the author intended. Authors
can produce code so bad that conceptually it does not make sense, although it
will display the way the author intended in their own browser. The only
solution is to educate the authors so that they are aware that standards exist
and what elements mean - perhaps then they will stop using
<blockquote> for indentation and elements specific to an
individual browser to control layout.
An automated HTML neatening program can only work with what the author gives
it. Suppose an author took a plain text document and put
<html><body> before it and
</body></html> after it. Without the comprehension of
natural language, it is impossible for a program to introduce appropriate
mark-up. Once again, the solution is to educate the authors, so they know that
elements exist for marking up emphasised text, headings, and the like.
There is one further problem with automated HTML neatening - it corrects the document, but not the ideas of the author who produced it. When the author next changes the document, new errors will be introduced. The solution, once again, is to educate authors - HTML neatening is useful as an aid for knowledgeable authors wishing to improve a large number of documents, but it is of little assistance to those who do not understand what it does and why.
Finally, there are problems with HTML itself, most of which are not solved in
the new XML-based XHTML. In the Implementation I mentioned tables and
framesets that could not be replaced, even theoretically - tables that defined
a cell multiple times or not at all, and recursive framesets that would cause
many browsers to stop responding. It is possible, through careful design, to
avoid these problems. A redesign of tables, remembering that they are designed
to structure tabular data, could solve the problems of ambiguously defined
cells. The most common use for framesets is to have a menu frame that doesn't
move when a contents frame scrolls - a general <navigation>
element would allow browsers to treat menus separately from the document,
yielding advances in usability.
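
The ambiguity in tables can be shown with a short sketch. It assumes a simplified placement model, roughly the way browsers lay cells out: each cell takes the first free column in its row and then covers `rowspan` by `colspan` slots of the grid. Parsing the HTML is left out, so a table is given as a list of rows, each a list of `(rowspan, colspan)` pairs. The function names are hypothetical and this is not the check the project itself uses.

```python
from collections import Counter


def slot_counts(rows):
    # Count how many cells claim each (column, row) slot of the grid.
    counts = Counter()
    for y, row in enumerate(rows):
        x = 0
        for rowspan, colspan in row:
            # Skip slots already claimed by a cell spanning down from above.
            while counts[(x, y)] > 0:
                x += 1
            for dy in range(rowspan):
                for dx in range(colspan):
                    counts[(x + dx, y + dy)] += 1
            x += colspan
    return counts


def table_problems(rows):
    # Return (slots defined more than once, slots never defined).
    counts = slot_counts(rows)
    height = len(rows)
    width = max((x for x, y in counts if y < height), default=-1) + 1
    overlapping = sorted(s for s, n in counts.items() if n > 1 and s[1] < height)
    missing = sorted(
        (x, y) for y in range(height) for x in range(width) if counts[(x, y)] == 0
    )
    return overlapping, missing


# Second cell of row 0 spans two rows; first cell of row 1 spans two columns.
overlap_example = [[(1, 1), (2, 1), (1, 1)], [(1, 2), (1, 1)]]
# Row 1 has one cell fewer than row 0.
gap_example = [[(1, 1), (1, 1)], [(1, 1)]]
```

In the first example the slot in the second column of the second row is claimed by two cells. In the second it is claimed by none. A redesigned table model could make both cases impossible to express.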
With the introduction of cascading stylesheets, it is common to see documents
where most of the elements are semantically empty - either the
<div> or <span> element. These tell us
that HTML is poor in terms of logical mark-up: it has mark-up for
definitions, but only in lists; it has mark-up for headings, but not true
sectioning; it has mark-up for paragraphs, but not for lines (in the sense of
poetry or programs).
In conclusion, an HTML neatening program tries to solve the problems caused by a poor mark-up language, poorly implemented in most browsers, and poorly understood by most authors.