Neatening HTML - Introduction

This page is part of Neatening HTML, a dissertation on the automated processing and correction of poor HTML code. The dissertation was written in early 2004, so some of its content may no longer be relevant.

1. Introduction

1.1. Motivation

A brief history of 'tag soup'

In the early 1990s, CERN employee Tim Berners-Lee created HTML, the Hypertext Mark-up Language, based on the 'SGMLguid' language then used at CERN [1]. Early HTML was designed for the mark-up of text only, and as the World Wide Web grew, browser vendors soon added their own tags for non-textual content, ignoring the SGML syntax: Mosaic (the ancestor of Netscape Navigator and Internet Explorer) added <img>, Sun Microsystems' HotJava browser added <applet>, and Netscape infamously added <blink>. This divergence made it difficult to write HTML that worked across a range of browsers.

Tim Berners-Lee set up the World Wide Web Consortium (W3C) [2] to standardise web technology. Although this has worked to some extent with technologies such as stylesheets, the W3C's HTML standards merely documented the current state of the language, arbitrarily deciding which company's new mark-up to declare standard rather than suggesting the way forward (the one exception was HTML 3.0, but that only reached draft status and never became a specification). This has led to the widespread use of browser-dependent features such as frames, the <applet> tag, and layers. Although some browser vendors are committed to standards (most notably the Mozilla Foundation [3]), other companies seem to encourage the use of non-standard and unreliable features by creating HTML editors that routinely insert incorrect mark-up (for example, try validating the results of Microsoft Word's 'Save As HTML' feature). Moreover, WYSIWYG editors treat HTML as a presentational, rather than logical, mark-up language, and use mark-up inappropriately, for example by using tables for layout. This has serious consequences for accessibility.

The W3C created the Web Accessibility Initiative (WAI) [4] to highlight common accessibility problems. It argued in its Web Content Accessibility Guidelines [5] that accessibility issues do not only affect disabled web users, and should concern everyone who produces web pages. It produced the Authoring Tool Accessibility Guidelines [6] to guide creators of HTML editors so that the web pages they produce would be accessible to those with disabilities. These guidelines go beyond the need to produce valid HTML, and cover areas such as avoiding tables for layout and using appropriate 'alt' (alternative) text with images (many editors default to using the image's file name as alternative text). Unfortunately, these guidelines have been ignored by the creators of most authoring tools.
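As a rough illustration of the file-name problem mentioned above, the following Python sketch flags <img> elements whose alt text is missing or merely repeats the image's file name. The names AltTextChecker and find_alt_problems are hypothetical and are not taken from the dissertation's software; the sketch assumes only Python's standard html.parser module.

```python
from html.parser import HTMLParser
from posixpath import basename, splitext
from urllib.parse import urlsplit


class AltTextChecker(HTMLParser):
    """Hypothetical checker: records <img> elements with suspicious alt text."""

    def __init__(self):
        super().__init__()
        self.problems = []  # list of (src, reason) tuples

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src") or ""
        alt = attributes.get("alt")
        # File name with and without its extension, e.g. "logo.gif" and "logo".
        file_name = basename(urlsplit(src).path)
        stem = splitext(file_name)[0]
        if alt is None:
            self.problems.append((src, "missing alt"))
            return
        text = alt.strip().lower()
        # An empty alt is left alone: it is legitimate for decorative images.
        if text and text in {file_name.lower(), stem.lower()}:
            self.problems.append((src, "alt is file name"))


def find_alt_problems(html):
    """Return a list of (src, reason) pairs for the given HTML string."""
    checker = AltTextChecker()
    checker.feed(html)
    checker.close()
    return checker.problems
```

A check like this can only detect the mechanical cases; judging whether alt text is actually appropriate would require understanding the image and the surrounding prose.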

The invention of cascading stylesheets (CSS) [7] allowed the separation of content and presentation, and their use leads to web pages that are not only easier to maintain but are much more concise without the 'tag soup' of presentational tags such as <font>. Five years after the release of the first CSS specification, some browsers' support for stylesheets is still basic (most notably, the popular Internet Explorer browser does not support some of the basic selectors), few HTML editors support them, and most web authors are unaware of them. To make matters worse, the most common use of stylesheets is to change hyperlink colours or remove hyperlink underlines, which causes further accessibility problems [8][9].

References

  • [1] The Early History Of HTML - http://infomesh.net/html/history/early/
  • [2] The World Wide Web Consortium - http://www.w3.org/
  • [3] The Mozilla Foundation - http://www.mozilla.org/
  • [4] The Web Accessibility Initiative - http://www.w3.org/WAI/
  • [5] Web Content Accessibility Guidelines - http://www.w3.org/TR/WCAG10/
  • [6] Authoring Tool Accessibility Guidelines - http://www.w3.org/TR/ATAG10/
  • [7] Cascading Stylesheets, Level Two - http://www.w3.org/TR/REC-CSS2
  • [8] Removing Link Underlines - http://www.safalra.com/hypertext/html/underlinelinks.html
  • [9] Link Colours - http://www.safalra.com/hypertext/html/linkcolours.html

1.2. Aims

The intention of this project is to produce software that, provided with an HTML document:

  • removes non-standard mark-up and features regarded as undesirable (for example, frames and tables used for layout purposes)
  • replaces purely presentational mark-up with stylesheets
  • corrects certain accessibility problems (those which do not require the software to comprehend natural language)
  • leaves the structure (and where possible the appearance) of the document as similar to the original document as possible.
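
To make the second aim concrete, here is a minimal Python sketch that rewrites <font> elements as <span> elements with an inline style, passing all other mark-up through. It is an illustration under simplifying assumptions rather than the project's actual method: the names FontToStyle and replace_font_tags are hypothetical, only the color and face attributes are handled, and inline styles are used where a real tool would more likely generate a shared stylesheet.

```python
from html import escape
from html.parser import HTMLParser


class FontToStyle(HTMLParser):
    """Hypothetical rewriter: turns <font> into <span style="...">."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.output = []

    def handle_starttag(self, tag, attrs):
        if tag != "font":
            # Any other start tag is copied through exactly as written.
            self.output.append(self.get_starttag_text())
            return
        attributes = dict(attrs)
        declarations = []
        if attributes.get("color"):
            declarations.append("color: " + attributes["color"])
        if attributes.get("face"):
            declarations.append("font-family: " + attributes["face"])
        if declarations:
            style = escape("; ".join(declarations), quote=True)
            self.output.append('<span style="%s">' % style)
        else:
            self.output.append("<span>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.output.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Note: the parser reports end tag names in lower case.
        self.output.append("</span>" if tag == "font" else "</%s>" % tag)

    def handle_data(self, data):
        self.output.append(data)

    def handle_entityref(self, name):
        self.output.append("&%s;" % name)

    def handle_charref(self, name):
        self.output.append("&#%s;" % name)

    def handle_comment(self, data):
        self.output.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.output.append("<!%s>" % decl)


def replace_font_tags(html):
    """Return html with <font> elements replaced by styled <span> elements."""
    rewriter = FontToStyle()
    rewriter.feed(html)
    rewriter.close()
    return "".join(rewriter.output)
```

Because everything except <font> is re-emitted as it was read, the structure of the document is left as close to the original as this simple approach allows, in the spirit of the final aim.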

1.3. Related work

Browser transformations, HTMLConverter and HTMLTidy

Many common browsers perform transformations when saving HTML documents; these are, however, very basic:

  • Internet Explorer changes the whitespace and capitalises the names of elements and attributes (which has the effect of making the HTML source code harder to read).
  • The Mozilla project's Firefox browser inserts optional end tags and implied elements (such as the <tbody> tag) based on its internal document tree.
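
The insertion of implied elements can be sketched in a few lines of Python. The example below adds an explicit <tbody> around table rows that are not already inside a row group. It is a simplified illustration and not Firefox's actual algorithm: the names ImpliedTbodyInserter and add_implied_tbody are hypothetical, and the sketch assumes that explicit row groups in the input have their end tags present.

```python
from html.parser import HTMLParser

ROW_GROUPS = ("thead", "tbody", "tfoot")


class ImpliedTbodyInserter(HTMLParser):
    """Hypothetical rewriter: makes implied <tbody> elements explicit."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.output = []
        # One entry per open <table>: None (no row group open),
        # "explicit" (author-written group) or "implied" (inserted here).
        self.tables = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append(None)
        elif self.tables and tag in ROW_GROUPS:
            if self.tables[-1] == "implied":
                self.output.append("</tbody>")
            self.tables[-1] = "explicit"
        elif self.tables and tag == "tr" and self.tables[-1] is None:
            # A row directly inside <table>: open the implied row group.
            self.output.append("<tbody>")
            self.tables[-1] = "implied"
        self.output.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.output.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.tables and tag in ROW_GROUPS:
            self.tables[-1] = None
        elif self.tables and tag == "table":
            # Close an implied row group before the table itself closes.
            if self.tables.pop() == "implied":
                self.output.append("</tbody>")
        self.output.append("</%s>" % tag)

    def handle_data(self, data):
        self.output.append(data)

    def handle_entityref(self, name):
        self.output.append("&%s;" % name)

    def handle_charref(self, name):
        self.output.append("&#%s;" % name)

    def handle_comment(self, data):
        self.output.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.output.append("<!%s>" % decl)


def add_implied_tbody(html):
    """Return html with implied <tbody> elements written out explicitly."""
    rewriter = ImpliedTbodyInserter()
    rewriter.feed(html)
    rewriter.close()
    return "".join(rewriter.output)
```

A browser derives the same result from its document tree rather than from a tag stream; the stack of open tables here stands in for that tree only as far as this one transformation needs.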

Most browsers also change the paths of links so that they still work when taken out of context, but this is not relevant to this project.

Sun Microsystems' HTMLConverter, part of the Java Software Development Kit, replaces occurrences of the deprecated <applet> element with the preferred <object> element, but then uses invalid syntax (the non-standard <embed> tag and the non-existent <comment> tag) in an attempt at backwards compatibility. Sun's intention appears to be not to improve the HTML, but to force web browsers to use Sun's Java plug-in rather than the browsers' built-in implementation of the Java bytecode interpreter. Attempts to force browser behaviour show a fundamental misunderstanding of the purpose of a cross-platform mark-up language such as HTML.

More advanced transformations are performed by W3C employee Dave Raggett's HTMLTidy [1] program. This corrects many common problems with hand-written HTML, such as mismatched tags, but its main aim is to improve the readability of HTML produced by humans; it does not perform the advanced transformations that this project aims to perform.

References

  • [1] HTMLTidy - http://www.w3.org/People/Raggett/tidy/
This article was last edited on 1st September 2007.