This page is part of Neatening HTML, a dissertation on the subject of the automated processing and correction of poor HTML code. It should be noted that the dissertation was written in early 2004, and so some of its content may no longer be relevant.
1. Introduction
1.1. Motivation
A brief history of 'tag soup'
In the early 1990s, CERN employee Tim Berners-Lee created HTML, the Hypertext
Mark-up Language, based on the 'SGMLguid' language then used at CERN [1]. Early
HTML was designed for the mark-up of text only, and once the World Wide Web
was created, browser vendors soon added their own tags for non-textual
content, ignoring the SGML syntax: Mosaic (the ancestor of Netscape Navigator
and Internet Explorer) added <img>, Sun Microsystems' HotJava browser added
<applet>, and Netscape infamously added <blink>. This divergence made it
difficult to write HTML that worked across a range of browsers.
Tim Berners-Lee set up the World Wide Web Consortium (W3C) [2] to standardise
web technology. Although this has worked to some extent with technologies such
as stylesheets, the W3C's HTML standards merely documented the current state
of the language, arbitrarily deciding which company's new mark-up to declare
standard rather than suggesting the way forward (the one exception was
HTML 3.0, but that only reached draft status and did not become a
specification). This has led to the widespread use of browser-dependent
features such as frames, the <applet> tag, and layers.
Although some browser vendors are committed to standards (most notably the
Mozilla Foundation [3]), other companies seem to encourage the use of
non-standard and unreliable features by creating HTML editors that routinely
insert incorrect mark-up (for example, try validating the results of Microsoft
Word's 'Save As HTML' feature). Moreover, WYSIWYG editors treat HTML as a
presentational, rather than logical, mark-up language, and use mark-up
inappropriately, such as using tables for layout purposes. This has serious
consequences for accessibility.
The W3C created the Web Accessibility Initiative (WAI) [4] to highlight common accessibility problems. It argued in its Web Content Accessibility Guidelines [5] that accessibility issues do not only affect disabled web users, and should concern everyone who produces web pages. It produced the Authoring Tool Accessibility Guidelines [6] to guide creators of HTML editors, so that the web pages their tools produce would be accessible to those with disabilities. These guidelines go beyond the need to produce valid HTML, and cover areas such as avoiding tables for layout purposes and using appropriate 'alt' (alternative) text with images (many editors default to using the image's file name as alternative text). Unfortunately, these guidelines have been ignored by the creators of most authoring tools.
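The file-name default mentioned above is one accessibility problem that can be detected mechanically. The following is a minimal sketch in Python, not code from the dissertation; the names AltTextChecker and find_alt_problems are hypothetical, and the check assumes that an 'alt' value identical to the last path segment of 'src' was inserted by an editor rather than written by the author.

```python
from html.parser import HTMLParser
import posixpath
from urllib.parse import urlparse


class AltTextChecker(HTMLParser):
    """Collects <img> elements whose alt text is missing or is just the file name."""

    def __init__(self):
        super().__init__()
        self.problems = []  # list of (src, reason) tuples, in document order

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src") or ""
        alt = attributes.get("alt")  # None if absent or given without a value
        file_name = posixpath.basename(urlparse(src).path)
        if alt is None:
            self.problems.append((src, "missing alt attribute"))
        elif file_name and alt.strip().lower() == file_name.lower():
            # Assumption: alt text equal to the file name is an editor default.
            self.problems.append((src, "alt text is the file name"))
        # An empty alt="" is left alone: it is the usual way to mark
        # a purely decorative image.


def find_alt_problems(html):
    """Return the list of suspicious images found in an HTML string."""
    checker = AltTextChecker()
    checker.feed(html)
    checker.close()
    return checker.problems
```

Deciding whether a given piece of alternative text is actually appropriate would require understanding the image and the surrounding prose, so a sketch like this can only flag the mechanical cases.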
The invention of cascading stylesheets (CSS) [7] allowed the separation of
content and presentation, and their use leads to web pages that are not only
easier to maintain but are much more concise without the 'tag soup' of
presentational tags such as <font>. Five years after the
release of the first CSS specification, some browsers' support for stylesheets
is still basic (most notably, the popular Internet Explorer browser does not
support some of the basic selectors), few HTML editors support them, and most
web authors are unaware of them. To make matters worse, the most common use of
stylesheets is to change hyperlink colours or remove hyperlink underlines,
which causes further accessibility problems [8][9].
References
- [1] The Early History Of HTML - http://infomesh.net/html/history/early/
- [2] The World Wide Web Consortium - http://www.w3.org/
- [3] The Mozilla Foundation - http://www.mozilla.org/
- [4] The Web Accessibility Initiative - http://www.w3.org/WAI/
- [5] Web Content Accessibility Guidelines - http://www.w3.org/TR/WCAG10/
- [6] Authoring Tool Accessibility Guidelines - http://www.w3.org/TR/ATAG10/
- [7] Cascading Stylesheets, Level Two - http://www.w3.org/TR/REC-CSS2
- [8] Removing Link Underlines - http://www.safalra.com/hypertext/html/underlinelinks.html
- [9] Link Colours - http://www.safalra.com/hypertext/html/linkcolours.html
1.2. Aims
The intention of this project is to produce software that, provided with an HTML document:
- removes non-standard mark-up and features regarded as undesirable (for example, frames and tables used for layout purposes)
- replaces purely presentational mark-up with stylesheets
- corrects certain accessibility problems (those which do not require the software to comprehend natural language)
- leaves the structure (and, where possible, the appearance) of the document as close to the original as possible.
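To make the second aim concrete, the sketch below rewrites <font> elements as <span> elements carrying equivalent CSS declarations. It is an illustration only, written in Python with hypothetical names (FontToStyle, replace_font_tags), and it is not the project's implementation: it emits inline 'style' attributes for brevity where a fuller tool would presumably generate a separate stylesheet, and it ignores the 'size' attribute, which has no one-to-one CSS equivalent.

```python
from html import escape
from html.parser import HTMLParser


class FontToStyle(HTMLParser):
    """Re-emits a document, replacing <font> with <span style="...">."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag != "font":
            self.out.append(self.get_starttag_text())  # copy other tags verbatim
            return
        attributes = dict(attrs)
        declarations = []
        if attributes.get("color"):
            declarations.append("color: " + attributes["color"])
        if attributes.get("face"):
            declarations.append("font-family: " + attributes["face"])
        # The 'size' attribute is deliberately ignored in this sketch.
        style = "; ".join(declarations)
        if style:
            self.out.append('<span style="%s">' % escape(style, quote=True))
        else:
            self.out.append("<span>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        self.out.append("</span>" if tag == "font" else "</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def replace_font_tags(html):
    """Return the HTML string with <font> elements converted to styled spans."""
    converter = FontToStyle()
    converter.feed(html)
    converter.close()
    return "".join(converter.out)
```

For example, replace_font_tags('<font color="red">Stop</font>') yields '<span style="color: red">Stop</span>'. The element structure is unchanged, which is in keeping with the final aim in the list.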
1.3. Related work
Browser transformations, HTMLConverter and HTMLTidy
Many common browsers perform transformations when saving HTML documents; these are, however, very basic:
- Internet Explorer changes the whitespace and capitalises the names of elements and attributes (which has the effect of making the HTML source code harder to read).
- The Mozilla project's Firefox browser inserts optional end tags and implied elements (such as the <tbody> tag) based on its internal document tree.
Most browsers also change the paths of links so that they still work when taken out of context, but this is not relevant to this project.
Sun Microsystems' HTMLConverter, part of the Java 2 Software Development Kit,
replaces occurrences of the deprecated <applet> element
with the preferred <object> element, but then uses invalid
syntax (the non-standard <embed> tag and the non-existent
<comment> tag) in an attempt at backwards compatibility.
Apparently Sun's intention is not to improve the HTML, but to try to force web
browsers to use Sun's Java plug-in rather than the browsers' built-in
implementation of the Java bytecode interpreter. Attempts to force browser
behaviour show a fundamental misunderstanding of the purpose of a
cross-platform mark-up language such as HTML.
More advanced transformations are performed by W3C employee Dave Raggett's HTMLTidy [1] program. This corrects many common problems with hand-written HTML, such as mismatched tags, but its main aim is to improve the readability of HTML produced by humans – it does not perform the advanced transformations this project aims to perform.
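As an illustration of what 'mismatched tags' means in practice, the sketch below detects end tags that do not close the most recently opened element. It is written in Python with hypothetical names (MismatchFinder, find_mismatches) and is not a description of how HTMLTidy works; it only detects the problem, uses an abbreviated list of empty elements, and makes no attempt to report elements whose end tags are omitted.

```python
from html.parser import HTMLParser

# Elements that never take an end tag (an abbreviated, assumed list).
VOID_ELEMENTS = {"area", "base", "br", "col", "hr", "img",
                 "input", "link", "meta", "param"}


class MismatchFinder(HTMLParser):
    """Records end tags that do not match the innermost open element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open element names
        self.mismatches = []  # list of (expected, found) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # an XHTML-style self-closing tag opens nothing

    def handle_endtag(self, tag):
        expected = self.open_tags[-1] if self.open_tags else None
        if tag == expected:
            self.open_tags.pop()
            return
        self.mismatches.append((expected, tag))
        # If the element is open further down the stack, close everything
        # above it so that later end tags are judged sensibly.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


def find_mismatches(html):
    """Return (expected, found) pairs for every out-of-order end tag."""
    finder = MismatchFinder()
    finder.feed(html)
    finder.close()
    return finder.mismatches
```

For the input '<p><b><i>text</b></i></p>' this reports that </b> arrived while <i> was innermost, and then that </i> arrived while <p> was innermost. A correcting tool would go on to reorder or insert end tags; that step is not attempted here.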
References
- [1] HTMLTidy - http://www.w3.org/People/Raggett/tidy/