7/07/2012

HTML

To start off my series of web development posts, let's have a quick look at basic HTML.

Simply put, the HyperText Markup Language is what drives the web, and it has done so for the last 20 years. You can find an archived 1992 version of the very first website at W3C, as well as screenshots of the line-mode browser and the WorldWideWeb application for NeXTStep (a descendant of UNIX and BSD and an ancestor of Darwin and Mac OS X) at CERN's website. So, what do you see? Ugly, plain text including some (hyper)links. Now have a look at Facebook, the world's second most visited website (after Google). Obviously there has been some development.

So let's write a simple profile page for John Doe following the early HTML drafts. You may open up a text editor and paste in the following lines:

<title>John Doe</title>
<h1>John Doe</h1>
<h2>Hobbies</h2>
<ul>
  <li><a href=http://en.wikipedia.org/wiki/Fishing>Fishing</a></li>
  <li>Cars</li>
</ul>
<h2>Employer</h2>
<a href=http://www.defense.gov/>United States Department of Defense</a>
<h2>Bio</h2>
<p>Hi, my name's John Doe. I'm working for DOD, the world's biggest employer.</p>

In contrast to plain text, HTML introduces tags, defined in a meta language called SGML. A tag is a pre-defined keyword enclosed in angle brackets. If it has any content, it is terminated with a corresponding closing tag written as </tagname>. Whitespace around tags does not matter here - you do not need the indentation and may even write everything in a single line. Also, tag names are case-insensitive.
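To illustrate the last two points, here is a sketch of the first half of the same page written in a single line with upper-case tag names. A tolerant browser should render it the same way as the indented, lower-case version:

```html
<TITLE>John Doe</TITLE><H1>John Doe</H1><H2>Hobbies</H2><UL><LI><A HREF=http://en.wikipedia.org/wiki/Fishing>Fishing</A></LI><LI>Cars</LI></UL>
```

It is harder to read, though, which is why the indented form is used throughout this post.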

Save the text e. g. as doe.html and open the file in your browser. Most likely it will work as intended - not because modern browsers support the early drafts, but because everything used here is still valid in every later standard, and because browsers are quite tolerant of the structure that is missing by today's rules. A quick explanation of all the tags used:

  • title: the page title shown in your tab or window title
  • h1: first level headline
  • h2: second level headline
  • ul: unordered list
  • li: list item
  • p: paragraph
  • a: anchor

Those are all from the original draft and still the most relevant. As you can see, elements may be nested: the list holds list items and the first list item holds an anchor. Anchors are the most important idea in hypertext, since they enable all the hyperlinking. In our example, the <a> tag has an attribute called href, which holds the URL that the tag's content should link to.

Let's go forward to the first standard: HTML 2.0 from November 1995.

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
  <head>
    <title>John Doe</title>
  </head>
  <body>
    <h1>John Doe</h1>
    <img src="doe.jpg" alt="John Doe">
    <h2>Hobbies</h2>
    <ul>
      <li><a href="http://en.wikipedia.org/wiki/Fishing">Fishing</a></li>
      <li>Cars</li>
    </ul>
    <h2>Employer</h2>
    <a href="http://www.defense.gov/">United States Department of Defense</a>
    <h2>Bio</h2>
    <p>Hi, my name's John Doe. I'm working for DOD, the world's biggest employer.</p>
  </body>
</html>

We now have to include a doctype, stating which standard we claim to conform to. The complete document is wrapped in a root element called <html>, which holds exactly two other new elements: <head>, holding all the metadata, and <body>, holding the actual content. Attribute values are enclosed in quotes (single or double), and we added a new element called <img>. As the name suggests, it includes an image, whose source is the relative path doe.jpg. So get an arbitrary photo of John Doe and save it as doe.jpg in the same folder as the HTML file.

January 1997: HTML 3.2. Microsoft Internet Explorer and Netscape Navigator both implemented more and more additional features like tables and embedded objects, which got merged into the standard. Back then, the web looked like this. At the end of the year, HTML 4.0 was published and - after a reissue and a slight extension called HTML 4.01 in the following two years - is still the most recent HTML standard today.

The most important addition in HTML 4 is frames: a frameset defines a grid the view is split into and the resource to be displayed in each cell of that grid. Today, frames are no longer used for structuring a website. Only some sites like URL shorteners use them to display their own toolbar above the target site, and the <iframe> tag, which embeds a third-party website into one's own document, is heavily used by things like social plugins.
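As a rough sketch of what such a frameset document looked like, the following splits the view into a narrow menu column and a content column. The file name menu.html and the frame names are made up for this illustration; doe.html is the profile page from above:

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
  "http://www.w3.org/TR/html4/frameset.dtd">
<html>
  <head>
    <title>John Doe</title>
  </head>
  <!-- Two columns: 20% for a menu, 80% for the content -->
  <frameset cols="20%,80%">
    <frame src="menu.html" name="menu">
    <frame src="doe.html" name="content">
    <!-- Fallback for browsers that do not support frames -->
    <noframes>
      <body>
        <p><a href="doe.html">John Doe's profile</a></p>
      </body>
    </noframes>
  </frameset>
</html>
```

Note that a frameset document has no <body> of its own; the <frameset> element takes its place.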

Another major change in HTML 4 was the awareness of style sheets. The first CSS standard had already been around for a year, so HTML 4 deprecated some presentational elements in favor of style sheets, such as <font> (defining a font to use), <center> (centering the content) and <u> (underlining the content). The necessary steps had already been taken in HTML 3.2, which introduced style sheet related markup "for future use". I'll have a look at the use and importance of CSS in the next post.
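To give a first impression of the difference, here is a small fragment showing the same headline twice, once with deprecated presentational elements and once with a style sheet. The color and size values are arbitrary choices for this sketch, the two variants look only roughly alike, and in a complete document the <style> element would belong in the <head>:

```html
<!-- Deprecated: the presentation is expressed through markup -->
<center><font color="red" size="5"><u>John Doe</u></font></center>

<!-- HTML 4 style: the markup says what the content is,
     the style sheet says how it should look -->
<style type="text/css">
  h1 { text-align: center; color: red; text-decoration: underline; }
</style>
<h1>John Doe</h1>
```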

The result of these two important changes was that HTML 4 included not one, but three document type definitions: strict, the recommended default, which does not allow deprecated tags; transitional, which allows the use of deprecated (presentational) tags; and frameset, which is used solely for defining framesets.
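For reference, these are the doctype declarations for the three HTML 4.01 variants. A document starts with exactly one of them; they are only listed together here for comparison:

```html
<!-- Strict -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">

<!-- Transitional -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
  "http://www.w3.org/TR/html4/loose.dtd">

<!-- Frameset -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
  "http://www.w3.org/TR/html4/frameset.dtd">
```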


As I already mentioned, HTML 4 is still the most recent HTML standard. This of course does not count the two finished XHTML standards or the already widely adopted HTML5, which is still in draft status. XHTML 1.0 from January 2000 is a response to the diversity of HTML in practice. The problem is that the following is a perfectly valid HTML 4 fragment:

<UL>
  <li><a href='http://en.wikipedia.org/wiki/Fishing'>Fishing</a>
  <li>Cars</LI>
</uL>
<H2>Employer</H2>
<a href="http://www.defense.gov/">United States Department of Defense</A>

Single quotes, double quotes, case-insensitivity, an omitted </li> tag... Therefore W3C, the World Wide Web Consortium, redefined HTML 4.01 in the popular Extensible Markup Language (XML). XML is a subset of SGML, the meta language HTML has been defined in, and in comparison quite simple and restrictive. I'll have a look at XML itself in a later post.

The most important restrictions: All tags need to be closed. In the case of tags without content, like our <img> tag, self-closing tags are used: <img src="doe.jpg" alt="John Doe"/>. Element names are case-sensitive, and in the case of XHTML they are all lowercase. Every attribute has a value and attribute values are set in quotes. For example, HTML 4 allows <img src="doe.jpg" alt="John Doe" ismap> and <img src="doe.jpg" alt="John Doe" ismap=ismap>, while in XHTML it has to be <img src="doe.jpg" alt="John Doe" ismap="ismap"/>.
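Applied to the profile page, these restrictions might lead to something like the following sketch in XHTML 1.0 Strict. The xmlns and language attributes on <html> are additions required or recommended by XHTML, and the image and the employer link are wrapped in <p> here because the strict type does not allow them directly inside <body>:

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>John Doe</title>
  </head>
  <body>
    <h1>John Doe</h1>
    <!-- Empty element: closed with a trailing slash -->
    <p><img src="doe.jpg" alt="John Doe" /></p>
    <h2>Hobbies</h2>
    <ul>
      <!-- Every <li> is closed explicitly, all names are lowercase -->
      <li><a href="http://en.wikipedia.org/wiki/Fishing">Fishing</a></li>
      <li>Cars</li>
    </ul>
    <h2>Employer</h2>
    <p><a href="http://www.defense.gov/">United States Department of Defense</a></p>
    <h2>Bio</h2>
    <p>Hi, my name's John Doe. I'm working for DOD, the world's biggest employer.</p>
  </body>
</html>
```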

This not only clarifies things, but in theory also makes browser development easier, since parsing may be done with any XML parser implementation instead of custom HTML-specific solutions. But in practice this is not an argument, since whatever crap you hand to a browser, it always tries to make some sense of it and finally shows some webpage. An XML parser, in contrast, stops at the first well-formedness error, and therefore browsers did not make use of this property for ordinary web pages.

XHTML 1.0 is a redefinition of HTML 4.01 with equivalent strict, transitional and frameset types. XHTML 1.1 contains some small changes, but most notably drops the last two and defines only one single (strict) type again.

A completely new take is HTML5, which reintegrates classic HTML and XHTML and defines a bunch of new features, e. g. for multimedia applications, explicit semantics, local storage and drawing. HTML5 is no longer based on SGML, but is designed to be backwards-compatible. It also includes an XHTML5 dialect, which is syntactically quite like XHTML, i. e. lower-case tags, every tag closed and so on. In the HTML syntax, on the other hand, tag names are case-insensitive again, void tags like <img> do not need explicit closing (although XML-style self-closing tags are allowed), and some closing tags may even be left out again. HTML5 has an SGML-style doctype that does not state any revision: <!doctype html>. For the XHTML5 dialect the doctype is the same, but optional. The main difference is that XHTML5 documents need to be served as type application/xhtml+xml or application/xml instead of text/html.
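A minimal sketch of the profile page in the HTML syntax of HTML5 could look like this. The lang attribute, the <meta charset> line and the <section> elements are my additions to show the short doctype and the more explicit semantics; the closing </li> tags are left out on purpose, which HTML5 permits:

```html
<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>John Doe</title>
  </head>
  <body>
    <h1>John Doe</h1>
    <!-- Void element: no closing tag or trailing slash needed -->
    <img src="doe.jpg" alt="John Doe">
    <section>
      <h2>Hobbies</h2>
      <ul>
        <li><a href="http://en.wikipedia.org/wiki/Fishing">Fishing</a>
        <li>Cars
      </ul>
    </section>
    <section>
      <h2>Employer</h2>
      <a href="http://www.defense.gov/">United States Department of Defense</a>
    </section>
    <section>
      <h2>Bio</h2>
      <p>Hi, my name's John Doe. I'm working for DOD, the world's biggest employer.</p>
    </section>
  </body>
</html>
```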

As already mentioned, HTML5 is still not an approved standard, but it is already widely adopted. The same is true for many other web standards, since the web evolves much faster than any standards committee could possibly keep up with.

One last word: When producing HTML, try to be standards-compliant. It may save you a lot of trouble, which I'll come back to in the browser section. Nice validators are provided by W3C and Validome.

Next, I'll have a look at CSS and give John Doe a little style.
