Previous section: KGV and other validators
Next section: KGV's error messages
Up to main index

[see parent document for copyright information]


Section 2: HTML and SGML

What is SGML?

SGML stands for Standard Generalized Markup Language. The name is a slight misnomer, since SGML is really a meta-language: that is, a language for writing markup languages. HTML is a markup language written in SGML, or an "SGML application", to use the terminology.

You don't actually have to know much about SGML to use KGV successfully. If you're interested, though, our users have suggested Shoreline's "A Guide to SGML" as a starting point. Additional SGML resources can be found on the SGML Open Home Page.

What is a DTD?

For our purposes, a DTD, or Document Type Definition, is simply a file that defines the syntax of an SGML-based language. The DTD's for HTML 2.0 and HTML 3.0 are written by the HTML Working Group of the Internet Engineering Task Force (IETF), in collaboration with the W3 Consortium. In addition, the folks at WebTechs have written DTD's which attempt to reproduce the behaviors of Netscape's Mozilla browser and Sun's HotJava browser. (Note that these DTD's are not endorsed by the respective companies, and are not guaranteed to completely mimic the behaviors of the respective browsers.)

What is this DOCTYPE thing KGV keeps pestering me for?

DOCTYPE is an SGML document type declaration. Its purpose is to tell an SGML parser what DTD it should use to parse the document. It appears as the first line of the document, and has the form:

    <!DOCTYPE HTML PUBLIC "quoted string">

The quoted string is called a public identifier; it refers to the desired DTD by a "well-known" name, usually defined by an associated standard.

Now, most Web browsers don't actually use an SGML parser (in fact, none that I'm aware of do), and so they don't need a DOCTYPE declaration, and will ignore it if present. KGV, however, does use an SGML parser, and therefore needs a DOCTYPE declaration. KGV is more insistent on this point than WebTechs, which would insert a DOCTYPE on the fly for you; KGV requires that your DOCTYPE already be in the document.

So now you're preparing to add a DOCTYPE to your document. Be sure that the syntax is as described above, and that you use the correct public identifier; otherwise, KGV will use the wrong DTD, or will be unable to find a DTD at all, and will produce a huge list of absolutely meaningless errors. KGV's public identifier catalog lists all the public identifiers KGV recognizes for various types of HTML; of those, the following public identifiers are most likely to be widely recognized:

    "-//IETF//DTD HTML 2.0//EN"                         for HTML 2.0
    "-//IETF//DTD HTML 3.0//EN"                         for HTML 3.0
    "-//W3C//DTD HTML 3.2//EN"                          for HTML 3.2
    "-//Netscape Comm. Corp.//DTD HTML//EN"             for Netscape
    "-//Microsoft//DTD Internet Explorer 2.0 HTML//EN"  for MSIE

Note that the string must appear exactly as shown, including case.
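
If you'd like to sanity-check a DOCTYPE before submitting a page, something along the following lines may help. This is only a rough sketch in Python, not how KGV itself works: the function names are made up for illustration, the table holds just the five public identifiers listed above (KGV's real catalog is larger), and the regular expression assumes the DOCTYPE is the first thing in the document.

```python
import re

# The five public identifiers listed above; KGV's own catalog has more.
KNOWN_PUBLIC_IDS = {
    "-//IETF//DTD HTML 2.0//EN": "HTML 2.0",
    "-//IETF//DTD HTML 3.0//EN": "HTML 3.0",
    "-//W3C//DTD HTML 3.2//EN": "HTML 3.2",
    "-//Netscape Comm. Corp.//DTD HTML//EN": "Netscape",
    "-//Microsoft//DTD Internet Explorer 2.0 HTML//EN": "MSIE",
}

# Keywords are matched case-insensitively, but the quoted string is
# captured exactly as written, so the lookup below is case-sensitive.
DOCTYPE_RE = re.compile(r'\s*<!DOCTYPE\s+HTML\s+PUBLIC\s+"([^"]*)"',
                        re.IGNORECASE)


def find_public_id(document):
    """Return the public identifier of a leading DOCTYPE, or None."""
    match = DOCTYPE_RE.match(document)
    return match.group(1) if match else None


def describe_doctype(document):
    """Say which flavor of HTML the DOCTYPE names, if it is a known one."""
    public_id = find_public_id(document)
    if public_id is None:
        return "no DOCTYPE found at the start of the document"
    flavor = KNOWN_PUBLIC_IDS.get(public_id)
    if flavor is None:
        return "unrecognized public identifier: " + public_id
    return flavor
```

Because the lookup is an exact string match, an identifier that differs only in case (say, "-//w3c//dtd html 3.2//en") comes back as unrecognized, which mirrors the note above.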

[Editor's note: To be pedantic, the "official" public identifier for HTML 3.0, according to the catalog from the HTML standard, is:

    "-//W3C//DTD HTML 3 1995-03-24//EN"

KGV's catalog includes this public identifier, but WebTechs' public identifier catalog doesn't; if you use this public identifier, then, your document will fail to validate under WebTechs.]

Future plans for KGV include the addition of DTD's that incorporate recent IETF work on HTML, such as the recent table draft and the proposed <EMBED> element.

WARNING: Some HTML editors will insert a DOCTYPE declaration for you. Unfortunately, this pre-inserted DOCTYPE will sometimes confuse KGV. This usually occurs in one (or both) of two ways:

If your editor adds a DOCTYPE to your page, you may need to correct it as described above before running your page through KGV.

My document uses _______. Which DOCTYPE should I use?

In SGML, a document can only use the elements and attributes defined in its associated DTD. If your document uses vendor-specific or recently standardized features, then, you must use a DTD that defines the necessary elements.

Nowadays, you will probably want to use one of the 4.0 DTD's; which one depends on what features you're using. For frameset documents or documents that use frame-related elements or attributes like <A TARGET>, you will want to use the 4.0 Frameset DTD (hence the name... ;)):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN"
        "http://www.w3.org/TR/REC-html40/frameset.dtd">

For documents using any of the more common deprecated Netscape/MSIE presentational elements or attributes like <FONT>, you'll need 4.0 Transitional:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
        "http://www.w3.org/TR/REC-html40/loose.dtd">

Ideally, you'll want to use 4.0 Strict:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"
        "http://www.w3.org/TR/REC-html40/strict.dtd">

with presentational aspects specified in a style sheet. If your document gets unrecognized element or attribute errors with a 4.0 Strict DOCTYPE, though, this usually means you need to use 4.0 Transitional.
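
The rule of thumb above can be written down as a small decision function. The following Python sketch is an illustration only: the function name is hypothetical, and the element and attribute lists are short, deliberately incomplete samples chosen for this example rather than anything extracted from the DTD's. The Strict DOCTYPE string follows the same pattern as the Frameset and Transitional ones.

```python
# Illustrative, incomplete samples (assumptions, not read from the DTD's).
FRAME_ELEMENTS = {"FRAMESET", "FRAME", "NOFRAMES"}
FRAME_ATTRIBUTES = {"TARGET"}
PRESENTATIONAL_ELEMENTS = {"FONT", "CENTER", "BASEFONT"}

DOCTYPES = {
    "Frameset": '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN"\n'
                '        "http://www.w3.org/TR/REC-html40/frameset.dtd">',
    "Transitional": '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"\n'
                    '        "http://www.w3.org/TR/REC-html40/loose.dtd">',
    "Strict": '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"\n'
              '        "http://www.w3.org/TR/REC-html40/strict.dtd">',
}


def suggest_flavor(elements, attributes=()):
    """Pick a 4.0 flavor from the element and attribute names a page uses."""
    elements = {name.upper() for name in elements}
    attributes = {name.upper() for name in attributes}
    # Frame-related markup takes priority, as in the text above.
    if elements & FRAME_ELEMENTS or attributes & FRAME_ATTRIBUTES:
        return "Frameset"
    # Deprecated presentational elements need Transitional.
    if elements & PRESENTATIONAL_ELEMENTS:
        return "Transitional"
    return "Strict"
```

For example, suggest_flavor(["p", "font"]) returns "Transitional", and DOCTYPES["Transitional"] is the declaration to paste at the top of the page. Treat the answer as a starting guess; the validator's own error list is the real test.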

What's the correct syntax for HTML comments? (or: Why does my document come up blank on some browsers?)

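This is unfortunately a complex subject, not only because HTML inherited a rather weird comment syntax from SGML (where comments are actually considered a kind of markup declaration), but because so many browsers get it spectacularly wrong. First, here's the full definition of an SGML comment declaration:

A comment declaration consists of `<!', followed by zero or more comments, followed by `>'. Each comment starts with `--' and includes all text up to and including the next occurrence of `--'. White space is allowed after each comment, but not before the first comment.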

Here are some examples:

    <!-- A simple comment declaration -->

    <!-- An equally simple comment declaration --   >

    <!-- A comment declaration --        -- with more
      than one --
      -- comment in it --
    >

    <!-- This `>' is inside a comment, and so doesn't close
         the comment declaration; the next one does, because it's
         not inside a comment. -->

    <!>

And here are some examples of incorrect comments:

    <!-- This comment is not terminated >

    <!-- Neither is this one ->

    <!-- -- Text is not allowed between comments -- --  >

    <! The initial `--' is not optional >

    <! -- Nor is whitespace allowed before it -->

    <!------>

That last one is particularly pernicious, since many people use variants of it as "dividing bars" in their HTML source. Why is it invalid? Note that it has six `-' characters, or three `--' pairs; the first pair opens a comment, the second pair closes it, and the third pair opens a second comment. Thus, the `>' that appears to close the comment declaration doesn't, because it's inside a comment; and the comment declaration proceeds merrily along, unterminated and out of control. Now, if there had been eight `-' characters (or any other multiple of four), everything would have been fine, since each comment in the comment declaration would have been closed.
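
The pair-counting argument is easy to check mechanically. Here is a minimal Python sketch of a comment declaration scanner, assuming the usual SGML rule: `<!', then zero or more `--'-delimited comments with white space allowed only after a comment, then `>'. The function names are invented for this example, and the scanner looks only at the string it is given, whereas a real parser would keep reading the rest of the document in search of a closing `--'.

```python
def scan_comment_declaration(text):
    """Return the index just past the closing '>', or None if the
    declaration is malformed or never terminated within text."""
    if not text.startswith("<!"):
        return None
    i = 2
    seen_comment = False
    while i < len(text):
        if text.startswith("--", i):
            # Opening '--': the comment runs through the next '--'.
            end = text.find("--", i + 2)
            if end == -1:
                return None          # comment opened but never closed
            i = end + 2
            seen_comment = True
        elif text[i] == ">":
            return i + 1             # '>' outside any comment: done
        elif text[i].isspace() and seen_comment:
            i += 1                   # white space allowed after a comment
        else:
            return None              # stray text, or space before first comment
    return None


def is_valid_comment_declaration(text):
    """True if text is exactly one well-formed comment declaration."""
    return scan_comment_declaration(text) == len(text)
```

With this, is_valid_comment_declaration("<!------>") is false (six dashes leave a comment open), while the same call with eight dashes is true.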

And just to throw a monkey wrench into the works, many popular browsers have ad hoc (and thoroughly broken) comment parsers. Notably, various versions of Mosaic and Netscape (which share a great deal of parsing code) incorrectly parse some or all of the incorrect comment examples above as correct, and some of the correct comment examples above as incorrect --- and which ones they get right and wrong varies from version to version.

If you want your comments to work on both broken and non-broken browsers, the safest thing to do is to restrict yourself to the following subset of legal comment syntax that all browsers (as far as I know) will handle correctly: comments begin with `<!--', end with `-->', and may not contain `--' or `>'.
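
A check for this safe subset takes only a few lines. The Python sketch below is one possible reading of the rule, with a hypothetical function name; it is a little stricter than the sentence above in that it also rejects a `-' directly next to either delimiter, since that would form an extra `--' with the delimiter's own dashes.

```python
def is_portable_comment(text):
    """True if text is a comment in the conservative subset:
    starts with '<!--', ends with '-->', no '--' or '>' in between."""
    if len(text) < 7 or not (text.startswith("<!--") and text.endswith("-->")):
        return False
    inner = text[4:-3]
    if "--" in inner or ">" in inner:
        return False
    # A dash touching either delimiter would create an overlapping '--'.
    return not (inner.startswith("-") or inner.endswith("-"))
```

Note that this deliberately rejects some comments that are legal SGML, such as one containing a `>' character, because the point is portability rather than validity.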



Last update 01 Sep 2010

dsb@killerbunnies.org