Previous section: KGV and other validators
Next section: KGV's error messages
Up to main index
[see parent document for copyright information]
SGML stands for Standard Generalized Markup Language. The name is a slight misnomer, since SGML is really a meta-language (that is, a language for writing markup languages). HTML is a markup language written in SGML, or an "SGML application", to use the terminology.
You don't actually have to know much about SGML to use KGV successfully. If you're interested, though, our users have suggested Shoreline's "A Guide to SGML" as a starting point. Additional SGML resources can be found on the SGML Open Home Page.
What is this DOCTYPE thing KGV keeps pestering me for?
A DOCTYPE is an SGML document type declaration. Its purpose is to tell an SGML parser what DTD it should use to parse the document. It appears as the first line of the document, and has the form:
<!DOCTYPE HTML PUBLIC "quoted string">
The quoted string is called a public identifier; it refers to the desired DTD by a "well-known" name, usually defined by an associated standard.
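As a rough illustration of the form shown above, here is a minimal Python sketch that pulls the public identifier out of a document's first line. The function name and the regular expression are my own assumptions for illustration; a real SGML parser (such as the one KGV uses) handles far more syntax than this.

```python
import re

# Hypothetical helper: matches only the simple form
#   <!DOCTYPE HTML PUBLIC "quoted string">
# at the very start of the document.
DOCTYPE_RE = re.compile(r'<!DOCTYPE\s+HTML\s+PUBLIC\s+"([^"]*)"\s*>')

def public_identifier(document):
    """Return the public identifier from a leading DOCTYPE, or None."""
    match = DOCTYPE_RE.match(document)
    return match.group(1) if match else None
```

A document that does not begin with a DOCTYPE of this form yields None, which is roughly the situation in which KGV starts pestering you.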
Now, most Web browsers don't actually use an SGML parser (in fact, none
that I'm aware of do), and so they don't need a DOCTYPE
declaration, and will ignore it if present. KGV, however, does use
an SGML parser, and therefore needs a DOCTYPE
declaration. KGV is more insistent on this point than
WebTechs, which
would insert a DOCTYPE on the fly for you; KGV requires
that your DOCTYPE already be in the document.
So now you're preparing to add a DOCTYPE to your
document. Be sure that the syntax is as described above, and that you
use the correct public identifier; otherwise, KGV will use the wrong
DTD, or will be unable to find a DTD at all, and will produce a huge
list of absolutely meaningless errors. KGV's public
identifier catalog lists all the public identifiers KGV recognizes
for various types of HTML; of those, the following public
identifiers are most likely to be widely recognized:
"-//IETF//DTD HTML 2.0//EN" for HTML 2.0
"-//IETF//DTD HTML 3.0//EN" for HTML 3.0
"-//W3C//DTD HTML 3.2//EN" for HTML 3.2
"-//Netscape Comm. Corp.//DTD HTML//EN" for Netscape
"-//Microsoft//DTD Internet Explorer 2.0 HTML//EN" for MSIE
Note that the string must appear exactly as shown, including case.
[Editor's note: To be pedantic, the "official" public identifier for HTML 3.0, according to the catalog from the HTML standard, is:
"-//W3C//DTD HTML 3 1995-03-24//EN"
KGV's catalog includes this public identifier, but WebTechs' public identifier catalog doesn't; if you use this public identifier, then, your document will fail to validate under WebTechs.]
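The exact-match requirement for the identifiers listed above can be sketched as a plain table lookup. This is only an illustration in Python: the dictionary and function names are assumptions of mine, and KGV's real catalog is a separate file with more entries than these five.

```python
# Hypothetical in-memory stand-in for a public identifier catalog,
# holding only the five identifiers listed in the text.
KNOWN_PUBLIC_IDS = {
    "-//IETF//DTD HTML 2.0//EN": "HTML 2.0",
    "-//IETF//DTD HTML 3.0//EN": "HTML 3.0",
    "-//W3C//DTD HTML 3.2//EN": "HTML 3.2",
    "-//Netscape Comm. Corp.//DTD HTML//EN": "Netscape",
    "-//Microsoft//DTD Internet Explorer 2.0 HTML//EN": "MSIE",
}

def lookup_dtd(public_id):
    """Return the HTML flavor for public_id, or None if it is not listed."""
    # A dictionary lookup compares strings exactly, so case and spacing matter.
    return KNOWN_PUBLIC_IDS.get(public_id)
```

Because the comparison is exact, an identifier typed in the wrong case simply is not found.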
Future plans for KGV include the addition of DTDs that incorporate
recent IETF work on HTML, such as the recent table draft and the
proposed <EMBED> element.
WARNING:
Some HTML editors will insert a DOCTYPE declaration for
you. Unfortunately, this pre-inserted DOCTYPE will
sometimes confuse KGV. This usually occurs in one (or both) of two
ways:
The DOCTYPE does not correspond to the generated HTML. For instance, one popular editor will insert a DOCTYPE with the public identifier for HTML 3.0, even though the HTML it generates generally does not conform to the HTML 3.0 DTD.
The DOCTYPE includes a system identifier. For instance, one popular editor produces a DOCTYPE of the form:
<!DOCTYPE HTML PUBLIC "..." "...">
This is legal SGML syntax; the second quoted string is called a system identifier, and defines a system-specific method of finding the DTD (it might, for instance, point to a file on the local file system). The problem is that KGV tries to find the DTD via this system identifier, fails, and becomes confused.
If your editor adds a DOCTYPE to your page, you may need to
correct it as described above before running your page through KGV.
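If you have many pages to fix, the system identifier correction can be automated. The following is a minimal Python sketch, assuming the DOCTYPE sits at the very start of the file and uses the two-string form shown above; the function name is hypothetical, and the "html32.dtd" in the usage note is a made-up system identifier.

```python
import re

# Matches a leading DOCTYPE that carries both a public identifier and a
# system identifier; the group keeps everything up to the public identifier.
DOCTYPE_WITH_SYSTEM_ID = re.compile(
    r'\A(<!DOCTYPE\s+HTML\s+PUBLIC\s+"[^"]*")\s+"[^"]*"\s*>'
)

def strip_system_identifier(document):
    """Drop the system identifier from a leading DOCTYPE, if there is one."""
    return DOCTYPE_WITH_SYSTEM_ID.sub(r"\1>", document, count=1)
```

Given a page starting with `<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN" "html32.dtd">`, this leaves `<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">` and does not touch the rest of the page. It does nothing about the other problem, a public identifier that doesn't match the generated HTML; that one you have to fix by hand.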
Which DOCTYPE should I use?
In SGML, a document can only use the elements and attributes defined in its associated DTD. If your document uses vendor-specific or recently standardized features, you must therefore use a DTD that defines the necessary elements.
Nowadays, you will probably want to use one of the 4.0 DTDs; which one depends on what features you're using. For frameset documents, or documents that use frame-related elements or attributes like <A TARGET>, you will want to use the 4.0 Frameset DTD (hence the name):
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN"
"http://www.w3.org/TR/REC-html40/frameset.dtd">
For documents using any of the more common deprecated Netscape/MSIE presentational elements or attributes like <FONT>, you'll need 4.0 Transitional:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
Ideally, you'll want to use 4.0 Strict:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"
"http://www.w3.org/TR/REC-html40/strict.dtd">
with presentational aspects specified in a style sheet. If your document gets unrecognized element or attribute errors with a 4.0 Strict DOCTYPE, though, this usually means you need to use 4.0 Transitional.
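The choice among the three 4.0 DTDs boils down to a short decision, sketched here in Python. The function and its two flags are hypothetical names of mine, not anything KGV provides; the Frameset and Transitional declarations are the ones given above, and the Strict one is the declaration from the HTML 4.0 Recommendation.

```python
FRAMESET = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN"\n'
            '"http://www.w3.org/TR/REC-html40/frameset.dtd">')
TRANSITIONAL = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"\n'
                '"http://www.w3.org/TR/REC-html40/loose.dtd">')
STRICT = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"\n'
          '"http://www.w3.org/TR/REC-html40/strict.dtd">')

def choose_doctype(uses_frames, uses_presentational_markup):
    """Pick a 4.0 DOCTYPE from two yes/no facts about the document."""
    if uses_frames:
        # Framesets, or frame-related attributes such as <A TARGET>.
        return FRAMESET
    if uses_presentational_markup:
        # Deprecated presentational markup such as <FONT>.
        return TRANSITIONAL
    # Everything else: presentation belongs in a style sheet.
    return STRICT
```

Frames are checked first on the assumption that a document with frame-related markup needs the Frameset DTD regardless of what else it uses.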
HTML comment syntax is unfortunately a complex subject, not only because HTML inherited a rather weird comment syntax from SGML (where comments are actually considered a kind of markup declaration), but because so many browsers get it spectacularly wrong. First, here's the full definition of an SGML comment declaration:
A comment declaration begins with `<!', ends with the character `>', and contains zero or more comments.
A comment begins and ends with `--', and may not contain `--'. Whitespace is allowed after any comment, but not before the first comment. Non-whitespace characters are not allowed before or after comments.
A `>' inside a comment does not close the comment declaration.
Here are some examples:
<!-- A simple comment declaration -->
<!-- An equally simple comment declaration -- >
<!-- A comment declaration -- -- with more
than one --
-- comment in it --
>
<!-- This `>' is inside a comment, and so doesn't close
the comment declaration; the next one does, because it's
not inside a comment. -->
<!>
And here are some examples of incorrect comments:
<!-- This comment is not terminated >
<!-- Neither is this one ->
<!-- -- Text is not allowed between comments -- -- >
<! The initial `--' is not optional >
<! -- Nor is whitespace allowed before it -->
<!------>
That last one is particularly pernicious, since many people use variants
of it as "dividing bars" in their HTML source. Why is it invalid? Note
that it has six `-' characters, or three `--'
pairs; the first pair opens a comment, the second pair closes it, and
the third pair opens a second comment. Thus, the `>'
that appears to close the comment declaration doesn't, because it's
inside a comment; and the comment declaration proceeds merrily along,
unterminated and out of control. Now, if there had been eight
`-' characters (or any other multiple of four), everything
would have been fine, since each comment in the comment declaration
would have been closed.
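The rules above are mechanical enough to check with a few lines of code. Here is a minimal Python sketch of such a checker; the function name is my own, it looks at a single comment declaration in isolation, and it is meant to illustrate the counting argument rather than stand in for a real SGML parser.

```python
def is_valid_comment_declaration(text):
    """Return True if text is exactly one well-formed SGML comment declaration."""
    if not text.startswith("<!"):
        return False
    pos = 2
    seen_comment = False
    while pos < len(text):
        if text.startswith("--", pos):
            # A "--" opens a comment; find the "--" that closes it.
            close = text.find("--", pos + 2)
            if close == -1:
                return False  # never closed, so any ">" was inside the comment
            pos = close + 2
            seen_comment = True
        elif text[pos] == ">":
            # Outside a comment, ">" ends the declaration; nothing may follow.
            return pos == len(text) - 1
        elif text[pos].isspace() and seen_comment:
            pos += 1  # whitespace is allowed, but only after a comment
        else:
            return False  # stray text, or whitespace before the first comment
    return False  # input ended before the closing ">"

print(is_valid_comment_declaration("<!------>"))    # six dashes: False
print(is_valid_comment_declaration("<!-------->"))  # eight dashes: True
```

With six dashes, the third `--' pair opens a comment that never closes, so the checker gives up; with eight, every comment is closed before the `>' is reached.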
And just to throw a monkey wrench into the works, many popular browsers have ad hoc (and thoroughly broken) comment parsers. Notably, various versions of Mosaic and Netscape (which share a great deal of parsing code) incorrectly parse some or all of the incorrect comment examples above as correct, and some of the correct comment examples above as incorrect --- and which ones they get right and wrong varies from version to version.
If you want your comments to work on both broken and non-broken
browsers, the safest thing to do is to restrict yourself to the
following subset of legal comment syntax that all browsers (as far as I
know) will handle correctly: comments begin with
`<!--', end with `-->', and may not
contain `--' or `>'.
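That safe subset is easy to test for. Below is a minimal Python sketch, with a hypothetical function name; as an extra precaution of my own (not part of the rule above), it also rejects a comment body that starts or ends with `-', since that would put three or more dashes in a row at a delimiter.

```python
def is_portable_comment(text):
    """Return True if text is a comment in the conservative subset."""
    # The opening "<!--" and closing "-->" must not overlap.
    if len(text) < 7 or not (text.startswith("<!--") and text.endswith("-->")):
        return False
    body = text[4:-3]
    if body.startswith("-") or body.endswith("-"):
        return False  # extra precaution: no dash runs at the delimiters
    # The body itself may contain neither "--" nor ">".
    return "--" not in body and ">" not in body
```

Note that this is deliberately stricter than SGML: a declaration ending in `-- >' is perfectly legal, but it falls outside the subset, so the function rejects it.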
Last update 01 Sep 2010
dsb@killerbunnies.org