A little background on SGML
A "markup language" is a set of conventions for encoding text. Typical examples are annotations or other marks that tell a compositor (or a typesetting system such as TeX) how to lay out a section of text, proofreader's marks, and structural annotations.
Standard Generalized Markup Language (SGML) is an international standard which defines a grammar for constructing markup languages. In a sense, it is a meta-language which a language author uses to construct a markup language for a particular use. Applications of SGML are typically descriptive, in that they describe the document structure (somewhat like the higher level parts of LaTeX) rather than describing the procedural details of how the text should be processed.
Our particular interest in SGML is that the HyperText Markup Language (HTML) used by the World Wide Web is, in principle, an application of SGML. In practice, very few web browsers are based on SGML parsers (unfortunately), and some of the proprietary extensions to HTML (e.g. the Netscape "enhancements") do not conform to SGML standards and hence cannot be properly described by an SGML document type. Regardless, testing your documents against a real SGML parser is still the best way to ensure that they obey the rules for properly constructed HTML and are therefore usable in the widest range of browsers and summarizers. In particular, note that the search engine used on our web server may not fully summarize invalid HTML documents.
The DOCTYPE declaration
To use SGML tools on your HTML documents, you must include a valid SGML DOCTYPE declaration that tells the SGML parser what document type definition (DTD) to use to parse your documents. A list of the locally supported DOCTYPE declarations can be found in the file /usr/local/lib/sgml/catalog. For IETF standard HTML 2.0, use
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
or
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML Strict//EN">
as the first line of your document.
For the (expired) HTML 3.0 draft, I have set the catalog to accept
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3 1995-03-24//EN">
or
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.0//EN">
These may confuse some validation services, which expect to see IETF declared as the owner of the HTML 3.0 DTD, but since 3.0 is no longer a standards track document, listing IETF as the owner is strictly incorrect.
For HTML 3.2, use
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
If you insist on using Netscapisms, you can try the doctype
<!DOCTYPE HTML PUBLIC "-//Netscape Comm. Corp.//DTD HTML//EN">
Be warned that this is not an official DTD--Netscape has never issued a DTD for their features, or even given a sufficiently precise description of them to unambiguously specify their grammar. This DTD also does not reflect many of the latest Netscape features.
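A quick way to confirm that a file starts with a declaration of this form, before handing it to a parser, can be sketched in shell. This is only an illustration, not one of the locally installed tools: the function name has_doctype is made up, and it checks only that the first line looks like an HTML DOCTYPE declaration with a quoted public identifier, not that the identifier appears in the catalog.

```shell
# has_doctype FILE
# Hypothetical helper (not part of sp or the local scripts).
# Succeeds if the first line of FILE starts with an HTML DOCTYPE
# declaration that carries a quoted public identifier.
has_doctype() {
    # Declaration keywords are normally accepted in either case, hence -i.
    head -n 1 "$1" | grep -q -i '^<!DOCTYPE HTML PUBLIC "[^"]*"'
}
```

It could be used as: has_doctype file.html || echo "file.html: no DOCTYPE on the first line"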
Checking your document with nsgmls
Once you have a DOCTYPE declaration, you can use the SGML parser from the sp package, nsgmls, to validate your document. There is a simple script, html-ncheck, which takes care of the command line options for simple HTML document validation, so just do
html-ncheck file.html
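The local html-ncheck script is not reproduced here, but a wrapper of this kind can be sketched in shell under a few assumptions: that the catalog and SGML declaration live at the /usr/local/lib/sgml paths named on this page, and that the installed nsgmls accepts -s (suppress normal output, so only errors are printed) and -c (name a catalog file). The function name ncheck and the NCHECK_DRY_RUN variable are made up for illustration; the real script may well differ.

```shell
# Hypothetical stand-in for html-ncheck; the real script may differ.
# Paths are the local ones named on this page.
SGML_DIR=/usr/local/lib/sgml

# ncheck FILE
# Validate FILE with nsgmls. With NCHECK_DRY_RUN set to a non-empty
# value, print the command line instead of running it.
ncheck() {
    # Build the command as the positional parameters so that a file
    # name containing spaces is passed through as a single argument.
    set -- nsgmls -s -c "$SGML_DIR/catalog" "$SGML_DIR/html.decl" "$1"
    if [ -n "${NCHECK_DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

Setting NCHECK_DRY_RUN=1 before calling ncheck prints the command instead of running it, which is a cheap way to check the paths before nsgmls is invoked.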
sp can also normalize your HTML document. Normalization inserts omitted end tags, expands minimized attributes, etc., producing a version of your document with all the hidden assumptions revealed. This can be very useful for finding out where your paragraphs really begin and end and for identifying redundant markup. Some SGML tools may also require input in full normalized form. The script html-spam provides a simplified interface to the sp normalizer, spam.
gf can convert documents written in some SGML doctypes (including HTML) into a variety of presentation formats, including LaTeX, texinfo, and plain text. gf uses nsgmls to parse the input document, so what you feed it must be fully conformant HTML 2.0 with a valid doctype declaration. See the gf info pages for usage.
Editing with psgml
psgml is an Emacs major mode for editing SGML. Since psgml is driven by the SGML DTD, it knows which elements and attributes are valid at any point in your document, can normalize your document, and can find "trouble spots". To use psgml mode for HTML, add something like this to your .emacs:
;; psgml stuff
(autoload 'sgml-mode "psgml" "Major mode to edit SGML files" t)
(setq sgml-always-quote-attributes t    ; expected by many clients
      sgml-auto-insert-required-elements t
      sgml-indent-data nil
      sgml-indent-step 2
      sgml-auto-activate-dtd t          ; preload dtd
      sgml-omittag nil
      sgml-shorttag nil
      sgml-recompile-out-of-date-cdtd 'ask
      sgml-set-face window-system
      sgml-validate-command "nsgmls -s %s %s")
;; menus for creating new documents
(setq sgml-custom-dtd
      '(("HTML 2.0"
         "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML//EN\">")
        ("HTML 2.0 Level 1"
         "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML Level 1//EN//2.0\">")
        ("HTML 3.0"
         "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.0//EN\">")))
(defun html-mode ()
  (interactive)
  (sgml-mode)
  (make-local-variable 'sgml-declaration)
  (make-local-variable 'sgml-default-doctype-name)
  (setq sgml-declaration "/usr/local/lib/sgml/html.decl"
        sgml-default-doctype-name "html"
        sgml-indent-step 0
        sgml-indent-data nil
        sgml-minimize-attributes nil
        sgml-omittag t
        sgml-shorttag t))
(if window-system
    (progn
      (setq sgml-markup-faces '((start-tag . font-lock-keyword-face)
                                (end-tag . font-lock-keyword-face)
                                (comment . font-lock-comment-face)
                                (pi . font-lock-string-face)
                                (sgml . font-lock-reference-face)
                                (doctype . font-lock-variable-name-face)
                                (entity . font-lock-function-name-face)
                                (shortref . font-lock-type-face)))
      (add-hook 'sgml-mode-hook
                (function
                 (lambda ()
                   (require 'font-lock)
                   (font-lock-make-faces))))))
;; end psgml stuff
See the psgml info pages for details.