
Local SGML Tools

A little background on SGML

A "markup language" is a set of conventions for encoding text. Typical examples are annotations or other marks that tell a compositor (or a typesetting system such as TeX) how to lay out a section of text, proofreader's marks, and structural annotations. Standard Generalized Markup Language (SGML) is an international standard that defines a grammar for constructing markup languages. In a sense, it is a meta-language which a language author uses to construct a markup language for a particular use. Applications of SGML are typically descriptive: they describe the document structure (somewhat like the higher-level parts of LaTeX) rather than the procedural details of how the text should be processed.

Our particular interest in SGML is that the HyperText Markup Language (HTML) used by the World Wide Web is, in principle, an application of SGML. In practice, unfortunately, very few web browsers are based on SGML parsers, and some of the proprietary extensions to HTML (e.g. the Netscape "enhancements") do not conform to SGML standards and hence cannot be properly described by an SGML document type. Regardless, testing your documents against a real SGML parser is still the best way to ensure that they obey the rules for properly constructed HTML and are therefore usable in the widest range of browsers and summarizers. In particular, note that the search engine used on our web server may not fully summarize invalid HTML documents.

The DOCTYPE declaration

To use SGML tools on your HTML documents, you must include a valid SGML DOCTYPE declaration that tells the SGML parser what document type definition (DTD) to use to parse your documents. A list of the locally supported DOCTYPE declarations can be found in the file /usr/local/lib/sgml/catalog. For IETF standard HTML 2.0, use

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">

or

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML Strict//EN">

as the first line of your document.

For the (expired) HTML 3.0 draft, I have set the catalog to accept

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3 1995-03-24//EN">

or

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.0//EN">

These may confuse some validation services which expect to see IETF declared as the owner of the HTML 3.0 DTD, but since 3.0 is no longer a standards-track document, listing IETF as the owner is strictly incorrect.

For HTML 3.2, use

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">

If you insist on using Netscapisms, you can try the doctype

<!DOCTYPE HTML PUBLIC "-//Netscape Comm. Corp.//DTD HTML//EN">

Be warned that this is not an official DTD: Netscape has never issued a DTD for its features, or even described them precisely enough to specify their grammar unambiguously. This DTD also does not reflect many of the latest Netscape features.
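Before running the parser, it can be handy to check that the DOCTYPE at the top of a document is one the local catalog actually lists. The following is a minimal Python sketch, not part of the local tool set. It assumes the catalog uses the common `PUBLIC "public identifier" system-identifier` entry form, it does not skip catalog comments, and the system identifiers in SAMPLE_CATALOG (html.dtd, html-3.2.dtd) are hypothetical file names chosen for illustration. The function names are likewise invented for this example.

```python
import re

# One catalog entry: PUBLIC "public identifier" system-identifier
# (the system identifier may be quoted or bare). Assumed entry form.
_PUBLIC_ENTRY = re.compile(
    r'''\bPUBLIC\s+(["'])(?P<pubid>.*?)\1\s+(?:(["'])(?P<quoted>.*?)\3|(?P<bare>\S+))''',
    re.IGNORECASE | re.DOTALL,
)

# A DOCTYPE declaration with a public identifier.
_DOCTYPE = re.compile(
    r'''<!DOCTYPE\s+\w+\s+PUBLIC\s+(["'])(?P<pubid>.*?)\1''',
    re.IGNORECASE | re.DOTALL,
)

# Hypothetical catalog contents; the real file is /usr/local/lib/sgml/catalog.
SAMPLE_CATALOG = '''
PUBLIC "-//IETF//DTD HTML//EN" html.dtd
PUBLIC "-//W3C//DTD HTML 3.2//EN" "html-3.2.dtd"
'''


def normalize_pubid(pubid):
    """Collapse runs of whitespace so identifiers compare reliably."""
    return " ".join(pubid.split())


def catalog_public_ids(catalog_text):
    """Map each PUBLIC identifier in catalog_text to its system identifier."""
    entries = {}
    for m in _PUBLIC_ENTRY.finditer(catalog_text):
        quoted = m.group("quoted")
        system_id = quoted if quoted is not None else m.group("bare")
        # Keep the first entry seen for a given public identifier.
        entries.setdefault(normalize_pubid(m.group("pubid")), system_id)
    return entries


def document_pubid(html_text):
    """Return the public identifier of a leading DOCTYPE, or None."""
    m = _DOCTYPE.match(html_text.lstrip())
    return normalize_pubid(m.group("pubid")) if m else None


def doctype_is_supported(html_text, catalog_text):
    """True if the document's DOCTYPE public identifier is in the catalog."""
    pubid = document_pubid(html_text)
    return pubid is not None and pubid in catalog_public_ids(catalog_text)
```

To use it against the real catalog, read /usr/local/lib/sgml/catalog and the HTML file into strings and pass them to doctype_is_supported. A False result means either that the document has no leading DOCTYPE or that its public identifier is not listed.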

Using the SGML tools

Checking your document with nsgmls

Once you have a DOCTYPE declaration, you can use the SGML parser from the sp package, nsgmls, to validate your document. There is a simple script, html-ncheck, which takes care of the command line options for simple HTML document validation, so just do

html-ncheck file.html
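For readers who want to call the parser from their own scripts, here is a rough Python sketch of a validation wrapper. It is not the html-ncheck script itself, whose exact options are not shown here. The only parser option used is -s, taken from the sgml-validate-command setting in the psgml configuration on this page. The function names are invented, and treating a nonzero exit status as "invalid" with messages on standard error is an assumption about how nsgmls reports problems.

```python
import shutil
import subprocess


def validation_command(path, parser="nsgmls"):
    """Build the argument list for a parse-only run of the SGML parser."""
    return [parser, "-s", path]


def validate(path):
    """Run the parser on path and return (ok, messages).

    Assumption: a nonzero exit status means the document did not validate,
    and the parser's messages arrive on standard error.
    """
    cmd = validation_command(path)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(cmd[0] + " was not found on PATH")
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0, result.stderr
```

Calling validate("file.html") returns a pair: a flag saying whether the run succeeded, and whatever the parser printed about the document.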

sp can also normalize your HTML document. Normalization inserts omitted end tags, expands minimized attributes, etc., producing a version of your document with all the hidden assumptions revealed. This can be very useful for finding out where your paragraphs really begin and end and for identifying redundant markup. Some SGML tools may also require input in fully normalized form. The script html-spam provides a simplified interface to the sp normalizer spam.
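To make "expands minimized attributes" concrete, the toy Python sketch below rewrites a start tag such as <DL COMPACT> as <DL COMPACT="COMPACT">. It only illustrates the idea and is not how spam works: a real normalizer is driven by the DTD, whereas this sketch uses a hard-coded set of HTML 2.0 attribute names and simple regular expressions that will be fooled by a ">" inside an attribute value. The names MINIMIZABLE and expand_minimized are invented for this example.

```python
import re

# HTML 2.0 attributes that are commonly written in minimized form.
# Hard-coded for illustration; a real normalizer reads this from the DTD.
MINIMIZABLE = {"COMPACT", "ISMAP", "CHECKED", "SELECTED", "MULTIPLE"}

# A start tag: element name followed by its raw attribute text.
_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)([^<>]*)>")

# One attribute: a name, optionally followed by = and a value.
_ATTR = re.compile(
    r"""([A-Za-z][-.A-Za-z0-9]*)(\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'>]+))?"""
)


def expand_minimized(html):
    """Rewrite minimized attributes in start tags as NAME="NAME"."""

    def fix_attr(m):
        name, value = m.group(1), m.group(2)
        if value is None and name.upper() in MINIMIZABLE:
            return '{0}="{0}"'.format(name)
        return m.group(0)  # leave ordinary attributes untouched

    def fix_tag(m):
        return "<" + m.group(1) + _ATTR.sub(fix_attr, m.group(2)) + ">"

    return _TAG.sub(fix_tag, html)
```

For example, expand_minimized('<INPUT TYPE=checkbox CHECKED>') yields '<INPUT TYPE=checkbox CHECKED="CHECKED">'. End tags, comments, and the DOCTYPE line are left alone because they do not match the start-tag pattern.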

Formatting with gf

gf can convert documents written in some SGML doctypes (including HTML) into a variety of presentation formats, including LaTeX, texinfo, and plain text. gf uses nsgmls to parse the input document, so what you feed it must be fully conformant HTML 2.0 with a valid doctype declaration. See the gf info pages for usage.

Editing with psgml

psgml is an emacs major mode for editing SGML. Since psgml is driven by the SGML DTD, it actually knows what elements and attributes are valid at any point in your document, can normalize your document, and can find "trouble spots". To use psgml mode for HTML, add something like this to your .emacs:

; psgml stuff
(autoload 'sgml-mode "psgml" "Major mode to edit SGML files" t)

(setq sgml-always-quote-attributes t      ; expected by many clients
      sgml-auto-insert-required-elements t
      sgml-indent-data nil
      sgml-indent-step 2
      sgml-auto-activate-dtd t            ; preload dtd
      sgml-omittag nil
      sgml-shorttag nil
      sgml-recompile-out-of-date-cdtd 'ask
      sgml-set-face window-system
      sgml-validate-command "nsgmls -s %s %s")

; menus for creating new documents
(setq sgml-custom-dtd
      '(("HTML 2.0"
         "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML//EN\">")
        ("HTML 2.0 Level 1"
         "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML Level 1//EN//2.0\">")
        ("HTML 3.0"
         "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.0//EN\">")))

(defun html-mode ()
  "Edit HTML with psgml, using the local HTML SGML declaration."
  (interactive)
  (sgml-mode)
  (make-local-variable 'sgml-declaration)
  (make-local-variable 'sgml-default-doctype-name)
  (setq sgml-declaration             "/usr/local/lib/sgml/html.decl"
        sgml-default-doctype-name    "html"
        sgml-indent-step             0
        sgml-indent-data             nil
        sgml-minimize-attributes     nil
        sgml-omittag                 t
        sgml-shorttag                t))


(if window-system
    (progn
      (setq sgml-markup-faces
            '((start-tag . font-lock-keyword-face)
              (end-tag   . font-lock-keyword-face)
              (comment   . font-lock-comment-face)
              (pi        . font-lock-string-face)
              (sgml      . font-lock-reference-face)
              (doctype   . font-lock-variable-name-face)
              (entity    . font-lock-function-name-face)
              (shortref  . font-lock-type-face)))

      (add-hook 'sgml-mode-hook
                (function
                 (lambda ()
                   (require 'font-lock)
                   (font-lock-make-faces))))))
; end psgml stuff

See the psgml info pages for details.
Topic revision: r3 - 24 May 2006, ChristopherTerranova