Markup technology has a long and rich history. In the 1960s, while developing an integrated document storage, editing, and publishing system at IBM, Charles Goldfarb, Edward Mosher, and Raymond Lorie devised a text-based markup format. It extended the concepts of generic coding (block-level tagging that was both machine-parsable and meaningful to human authors) to include formal, nested elements that defined the type and structure of the document being processed. This format was called the Generalized Markup Language (GML). GML was a success, and as it was more widely deployed, the American National Standards Institute (ANSI) invited Goldfarb to join its Computer Languages for Text Processing committee to help develop a text description standard based on GML. The result was the Standard Generalized Markup Language (SGML). In addition to the flexibility and semantic richness offered by GML, SGML incorporated concepts from other areas of information theory, most notably inter-document link processing and a practical means to programmatically validate markup documents by ensuring that the content conformed to a specific grammar. These features (and many more) made SGML a natural and capable fit for larger organizations that needed to ensure consistency across vast repositories of documents. By the time the final ISO SGML standard was published in 1986, it was in heavy use by bodies as diverse as the Association of American Publishers, the U.S. Department of Defense, and the European Laboratory for Particle Physics (CERN).
In 1990, while developing a linked information system for CERN, Tim Berners-Lee hit on the notion of creating a small, easy-to-learn subset of SGML. It would allow people who were not markup experts to easily publish interconnected research documents over a network, specifically the Internet. The Hypertext Markup Language (HTML) and its sibling network technology, the Hypertext Transfer Protocol (HTTP), were born. Four years later, after widespread and enthusiastic adoption of HTML by academic research circles throughout the globe, Berners-Lee and others formed the World Wide Web Consortium (W3C) in an effort to create an open but centralized organization to lead the development of the Web.
Without a doubt, HTML brought markup technology into the mainstream. Its simple grammar, combined with a proliferation of HTML-specific markup presentation applications (web browsers) and public commercial access to the Internet, sparked what can only be called a popular electronic markup publishing explosion. No longer was markup solely the domain of information technology specialists working with complex, mainframe-based publishing tools inside the walls of huge organizations. Anyone with a home PC, a dial-up Internet account, and the patience to learn HTML’s intentionally forgiving syntax and grammar could publish their own rich hypertext documents for the rest of the wired world to see and enjoy.
HTML made markup popular, but it was a single, predefined grammar that only indicated how a document was to be presented visually in a web browser. That meant much of the flexibility offered by markup technology in general was simply lost. All the markup reliably communicated was how the document was supposed to look, not what it was supposed to mean. In the mid-1990s, work began at the W3C to create a new subset of SGML for use on the Web, one that provided the flexibility and best features of its predecessor but could be processed by faster, lighter tools that reflected the needs of the emerging web environment. In 1996, W3C members Tim Bray and C. M. Sperberg-McQueen presented the initial draft for this new “simplified SGML for the Web”: the Extensible Markup Language (XML). Two years later, in 1998, after much discussion and rigorous review, the W3C published XML 1.0 as an official recommendation.
In the six years since, interest in XML has steadily grown. While XML is not as ubiquitous as some claim, tools to process it are available for the most popular programming languages, and it has been used in some fairly novel (though not always appropriate) ways. Given its generic nature, inherent flexibility, and the ways in which it has been (or can be) used, XML is hard to pigeonhole. It remains largely an enigma to many developers. At its core, XML is nothing more or less than a text-based format for applying structure to documents and other data. Uses for XML are (and will continue to be) many and varied, but looking back at its history, one inextricably bound to automated document publishing, helps to provide a reasonable context.
Many people, especially those coming to XML from a web-development background, seem to expect either that it is intended to replace HTML or that it is somehow HTML: The Next Generation. Neither is the case. Although both are markup languages, HTML defines a specific markup grammar (set of elements, allowed structures) intended for consumption by a single type of application: an HTML web browser. XML, on the other hand, does not define a grammar at all. Rather, it is designed to allow developers to use (or create) a grammar that best reflects the structure and meaning of the information being captured. In other words, it gives you a clear way to create the rich, reusable source content crucial to modern adaptive web-publishing systems.
To understand the value of using a more semantically meaningful markup grammar, consider the task of publishing a poetry collection. If you know HTML and want to get the collection onto the Web quickly, you could create a document, such as the one shown in Example 1-1, for each poem.
Example 1-1. poem.html
<html>
  <head>
    <title>Post-Geek-chic Folk Poetry Collection</title>
  </head>
  <body>
    <h1>An Ode To Directed Acyclic Graphs</h1>
    <p><i>by: Anonymous</i></p>
    <p>
      I think that I shall never see, <br>
      a document that cannot be represented as a tree.
    </p>
  </body>
</html>
If your only goal is to publish your poetic gems on the Web for
people to view in a browser, then once you upload the documents to
the right location on an appropriate server somewhere, the job is
done. What if you want to do more? At the very least, you will
probably want an index document containing a list of links to the
poems in your collection. If the collection remains small and time is
not a consideration, you could create this index by hand. More
likely, though, because you are a professional web developer, you
would create a small script to extract information (title
and author) from the poems themselves to create the index document
programmatically. That’s when the weakness in your
approach begins to show. Specifically, using HTML to mark up your
poetry only gave you a way to present the work visually. In your
attempt to extract the title and author’s name, you
are forced to impose meaning based solely on inference and your
knowledge of the conventions used when marking up the poems. You can
infer that the first h1
element contains the title of the poem, but nothing states this
explicitly. You must trust that all poems in the collection will
follow the same structure. In the best case, you can only guess and
hope that your guess holds up in the long run.
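To make the guesswork concrete, here is a minimal Python sketch of such an extraction script, using only the standard library's html.parser module. The names PoemGuesser and guess_title_and_author are hypothetical, invented for this illustration, and the rules it applies (first h1 is the title, first p after it is the byline, a leading "by" is dropped) are assumptions drawn from the conventions in Example 1-1, not from anything the markup itself states.

```python
from html.parser import HTMLParser

# Sample input: the poem from Example 1-1.
POEM_HTML = """<html>
<head><title>Post-Geek-chic Folk Poetry Collection</title></head>
<body>
<h1>An Ode To Directed Acyclic Graphs</h1>
<p><i>by: Anonymous</i></p>
<p>I think that I shall never see, <br>
a document that cannot be represented as a tree.</p>
</body>
</html>"""


class PoemGuesser(HTMLParser):
    """Guess title and author purely from presentational conventions."""

    def __init__(self):
        super().__init__()
        self.title = None      # text of the first h1 (assumed to be the title)
        self.author = None     # text of the first p after that h1 (assumed byline)
        self._capture = None   # name of the field currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and self.title is None:
            self._capture = "title"
        elif tag == "p" and self.title is not None and self.author is None:
            self._capture = "author"

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if (tag, self._capture) in (("h1", "title"), ("p", "author")):
            text = "".join(self._buffer).strip()
            # Heuristic: drop a leading "by" and any colon or spaces after it.
            if self._capture == "author" and text.lower().startswith("by"):
                text = text[2:].lstrip(": ").strip()
            setattr(self, self._capture, text)
            self._capture, self._buffer = None, []


def guess_title_and_author(html_text):
    """Return a (title, author) pair inferred from the HTML conventions."""
    parser = PoemGuesser()
    parser.feed(html_text)
    parser.close()
    return parser.title, parser.author
```

For the sample poem, guess_title_and_author(POEM_HTML) yields the expected title and "Anonymous". The same rules quietly fail as soon as a document strays from the convention: a byline marked up as `<p>Byron</p>`, with no "by:" prefix, comes back as "ron" because the heuristic cannot tell a prefix from a name.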
Marking up your poetry collection in XML can help you avoid such
ambiguities. It is not the use of XML, per se,
that helps. Rather, XML gives you a familiar syntax (nested
angle-bracketed tags with attributes, such as those in HTML) while
offering the flexibility to choose a grammar that more intimately
describes the structure and meaning of the content. It would help
simplify your indexing script, for example, if something like an
author element contained the
author’s name. You would not have to rely on an
unstable heuristic such as “the string that follows
the word ‘by,’ optionally contained in an
i element, that is in the first
p element after the first h1
element in the document” to extract the data.
Essentially, you want to use a more exact, domain-specific grammar
whose structures and elements convey the meaning of the data. XML
provides a means to do that.
Not surprisingly, marking up poetic content is a task that others
before you have faced. A quick web search reveals several XML
grammars designed for this purpose. A short evaluation of each
shows that the Document Type
Definition (DTD) from Project Gutenberg (a volunteer effort
led by the HTML Writers Guild to
make the world’s great literature available as
electronic text) fits your needs nicely. Using the grammar defined by
poemsfrag.dtd, the sample poem from your
collection takes the form shown in Example 1-2.
Example 1-2. poem.xml
<?xml version="1.0"?>
<poem>
  <title>An Ode To Directed Acyclic Graphs</title>
  <author>Anonymous</author>
  <verse>
    <line>I think that I shall never see,</line>
    <line>a document that cannot be represented as a tree.</line>
  </verse>
</poem>
Using this more specific grammar makes extracting the title and
author data for the index document completely unambiguous: you
simply grab the contents of the title and
author elements, respectively. In addition, you
can now easily generate other interesting metadata, such as the
number of verses per poem, the average lines per verse, and so on,
without dubious guesswork. Moreover, having an explicit, concrete
Document Type Definition that describes your chosen grammar provides
the chance to programmatically validate the structure of each poem you
add to the collection. This helps to ensure the integrity of the data
from the outset.
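As a rough illustration of how little code the explicit grammar demands, the following Python sketch reads the title and author directly and computes the verse statistics mentioned above. It uses the standard library's xml.etree.ElementTree module; the function name summarize_poem is hypothetical, and the sketch assumes only the element names visible in Example 1-2 (poem, title, author, verse, line), with verse elements as direct children of poem. It does not perform DTD validation, which needs a validating parser.

```python
import xml.etree.ElementTree as ET

# Sample input: the poem from Example 1-2.
POEM_XML = """<?xml version="1.0"?>
<poem>
  <title>An Ode To Directed Acyclic Graphs</title>
  <author>Anonymous</author>
  <verse>
    <line>I think that I shall never see,</line>
    <line>a document that cannot be represented as a tree.</line>
  </verse>
</poem>"""


def summarize_poem(xml_text):
    """Return title, author, and simple verse statistics for one poem."""
    poem = ET.fromstring(xml_text)
    # Assumption: verse elements are direct children of the poem element.
    verses = poem.findall("verse")
    line_counts = [len(verse.findall("line")) for verse in verses]
    return {
        "title": poem.findtext("title"),
        "author": poem.findtext("author"),
        "verses": len(verses),
        # Guard against poems with no verses to avoid dividing by zero.
        "avg_lines_per_verse": (
            sum(line_counts) / len(line_counts) if line_counts else 0.0
        ),
    }
```

Calling summarize_poem(POEM_XML) returns the title and author exactly as marked up, along with a verse count of 1 and an average of 2.0 lines per verse. No string matching or positional guessing is involved, because the element names carry the meaning.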
Choosing the best grammar (or data model, if you must) for your content is crucial: get it right and the tools to process your documents will grow logically from the structure; get it wrong and you will spend the life of the project working around a weak foundation. Designing useful markup grammars that hold up over time is an art in itself; resist the urge to create your own just because you can. Chances are there is already a grammar available for the class of documents you will mark up. Evaluate what’s available. Even if you decide to go your own way, the time spent seeing how others approached the same problem more than pays for itself.
Switching to XML and the poemsfrag.dtd grammar
arguably adds significant value to your documents: the structure
reveals (or imposes) the intended meaning of
the content. At the very least, this reduces time wasted on messy
guessing both for those marking up the poems and for those writing
tools to process those poems. However, you lose something, as well.
You can no longer simply upload the documents to a web server and
expect browsers to do the right thing when rendering them (as you
could when they were marked up as HTML). There is a gap between the
grammar that is most useful to us, as authors and tool builders, and
the grammar that an HTML web browser expects. Since
publishing your poetry online was the goal in the first place, unless
you can bridge that gap (and easily too), then really, you take a