The Evolution of XML

XML is a descendant of SGML, the Standard Generalized Markup Language. The language that would eventually become SGML was invented by Charles F. Goldfarb, Ed Mosher, and Ray Lorie at IBM in the 1970s and developed by several hundred people around the world until its eventual adoption as ISO standard 8879 in 1986. SGML was intended to solve many of the same problems XML solves in much the same way XML solves them. It is a semantic and structural markup language for text documents. SGML is extremely powerful and achieved some success in the U.S. military and government, in the aerospace sector, and in other domains that needed ways of efficiently managing technical documents that were tens of thousands of pages long.

SGML’s biggest success was HTML, which is an SGML application. However, HTML is just one SGML application, and it offers nowhere near the full power of SGML itself. Because it restricts authors to a finite set of tags designed to describe web pages, and describes them in a fairly presentation-oriented way at that, it’s really little more than a traditional markup language that has been adopted by web browsers. It doesn’t lend itself to use beyond the single application of web page design. You would not use HTML to exchange data between incompatible databases or to send updated product catalogs to retailer sites, for example. HTML does web pages, and it does them very well, but it only does web pages.

SGML was the obvious choice for other applications that took advantage of the Internet but were not simple web pages for humans to read. The problem was that SGML is complicated—very, very complicated. The official SGML specification is over 150 very technical pages. It covers many special cases and unlikely scenarios. It is so complex that almost no software has ever implemented it fully. Programs that implemented or relied on different subsets of SGML were often incompatible with each other. The special feature one program considered essential would be considered extraneous fluff and omitted by the next program.

In 1996, Jon Bosak, Tim Bray, C. M. Sperberg-McQueen, James Clark, and several others began work on a “lite” version of SGML that retained most of SGML’s power while trimming a lot of the features that had proven redundant, too complicated to implement, confusing to end users, or simply not useful over the previous 20 years of experience with SGML. The result, in February of 1998, was XML 1.0, and it was an immediate success. Many developers who knew they needed a structural markup language but hadn’t been able to bring themselves to accept SGML’s complexity adopted XML wholeheartedly. It was used in domains ranging from court filings to hog farming.

However, XML 1.0 was just the beginning. The next standard out of the gate was Namespaces in XML, an effort to allow markup from different XML applications to be used in the same document without conflicting. Thus a web page about books could have a title element that referred to the title of the page and title elements that referred to the title of a book, and the two would not conflict.
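The two kinds of title element can be illustrated with a minimal sketch in Python, using the standard library's ElementTree module. The `http://example.com/ns/books` namespace URI, the `bk` prefix, and the sample document are assumptions made up for this illustration; only the XHTML namespace URI is a real one.

```python
import xml.etree.ElementTree as ET

# A web page about books. The page's own markup lives in the XHTML namespace;
# the book markup lives in a hypothetical namespace invented for this example.
DOC = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:bk="http://example.com/ns/books">
  <head><title>Books I Have Read</title></head>
  <body>
    <bk:book><bk:title>Moby-Dick</bk:title></bk:book>
    <bk:book><bk:title>Middlemarch</bk:title></bk:book>
  </body>
</html>"""

root = ET.fromstring(DOC)

# Prefixes used in queries are local to this mapping; what matters is the URI.
ns = {
    "h": "http://www.w3.org/1999/xhtml",
    "bk": "http://example.com/ns/books",
}

# Both elements have the local name "title", but their namespaces differ,
# so each query matches only the kind of title it asks for.
page_title = root.find("h:head/h:title", ns).text
book_titles = [t.text for t in root.iterfind(".//bk:title", ns)]

print(page_title)
print(book_titles)
```

Because each element is identified by its namespace URI plus its local name, a program looking for book titles never picks up the page title by accident.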

Next up was the Extensible Stylesheet Language (XSL), an XML application for transforming XML documents into a form that could be viewed in web browsers. This soon split into XSL Transformations (XSLT) and XSL Formatting Objects (XSL-FO). XSLT has become a general-purpose language for transforming one XML document into another, whether for web page display or some other purpose. XSL-FO is an XML application for describing the layout of both printed pages and web pages; it approaches PostScript in power and expressiveness.

However, XSL is not the only option for styling XML documents. Cascading Style Sheets (CSS) were already in use for HTML documents when XML was invented, and they proved to be a reasonable fit to XML as well. With the advent of CSS Level 2, the W3C made styling XML documents an explicit goal for CSS. The pre-existing Document Style Sheet and Semantics Language (DSSSL) was also adopted from its roots in the SGML world to style XML documents for print and the Web.

The Extensible Linking Language, XLink, began by defining more powerful linking constructs that could connect XML documents in a hypertext network and that made HTML’s A tag look like an abbreviation for “anemic.” It also split into two separate standards: XLink for describing the connections between documents and XPointer for addressing the individual parts of an XML document. At this point, it was noticed that both XPointer and XSLT were developing fairly sophisticated yet incompatible syntaxes to do exactly the same thing: identify particular elements in an XML document. Consequently, the addressing parts of both specifications were split off and combined into a third specification, XPath. A little later, yet another part of XLink budded off to become XInclude, a syntax for building complex documents by combining individual documents and document fragments.
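To give a feel for what "identifying particular elements" looks like in practice, here is a small Python sketch. It relies on the standard library's ElementTree module, which supports only a limited subset of XPath rather than the full language, and the catalog document is invented for the example.

```python
import xml.etree.ElementTree as ET

# Sample data made up for this illustration.
CATALOG = """<catalog>
  <book year="1851"><title>Moby-Dick</title></book>
  <book year="1871"><title>Middlemarch</title></book>
  <magazine year="1851"><title>Harper's</title></magazine>
</catalog>"""

root = ET.fromstring(CATALOG)

# Titles of book elements (not magazines) whose year attribute is 1851.
titles_1851 = [t.text for t in root.findall("./book[@year='1851']/title")]

# Every title element at any depth, regardless of its parent.
all_titles = [t.text for t in root.findall(".//title")]

print(titles_1851)
print(all_titles)
```

The path expressions are plain strings, which is what allows the same addressing syntax to be shared by otherwise unrelated tools.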

Another piece of the puzzle was a uniform interface for accessing the contents of the XML document from inside a Java, JavaScript, or C++ program. The simplest API was merely to treat the document as an object that contained other objects. Indeed, work was already underway inside and outside the W3C to define such a Document Object Model (DOM) for HTML. Expanding this effort to cover XML was not hard.
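The "objects that contain other objects" view can be sketched with `xml.dom.minidom`, a lightweight DOM implementation in the Python standard library. The order document is an assumption made up for this example, and minidom implements only part of the DOM, so this shows the general style rather than the complete interface.

```python
from xml.dom import minidom

# Parse the whole document into a tree of node objects held in memory.
doc = minidom.parseString(
    "<order id='42'><item>tea</item><item>milk</item></order>"
)
order = doc.documentElement

# Navigate the tree: each item element contains a single text node.
items = [node.firstChild.data for node in order.getElementsByTagName("item")]

# Modify the tree in place by creating and attaching new nodes.
new_item = doc.createElement("item")
new_item.appendChild(doc.createTextNode("sugar"))
order.appendChild(new_item)

order_id = order.getAttribute("id")
serialized = order.toxml()

print(items)
print(order_id)
print(serialized)
```

Because the entire document is available as a tree, a program can move around it and change it freely, at the cost of keeping all of it in memory.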

Outside the W3C, David Megginson, Peter Murray-Rust, and other members of the xml-dev mailing list recognized that third-party XML parsers, while all compatible in the documents they could parse, were incompatible in their APIs. This led to the development of the Simple API for XML, or SAX. In 2000, SAX2 was released, adding greater configurability, namespace support, and a cleaner API.
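SAX was originally defined for Java, but Python's standard library includes a binding, `xml.sax`, that follows the same event-driven style. The sketch below is one way to collect the text of every title element; the `TitleCollector` class and the sample document are hypothetical names and data invented for illustration.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collects the text of each <title> element as the parser streams by."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # The parser may deliver one run of text in several chunks,
        # so the pieces are buffered until the end tag arrives.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


handler = TitleCollector()
xml.sax.parseString(
    b"<catalog><book><title>Moby-Dick</title></book>"
    b"<book><title>Middlemarch</title></book></catalog>",
    handler,
)
print(handler.titles)
```

Unlike a tree-based API, the parser here never builds the whole document in memory. It calls the handler's methods as it reads, and the handler keeps only what it needs.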

One of the surprises during the evolution of XML was that developers adopted it more for record-like structures, such as serialized objects and database tables, than for the narrative structures for which SGML had traditionally been used. DTDs worked very well for narrative structures, but they had some limits when faced with the record-like structures developers were actually creating. In particular, the lack of data typing and the fact that DTDs were not themselves XML documents were perceived as major problems. A number of companies and individuals began working on schema languages that addressed these deficiencies. Many of these proposals were submitted to the W3C, which formed a working group to try to merge the best parts of all of these and come up with something greater than the sum of its parts. In 2001, this group released Version 1.0 of the W3C XML Schema Language. Unfortunately, this language proved overly complex and burdensome. Consequently, several developers went back to the drawing board to invent cleaner, simpler, more elegant schema languages, including RELAX NG and Schematron.

Eventually, it became apparent that XML 1.0, XPath, the W3C XML Schema Language, SAX, and DOM all had similar but subtly different conceptual models of the structure of an XML document. For instance, XPath and SAX don’t consider CDATA sections to be anything more than syntactic sugar, but DOM does treat them differently from plain-text nodes. Thus, the W3C XML Core Working Group began work on an XML Information Set that all these standards could rely on and refer to.
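The CDATA discrepancy is easy to observe from Python. In this sketch, `xml.dom.minidom` stands in for DOM, and ElementTree stands in for the models that see only text; ElementTree is not one of the specifications named above, so treat it as an assumption chosen for convenience. The two sample documents are invented for the example.

```python
import xml.etree.ElementTree as ET
from xml.dom import Node, minidom

# The same character content, written once as a CDATA section
# and once with an escaped less-than sign.
WITH_CDATA = "<p><![CDATA[a < b]]></p>"
WITH_ESCAPE = "<p>a &lt; b</p>"


def dom_child_type(xml_text):
    """Return the DOM node type of the first child of the root element."""
    return minidom.parseString(xml_text).documentElement.firstChild.nodeType


def tree_text(xml_text):
    """Return the root element's text as ElementTree reports it."""
    return ET.fromstring(xml_text).text


# The DOM implementation reports two different kinds of node...
cdata_is_distinct_in_dom = (
    dom_child_type(WITH_CDATA) == Node.CDATA_SECTION_NODE
    and dom_child_type(WITH_ESCAPE) == Node.TEXT_NODE
)

# ...while ElementTree reports identical text for both documents.
same_text_in_tree = tree_text(WITH_CDATA) == tree_text(WITH_ESCAPE)

print(cdata_is_distinct_in_dom, same_text_in_tree)
```

Two programs reading the same document through different APIs can therefore disagree about its structure, which is the kind of mismatch a shared information set is meant to pin down.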

As more and more XML documents of higher and higher value began to be transmitted across the Internet, a need was recognized to secure and authenticate these transactions. Besides using existing mechanisms built into the underlying protocols, such as SSL and HTTP digest authentication, developers created formats that secure the XML documents themselves and that operate over a document’s entire life span rather than just while it’s in transit. XML Encryption, a standard XML syntax for encrypting digital content, including portions of XML documents, addresses the need for confidentiality. XML Signature, a joint IETF and W3C standard for digitally signing content and embedding those signatures in XML documents, addresses the problem of authentication. Because digital signature and encryption algorithms are defined in terms of byte sequences rather than XML data models, both XML Signature and XML Encryption are based on Canonical XML, a standard serialization format that removes all insignificant differences between documents, such as whitespace inside tags and whether single or double quotes delimit attribute values.
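The reason byte-oriented algorithms need a canonical form can be shown in a few lines of Python. This sketch assumes Python 3.8 or later, whose `xml.etree.ElementTree.canonicalize` function implements the later Canonical XML 2.0 specification. The two sample documents and the use of SHA-256 are choices made for this illustration, and the digest stands in for only one step of signing, not a complete XML Signature.

```python
import hashlib
from xml.etree.ElementTree import canonicalize

# Two serializations of the same document. They differ only in quote style,
# whitespace inside tags, and attribute order, none of which is significant.
doc_a = "<doc  id='1'   lang = \"en\" ><p>text</p></doc >"
doc_b = '<doc lang="en" id="1"><p>text</p></doc>'


def raw_digest(xml_text):
    """Hash the bytes exactly as written."""
    return hashlib.sha256(xml_text.encode("utf-8")).hexdigest()


def canonical_digest(xml_text):
    """Hash the canonical serialization instead of the original bytes."""
    return hashlib.sha256(canonicalize(xml_text).encode("utf-8")).hexdigest()


# Hashing the raw bytes treats the two documents as different...
raw_digests_match = raw_digest(doc_a) == raw_digest(doc_b)

# ...but hashing their canonical forms treats them as the same document.
canonical_digests_match = canonical_digest(doc_a) == canonical_digest(doc_b)

print(raw_digests_match, canonical_digests_match)
```

Without canonicalization, a signature could be invalidated by a tool that merely re-serialized the document with different quotes or spacing.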

Through all this, the core XML 1.0 specification remained unchanged. All of this new functionality was layered on top of XML 1.0 rather than modifying it at the foundation. This is a testament to the solid design and strength of XML. However, XML 1.0 itself was based on Unicode 2.0, and as Unicode continued to evolve and add new scripts such as Mongolian, Cambodian, and Burmese, XML was falling behind. Primarily for this reason, XML 1.1 was released in early 2004. It should be noted, however, that XML 1.1 offers little to interest developers working in English, Spanish, Japanese, Chinese, Arabic, Russian, French, German, Dutch, or the many other languages already supported in Unicode 2.0.

Doubtless, many new extensions of XML remain to be invented. And even this rich collection of specifications only addresses technologies that are core to XML. Much more development has been done and continues at an accelerating pace on XML applications, including SOAP, SVG, XHTML, MathML, Atom, XForms, WordprocessingML, and thousands more. XML has proven itself a solid foundation for many diverse technologies.

