XML and HTML are called markup languages because of the way they add structure to plain-text documents—by surrounding parts of the text with tags that indicate structure or meaning, much as someone with a pen might highlight a sentence and add a note. While HTML predefines a set of tags and their structure, XML is a blank slate in which the author gets to define the tags, the rules, and their meanings.
Both XML and HTML owe their lineage to Standard Generalized Markup Language (SGML)—the mother of all markup languages. SGML has been used in the publishing industry for decades (including at O’Reilly). But it wasn’t until the Web captured the world that it came into the mainstream through HTML. HTML started as a very small application of SGML, and if HTML has done anything at all, it has proven that simplicity reigns.
When Tim Berners-Lee began postulating the Web back at CERN in the late 1980s, he wanted to organize project information using hypertext with links embedded in plain text.[49] When the Web needed a protocol, HTTP—a simple, text-based client-server protocol—was invented. So, what exactly is so enchanting about the idea of plain text? Why, for example, didn’t Tim turn to the Microsoft Word format as the basis for web documents? Surely a binary, non-human-readable format and a similarly machine-oriented protocol would be more efficient? Since the Web’s inception, there have now been literally trillions of HTTP transactions. Was it really a good idea for them to use (English) words like “GET” and “POST” as part of the protocol?
The answer, as we’ve all seen, is yes! Whatever humans can read and understand, human developers can work with more easily. There is a time and place for a high level of optimization (and obscurity), but when the goal is universal acceptance and cross-platform portability, simplicity and transparency are paramount. This is the first fundamental proposition of XML: simple and nominally human-readable data.
Using text to exchange data is not exactly a new idea, either, but historically, for every new document format that came along, a new parser would have to be written. A parser is an application that reads a document and understands its formatting conventions, usually enforcing some rules about the content. For example, the Java Properties class has a parser for the standard properties file format (Chapter 11). In our simple spreadsheet in Chapter 18, we wrote a parser capable of understanding basic mathematical expressions. As we’ve seen, depending on complexity, parsing can be quite tricky.
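As a small reminder of what a format-specific parser looks like from the caller's side, the following sketch feeds a few lines of properties text to the parser built into java.util.Properties. The class name PropertiesParseSketch and the host and port settings are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical demo class: shows the format-specific parser inside java.util.Properties.
class PropertiesParseSketch {

    // Parses text written in the standard properties file format.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            // load() understands comments, "=" and ":" separators, and escapes.
            props.load(new StringReader(text));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Made-up settings used only for this illustration.
        String text = "# sample settings\nhost = example.com\nport: 8080\n";
        Properties props = parse(text);
        System.out.println(props.getProperty("host")); // example.com
        System.out.println(props.getProperty("port")); // 8080
    }
}
```

All of the knowledge about comments, separators, and whitespace lives inside that one class and applies to that one format, which is the kind of per-format effort XML is meant to remove.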
With XML, we can represent data without having to write this kind of custom parser. This isn’t to say that it’s reasonable to use XML for everything (e.g., typing math expressions into our spreadsheet), but for the common types of information that we exchange on the Net, we shouldn’t have to write parsers that deal with basic syntax and string manipulation. In conjunction with document-verifying components (Document Type Definitions [DTDs] or XML Schema), much of the complex error checking is also done automatically. This is the second fundamental proposition of XML: standardized parsing and validation.
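To make the idea of standardized parsing concrete, here is a minimal sketch that hands a string to the DOM parser in javax.xml.parsers and lets it do the syntax checking. The class name StandardParserSketch, the parseOrNull() helper, and the inventory document are assumptions made up for this illustration. The sketch checks only well-formedness and does not validate against a DTD or XML Schema.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class: the standard parser handles all of the syntax work.
class StandardParserSketch {

    // Returns the parsed tree, or null if the text is not well-formed XML.
    static Document parseOrNull(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // The parser rejected the document. Its default error handler
            // may also report the problem on standard error.
            return null;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up document used only for this illustration.
        String good = "<inventory><animal><name>Song Fang</name></animal></inventory>";
        String bad = "<inventory><animal></inventory>"; // <animal> is never closed

        Document doc = parseOrNull(good);
        System.out.println(doc.getDocumentElement().getTagName());
        System.out.println(doc.getElementsByTagName("name").item(0).getTextContent());
        System.out.println(parseOrNull(bad) == null);
    }
}
```

Nothing in this code knows about angle brackets, quoting, or nesting rules. The application works only with the resulting tree.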
The APIs we’ll discuss in this chapter are powerful and popular. They are being used around the world to build enterprise-scale systems every day. In recent years, Java-to-XML binding with JAXB has been vastly streamlined and simplified (primarily through the use of Java annotations to replace configuration files and support a “code first” methodology). However, as with any popular technology, its limitations have become better recognized, and some complexity has crept into what began as simple concepts. In the area of browser-based applications, some have turned to JavaScript Object Notation (JSON) as an even lighter-weight approach that maps natively to JavaScript, especially for transient communications between client and server. However, XML tools are still widely used in this area as well. Google’s Protocol Buffers encoding scheme is another example of a system-to-system communication format that has been used in place of XML, in this case where very high performance trumps flexibility. But XML remains the most powerful general format for document and data exchange, with the widest array of tool support.
All the basic APIs for working with XML are now bundled with the standard release of Java. This includes the javax.xml standard extension packages for working with Simple API for XML (SAX), Document Object Model (DOM), XML binding (JAXB), and Extensible Stylesheet Language (XSL) transforms, as well as APIs such as XPath and XInclude. If you are using an older version of Java, you can still use many of these tools, but you will have to download these packages separately.
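As a quick check that these packages really are available without extra downloads, the sketch below uses only the bundled javax.xml.xpath API to pull a value out of a document. The class name BundledXPathSketch, the evaluate() helper, and the inventory document are hypothetical and exist only for this illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class: uses only APIs that ship with the standard Java release.
class BundledXPathSketch {

    // Evaluates an XPath expression against XML text and returns the result as a string.
    static String evaluate(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            // Thrown for an invalid expression or an unparseable document.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up document used only for this illustration.
        String xml = "<inventory><animal><name>Song Fang</name></animal></inventory>";
        System.out.println(evaluate(xml, "/inventory/animal/name")); // Song Fang
    }
}
```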
All modern web browsers support XML explicitly, both in terms of simple rendering of XML content and also client-side transformation of XML into HTML for display. If you load an XML document in your browser, it will generally be displayed as a tree with controls that allow you to collapse and expand nodes (like an outline). Displaying XML in this way is used mainly for debugging, but JavaScript can also support client-side XSL transformation directly in the browser. XSL is a language for transforming XML into other documents; we’ll talk about it later in this chapter.
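The same kind of XML-to-HTML transformation can be run outside the browser with the bundled javax.xml.transform API. The following is a minimal sketch, not the approach any particular browser uses. The class name XslTransformSketch, the stylesheet, and the inventory document are all hypothetical, and the exact whitespace of the output may vary between XSLT implementations.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class: turns a small XML document into an HTML list with XSL.
class XslTransformSketch {

    // Made-up stylesheet: emits one <li> per <animal>, containing its <name>.
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"html\"/>"
        + "<xsl:template match=\"/inventory\">"
        + "<ul><xsl:for-each select=\"animal\">"
        + "<li><xsl:value-of select=\"name\"/></li>"
        + "</xsl:for-each></ul>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the given stylesheet text to the given XML text.
    static String transform(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up document used only for this illustration.
        String xml = "<inventory><animal><name>Song Fang</name></animal>"
                   + "<animal><name>Cocoa</name></animal></inventory>";
        System.out.println(transform(xml, STYLESHEET));
    }
}
```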
When XML is viewed in older browsers or in contexts that do not explicitly format XML for viewing, the browser generally displays just the text of the document, with all the tags (structural information) stripped off. This is the prescribed behavior for working with unknown XML markup in a viewing environment. Remember that you can always use the “view source” option to display the text of a file in your browser if you want to see the original source.
[49] To read Berners-Lee’s original proposal to CERN, go to http://www.w3.org/History/1989/proposal.html.