XML, the Extensible Markup Language, lets developers create their own formats for storing and sharing information. Using that freedom, developers have created documents representing an incredible range of information, and XML can ease many different information-sharing problems. A key part of this process is the formal declaration and documentation of those formats, which gives developers a foundation on which to build software.
An XML schema language is a formalization of the constraints, expressed as rules or a model of structure, that apply to a class of XML documents. In many ways, schemas serve as design tools, establishing a framework on which implementations can be built. Because software designers need formal definitions to build on, formalizing the constraints and structures of XML instance documents opens the way to very diverse applications. Although new applications for schemas are being invented every day, most of them can be classified as validation, documentation, query, binding, or editing.
Validation is the most common use for schemas in the XML world. There are many reasons and opportunities to validate an XML document: when we receive one, before importing data into a legacy system, when we have produced or hand-edited one, or when we test the output of an application. In all these cases, a schema helps to accomplish a substantial part of the job. Different kinds of schemas perform different kinds of validation, and some especially complex rules may be better expressed in procedural code than in a descriptive schema, but validation is generally the initial purpose of a schema, and often the primary purpose as well.
Validation can be considered a “firewall” against the diversity of XML. We need such firewalls principally in two situations: to serve as actual firewalls when we receive documents from the external world (as is commonly the case with Web Services and other XML communications), and to provide check points when we design processes as pipelines of transformations. By validating documents against schemas, you can ensure that the documents’ contents conform to your expected set of rules, simplifying the code needed to process them.
Validation of documents can substantially reduce the risk of processing XML documents received from sources beyond your control. It doesn’t remove either the need to follow the administration rules of your chosen communication protocol or the need to write robust applications, but it’s a useful additional layer of tests that fits between the communications interface and your internal code.
Validation can take place at several levels. Structural validation makes certain that XML element and attribute structures meet specified requirements, but doesn’t clarify much about the textual content of those structures. Data validation looks more closely at the contents of those structures, ensuring that they conform to rules about what type of information should be present. Other kinds of validation, often called business rules, may check relationships between pieces of information and apply higher-level sanity checks, but this is usually the domain of procedural code, not schema-based validation.
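The three levels can be illustrated with a minimal sketch in Python. It uses hand-written checks from the standard library rather than a schema processor, and the `<order>` vocabulary, its element names, and the rule that the total equals quantity times price are all assumptions invented for the illustration.

```python
import re
import xml.etree.ElementTree as ET
from decimal import Decimal, InvalidOperation

# Hypothetical <order> vocabulary, used only for this illustration.
REQUIRED = ("quantity", "price", "total")


def parse_decimal(text):
    """Return a finite Decimal, or None if the text is not a decimal number."""
    try:
        value = Decimal(text.strip())
    except InvalidOperation:
        return None
    return value if value.is_finite() else None


def structural_errors(order):
    """Structural level: is the expected element structure present?"""
    if order.tag != "order":
        return [f"unexpected root element <{order.tag}>"]
    return [f"missing <{name}>" for name in REQUIRED if order.find(name) is None]


def data_errors(order):
    """Data level: does the text content have the expected type?"""
    errors = []
    if not re.fullmatch(r"[0-9]+", order.findtext("quantity").strip()):
        errors.append("<quantity> is not a non-negative integer")
    for name in ("price", "total"):
        if parse_decimal(order.findtext(name)) is None:
            errors.append(f"<{name}> is not a decimal number")
    return errors


def business_errors(order):
    """Business rule: a relationship between values, written as procedural code."""
    quantity = int(order.findtext("quantity"))
    price = parse_decimal(order.findtext("price"))
    total = parse_decimal(order.findtext("total"))
    if total != quantity * price:
        return ["<total> does not equal quantity times price"]
    return []


def validate(xml_text):
    """Run the levels in order and stop at the first level that reports errors."""
    order = ET.fromstring(xml_text)
    for level in (structural_errors, data_errors, business_errors):
        errors = level(order)
        if errors:
            return errors
    return []
```

Each level relies on the one before it: the data checks assume the elements exist, and the business rule assumes the values already parse. With a schema language, the first two functions would typically be replaced by declarations, while the third would remain code.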
XML is a good foundation for pipelines of transformations using widely available tools. Since each of these transformations introduces a risk of error, and each error is easier to fix when detected near its source, it is good practice to introduce check points in the pipeline where the documents are validated. Some applications will find that validating after each step is an overhead cost they can’t bear, while others will find that it is crucial to detect the errors just as they happen, before they can cause any harm and when they are still easy to diagnose. Different situations may have different validation requirements, and it may make sense to validate more heavily during pipeline design than during production deployment.
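The idea of check points can be sketched as follows, assuming a pipeline modeled as a list of (transformation, check) pairs. The function names, the `<list>`/`<item>` vocabulary, and the deliberately faulty step are hypothetical, and the checks are plain Python functions standing in for schema validation.

```python
import xml.etree.ElementTree as ET


class CheckpointError(Exception):
    """Raised when a document fails the check that follows a step."""


def run_pipeline(doc, steps, validate=True):
    """Apply each (transform, check) pair in order, validating after each step."""
    for transform, check in steps:
        doc = transform(doc)
        if validate:
            problems = check(doc)
            if problems:
                # Report the error next to the step that introduced it.
                raise CheckpointError(f"after {transform.__name__}: {problems}")
    return doc


def number_items(root):
    """Step 1: give every <item> a sequential id attribute."""
    for n, item in enumerate(root.findall("item"), start=1):
        item.set("id", str(n))
    return root


def strip_attributes(root):
    """Step 2: a deliberately faulty step that loses the ids."""
    for item in root.findall("item"):
        item.attrib.clear()
    return root


def items_have_ids(root):
    """Check point: list every <item> that lacks an id attribute."""
    return [f"item {n} has no id"
            for n, item in enumerate(root.findall("item"), start=1)
            if "id" not in item.attrib]


def build_doc():
    """Build a fresh two-item document for each run."""
    return ET.fromstring("<list><item/><item/></list>")
```

Calling `run_pipeline(build_doc(), [(number_items, items_have_ids), (strip_attributes, items_have_ids)])` raises a `CheckpointError` that names `strip_attributes`, which is the point of validating near the source of an error. Passing `validate=False` skips the checks, the kind of switch a production deployment might use once the pipeline design has been tested.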
XML schemas are frequently used to document XML vocabularies, even when validation isn’t a requirement. Schemas provide a formal description of the vocabulary with a precision and conciseness that can be difficult to achieve in prose. It is very unusual to publish the specification of a new XML vocabulary without attaching some form of XML schema.
The machine-readability of schemas gives them several advantages as documentation. Human-readable documentation can be generated from the schema’s formal description. Schema IDEs, for instance, provide graphical views that help to understand the structure of the documents. Developers can also create XSLT transformations that generate a description of the structure. (This technique was used to generate the structure of Chapters 15 and 16 from the W3C XML Schema for W3C XML Schema published on the W3C web site.)
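As a rough illustration of generating a readable description from a schema, the sketch below walks a small W3C XML Schema with Python's standard library instead of XSLT. The `book` schema is invented for the example, and the function handles only inline element and attribute declarations; references, named types, and most other schema features are ignored.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# A hypothetical schema, small enough to read at a glance.
SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="pages" type="xs:positiveInteger" minOccurs="0"/>
      </xs:sequence>
      <xs:attribute name="isbn" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def describe(schema_text):
    """Return an indented outline of the elements and attributes declared inline."""
    lines = []

    def walk(node, depth):
        for child in node:
            indent = "  " * depth
            if child.tag == XS + "element":
                type_name = child.get("type", "complex content")
                optional = " (optional)" if child.get("minOccurs") == "0" else ""
                lines.append(f"{indent}element {child.get('name')}: {type_name}{optional}")
                walk(child, depth + 1)
            elif child.tag == XS + "attribute":
                type_name = child.get("type", "unspecified")
                lines.append(f"{indent}attribute {child.get('name')}: {type_name}")
            else:
                # Containers such as xs:complexType and xs:sequence add no outline level.
                walk(child, depth)

    walk(ET.fromstring(schema_text), 0)
    return lines
```

Printing `"\n".join(describe(SCHEMA))` gives a four-line outline: the `book` element, its `title` and optional `pages` children, and its `isbn` attribute.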
We will see, in Chapter 14, that W3C XML Schema has introduced additional facilities to annotate schemas with both structured and unstructured information, making it easier to use schemas explicitly as a documentation framework.
The first versions of XPath and XSLT were defined to work without any explicit understanding of the structure of the documents being manipulated. This has worked well, but has imposed performance and functionality limits. Knowledge of the document’s structure could make optimizers more efficient, and some functions, such as sorts and equality testing, may be improved by a datatype system. The second versions of XPath and XSLT and the first version of XQuery (a new specification defining an XML query language that is still a work in progress) will rely on the availability of a W3C XML Schema for those features.
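The effect of datatype knowledge on sorting and equality testing can be seen in a small Python sketch. The `<prices>` document is made up for the example, and Python's `Decimal` stands in for a schema datatype; this is not how an XPath or XQuery processor is implemented.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical document: four prices stored as text, as in any XML document.
DOC = "<prices><price>10</price><price>9</price><price>100</price><price>9.0</price></prices>"

values = [price.text for price in ET.fromstring(DOC).iter("price")]

# Without type information, values are compared as strings.
untyped_order = sorted(values)                 # ['10', '100', '9', '9.0']
untyped_equal = values[1] == values[3]         # '9' == '9.0' is False

# With the knowledge that the values are decimals, they are compared as numbers.
typed_order = sorted(values, key=Decimal)      # ['9', '9.0', '10', '100']
typed_equal = Decimal(values[1]) == Decimal(values[3])  # True
```

The document is the same in both cases. Only the knowledge that a `price` is a decimal number, which is the kind of information a schema can declare, changes the result.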
Although it isn’t especially difficult to write applications that process XML documents using the SAX, DOM, and similar APIs, it is a low-level task, both repetitive and error-prone. The cost of building and maintaining these programs grows rapidly as the number of elements and attributes in a vocabulary grows. The idea of automating this work by “binding” the information available in XML documents directly into the structures of applications (generally as objects or RDBMS tables) is probably as old as markup.
Ronald Bourret, who maintains a list of XML Data Binding Resources at http://www.rpbourret.com/xml/XMLDataBinding.htm, makes a distinction between design time and runtime binding tools. While runtime binding tools do their best to perform a binding based on the structure of the documents and applications discovered by introspection, design time binding tools rely on a model formalized in a schema of some kind. He describes design time binding tools as “usually more flexible in the mappings they can support.”
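The difference between the two approaches can be sketched in a few lines of Python. Neither function corresponds to a real binding tool: `bind_runtime` discovers the structure from the document itself, while the `Book` dataclass stands in for a class that a design time tool would generate from a schema. The document and all names are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from types import SimpleNamespace
from typing import get_type_hints

DOC = "<book><title>XML Schema</title><pages>400</pages></book>"


def bind_runtime(element):
    """Runtime binding: introspect the document; every value stays a string."""
    return SimpleNamespace(**{child.tag: child.text for child in element})


@dataclass
class Book:
    """Stands in for a class generated at design time from a schema."""
    title: str
    pages: int


def bind_design_time(element, cls):
    """Design time binding: the declared model drives the names and the types."""
    values = {}
    for name, field_type in get_type_hints(cls).items():
        text = element.findtext(name)
        if text is None:
            raise ValueError(f"missing <{name}> in <{element.tag}>")
        values[name] = field_type(text)  # convert the text to the declared type
    return cls(**values)


root = ET.fromstring(DOC)
loose = bind_runtime(root)             # loose.pages is the string "400"
book = bind_design_time(root, Book)    # book.pages is the integer 400
```

The runtime version needs no prior knowledge but can only guess at types and required content. The design time version knows both, because the model was formalized before any document was read.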
Many different languages, either binding-specific languages or general-purpose XML schema languages, define these bindings. W3C XML Schema has a lot of traction in this area; many data-binding tools began supporting W3C XML Schema in its early drafts, well before the specification was finalized.
XML editors (and SGML editors before them) have long used schemas to present users with appropriate choices over the course of document creation and editing. While DTDs provided structural information, recent XML schema languages add more sophisticated structural information and datatype information.
The W3C is creating a standard API that can be used by guided editing applications to ask a schema processor which action can be performed at a certain location in a document—for instance: “Can I insert this new element here?”, “Can I update this text node to this value?”, etc. The Document Object Model (DOM) Level 3 Abstract Schemas and Load and Save Specification (which is still a work in progress) defines “Abstract Schemas” generic enough to cover both DTDs and W3C XML Schema (and potentially other schema languages as well). When finalized and widely adopted, this API should allow you to plug the schema processor of your choice into any editing application.
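The kind of question such an API answers can be illustrated with a toy content model in Python. This is not the DOM Level 3 API: the `CONTENT_MODELS` table, the `can_insert` function, and the `book` vocabulary are hypothetical. The model is a simple ordered sequence with a maximum number of occurrences per element, and minimum occurrences are deliberately ignored because a document being edited is often incomplete.

```python
import xml.etree.ElementTree as ET

# Hypothetical content models: for each parent, the ordered list of allowed
# children with their maximum number of occurrences (None means unbounded).
CONTENT_MODELS = {
    "book": [("title", 1), ("author", None), ("isbn", 1)],
}


def fits_model(names, model):
    """Check that the names follow the model's order and occurrence limits."""
    position = 0
    counts = [0] * len(model)
    for name in names:
        # Move forward to the slot that accepts this name; never move backward.
        while position < len(model) and model[position][0] != name:
            position += 1
        if position == len(model):
            return False  # unknown element, or element out of order
        counts[position] += 1
        limit = model[position][1]
        if limit is not None and counts[position] > limit:
            return False  # too many occurrences
    return True


def can_insert(parent, index, name):
    """Answer the question: can a <name> element be inserted at this position?"""
    model = CONTENT_MODELS.get(parent.tag)
    if model is None:
        return False
    names = [child.tag for child in parent]
    names.insert(index, name)
    return fits_model(names, model)


book = ET.fromstring("<book><title>XML Schema</title><isbn>0596002521</isbn></book>")
```

Here `can_insert(book, 1, "author")` is true, while inserting an `author` before the `title`, a second `title`, or an undeclared `price` is refused. An editor could use answers like these to offer only the choices that keep the document on a path toward validity.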
Another approach to editing applications builds editors from the information provided in schemas. Combined with information about presentation and controls, these tools let users edit XML documents in applications custom-built for a particular schema. For example, the W3C XForms specification (which is still a work in progress) proposes to separate the logic and layout of the form from the structure of the data to edit, and relies on a W3C XML Schema to define this structure.
XML 1.0 included a set of tools for defining XML document structures, called Document Type Definitions (DTDs). DTDs define which element and attribute structures are permitted in a document, and provide mechanisms for supplying default values for attributes, defining reusable content (entities), and declaring some kinds of metadata information (notations). While DTDs are widely supported and used, many XML developers quickly outgrew the capabilities DTDs provide. An alternative schema proposal, XML-Data, was even submitted to the W3C before XML 1.0 was a Recommendation.
The World Wide Web Consortium (W3C), keeper of the XML specification, sought to build a new language for describing XML documents. It needed to provide more precision in describing document structures and their contents, to support XML namespaces, and to use an XML vocabulary to describe XML vocabularies. The W3C’s XML Schema Working Group spent two years developing two normative Recommendations, XML Schema Part 1: Structures, and XML Schema Part 2: Datatypes, along with a nonnormative Recommendation, XML Schema Part 0: Primer.
W3C XML Schema is designed to support all of these applications. An initial set of requirements, formally described in the XML Schema Requirements Note (http://www.w3.org/TR/NOTE-xml-schema-req), listed a wide variety of usage scenarios for schemas as well as the design principles that guided its creation.
In the rest of this book, we explore the details of W3C XML Schema and its many capabilities, focusing on how to apply it to specific XML document situations.