Words, words, mere words, no matter from the heart.
In this section, we talk about DTDs and XML Schema, two ways to enforce rules in an XML document. A DTD is a simple grammar guide for an XML document, defining which tags may appear where, in what order, with what attributes, etc. XML Schema is the next generation of DTD. With XML Schema, you can describe the data content of the document as well as the structure. XML Schemas are written in terms of primitives, such as numbers, dates, and simple regular expressions, and also allow the user to define complex types in a grammar-like fashion. The word schema means a blueprint or plan for structure, so we’ll refer to DTDs and XML Schema collectively as schema where either applies.
DTDs, although much more limited in capability, are still widely
used. This may be partly due to the complexity involved in writing XML
Schemas by hand. The W3C XML Schema standard is verbose and cumbersome,
which may explain why several alternative syntaxes have sprung up. The
javax.xml.validation API
performs XML validation in a pluggable way. Out of the box, it supports
only W3C XML Schema, but new schema languages can be added in the future.
Validating with a DTD is supported as an older feature directly in the SAX
parser. We’ll use both in this section.
XML’s validation of documents is a key piece of what makes it useful as a data format. Using a schema is somewhat analogous to the way Java classes enforce type checking in the language. A schema defines document types. Documents conforming to a given schema are often referred to as instance documents of the schema.
This type safety provides a layer of protection that eliminates having to write complex error-checking code. However, validation may not be necessary in every environment. For example, when the same tool generates XML and reads it back in a short time span, validation may not be necessary. It is invaluable, though, during development. Sometimes document validation is used during development and turned off in production environments.
The DTD language is fairly simple. A DTD is primarily a
set of special tags that define each element in the document and, for
complex types, provide a list of the elements it may contain. The DTD
<!ELEMENT> tag consists of the
name of the tag and either a special keyword for the data type or a
parenthesized list of elements.
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Document (Head, Body)>
The special identifier #PCDATA (parsed
character data) indicates a string. When a list is provided, the
elements are expected to appear in that order. The list may contain
sublists, and alternatives may be specified using a vertical bar (|)
as an OR operator. Special notation can
also be used to indicate how many of each item may appear; two examples
of this notation are shown in Table 24-4.
Attributes of an element are defined with the <!ATTLIST> tag.
This tag enables the DTD to enforce rules about attributes. It accepts a
list of identifiers and a default value:
<!ATTLIST Animal animalClass (unknown | mammal | reptile) "unknown">
This ATTLIST says that the Animal element has an animalClass
attribute that can have one of several values (unknown, mammal, or
reptile). The default is unknown.
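To see an attribute default in action, here is a minimal sketch. It assumes a hypothetical one-element document (lowercase animal, not the zoo inventory file) with the DTD embedded as an internal subset so that the example is self-contained. The class and method names are made up for illustration. When the element omits animalClass, the DOM parser should supply the declared default.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class AttlistDefaultSketch {
    // Hypothetical document: the DTD lives in the DOCTYPE's internal subset.
    static final String XML =
        "<!DOCTYPE animal [\n"
      + "  <!ELEMENT animal (#PCDATA)>\n"
      + "  <!ATTLIST animal animalClass (unknown | mammal | reptile) \"unknown\">\n"
      + "]>\n"
      + "<animal>Tarzan</animal>";

    // Parse the text and return the animalClass attribute of the root element.
    static String animalClassOf(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getAttribute("animalClass");
    }

    public static void main(String[] args) throws Exception {
        // The attribute is not written in the document, so the default applies.
        System.out.println(animalClassOf(XML));
        // An explicit value overrides the default.
        String explicit =
            XML.replace("<animal>", "<animal animalClass=\"mammal\">");
        System.out.println(animalClassOf(explicit));
    }
}
```

Note that the default is filled in by the parser while reading the DTD; the document text itself is unchanged.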
We won’t cover everything you can do with DTDs here. But the following example will guarantee zooinventory.xml follows the format we’ve described. Place the following in a file called zooinventory.dtd (or grab this file from http://oreil.ly/Java_4E):
<!ELEMENT inventory (animal*)>
<!ELEMENT animal (name, species, habitat, (food | foodRecipe), temperament, weight)>
<!ATTLIST animal animalClass (unknown | mammal | reptile | bird | fish) "unknown">
<!ELEMENT name (#PCDATA)>
<!ELEMENT species (#PCDATA)>
<!ELEMENT habitat (#PCDATA)>
<!ELEMENT food (#PCDATA)>
<!ELEMENT weight (#PCDATA)>
<!ELEMENT foodRecipe (name, ingredient+)>
<!ELEMENT ingredient (#PCDATA)>
<!ELEMENT temperament (#PCDATA)>
The DTD says that an inventory consists of any number of animal
elements. An animal has a name, species, and habitat tag followed by
either a food or foodRecipe, and then a temperament and a weight.
foodRecipe’s structure is further defined later.
To use a DTD, we associate it with the XML document. We can do
this by placing a DOCTYPE declaration in
the XML document itself and allow the XML parser to recognize and
enforce it. The Java validation API that we’ll talk about in the next
section separates the roles of parsing and validation and can be used to
validate arbitrary XML against any kind of schema, including DTDs. The
problem is that out of the box, the validation API only implements the
(newer) XML schema syntax. So we’ll have to rely on the parser to
validate the DTD for us here.
In this case, when a validating parser encounters the DOCTYPE, it
attempts to load the DTD and validate the document. There are several
forms the DOCTYPE can have, but the one we’ll use is shown here. The
name after DOCTYPE must match the document’s root element, which our
DTD declares as inventory:
<!DOCTYPE inventory SYSTEM "zooinventory.dtd">
Both SAX and DOM parsers can automatically validate documents as
they read them, provided that the documents contain a DOCTYPE
declaration. However, you have to
explicitly ask the parser factory to provide a parser that is capable of
validation. To do this, just set the validating property of the parser
factory to true before you ask it for
an instance of the parser. For example:
...
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setValidating(true);
Again, this setValidating() method
is an older, more simplistic way to enable validation of documents that
contain DTD references and it is tied to the parser. The new validation
package that we’ll discuss later is independent of the parser and more
flexible. You should not use the parser-validating method in combination
with the new validation API unless you want to validate documents twice
for some reason.
Try inserting the setValidating() line in our model builder
example after the factory is created. Abuse the
zooinventory.xml file by adding or removing an
element or attribute and then see what happens when you run the example.
You should get useful error messages from the parser indicating the
problems and parsing should fail. To get more information about the
validation, we can register an org.xml.sax.ErrorHandler object with the
parser, but by default, Java installs one that simply prints the errors
for us.
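The following is a minimal sketch of parser-based DTD validation with a custom ErrorHandler. To stay self-contained, it assumes a cut-down, hypothetical version of our DTD placed in the document’s internal subset instead of in zooinventory.dtd, and it uses a DOM DocumentBuilderFactory. The class and method names are illustrative only. The handler collects validity problems in a list instead of printing them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class DtdValidationSketch {
    // Hypothetical, simplified DTD embedded as an internal subset.
    static final String DOCTYPE =
        "<!DOCTYPE inventory [\n"
      + "  <!ELEMENT inventory (animal*)>\n"
      + "  <!ELEMENT animal (name, species)>\n"
      + "  <!ELEMENT name (#PCDATA)>\n"
      + "  <!ELEMENT species (#PCDATA)>\n"
      + "]>\n";

    // Parse DOCTYPE + body with validation on; return the reported problems.
    static List<String> validate(String body) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // must be set before creating the parser
        DocumentBuilder builder = factory.newDocumentBuilder();

        List<String> problems = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) {
                problems.add("warning: " + e.getMessage());
            }
            public void error(SAXParseException e) {
                problems.add("error: " + e.getMessage());
            }
            public void fatalError(SAXParseException e) throws SAXParseException {
                throw e; // not well-formed: give up
            }
        });
        builder.parse(new InputSource(new StringReader(DOCTYPE + body)));
        return problems;
    }

    public static void main(String[] args) throws Exception {
        String good = "<inventory><animal><name>Tarzan</name>"
                    + "<species>Gorilla</species></animal></inventory>";
        String bad = "<inventory><animal>"
                   + "<species>Gorilla</species></animal></inventory>";
        System.out.println("good: " + validate(good));
        System.out.println("bad:  " + validate(bad));
    }
}
```

The exact wording of the messages depends on the parser implementation, so the sketch only distinguishes between an empty and a non-empty list.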
Although DTDs can define the basic structure of an XML document, they don’t provide a very rich vocabulary for describing the relationships between elements and say very little about their content. For example, there is no reasonable way with DTDs to specify that an element is to contain a numeric type or even to govern the length of string data. The XML Schema standard addresses both the structural and data content of an XML document. It is the next logical step and it (or one of the competing schema languages with similar capabilities) should replace DTDs in the future.
XML Schema brings the equivalent of strong typing to XML by drawing on many predefined primitive element types and allowing users to define new complex types of their own. These schemas even allow for types to be extended and used polymorphically, like types in the Java language. Although we can’t cover XML Schema in any detail, we’ll present the equivalent W3C XML Schema for our zooinventory.xml file here:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="inventory">
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs="unbounded" ref="animal"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="name" type="xs:string"/>
  <xs:element name="animal">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="name"/>
        <xs:element name="species" type="xs:string"/>
        <xs:element name="habitat" type="xs:string"/>
        <xs:choice>
          <xs:element name="food" type="xs:string"/>
          <xs:element ref="foodRecipe"/>
        </xs:choice>
        <xs:element name="temperament" type="xs:string"/>
        <xs:element name="weight" type="xs:double"/>
      </xs:sequence>
      <xs:attribute name="animalClass" default="unknown">
        <xs:simpleType>
          <xs:restriction base="xs:token">
            <xs:enumeration value="unknown"/>
            <xs:enumeration value="mammal"/>
            <xs:enumeration value="reptile"/>
            <xs:enumeration value="bird"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="foodRecipe">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="name"/>
        <xs:element maxOccurs="unbounded" name="ingredient" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
This schema would normally be placed into an XML Schema Definition
file, which has a .xsd extension. The first thing to
note is that this schema file is a normal, well-formed XML file that
uses elements from the W3C XML Schema namespace. In it, we use nested
element declarations to define the
elements that will appear in our document. As with most languages, there
is more than one way to accomplish this task. Here, we have broken out
the “complex” animal and foodRecipe elements into their own separate
element declarations and referred to them in their parent elements using
the ref attribute. In this
case, we did it mainly for readability; it would have been legal to have
one big, deeply nested element declaration starting at inventory.
However, referring to elements by
reference in this way also allows us to reuse the same element
declaration in multiple places in the document, if needed. Our name
element is a small example of this.
Although it didn’t do much for us here, we have broken out the name
element and referred to it for both the
animal/name and the foodRecipe/name. Breaking out name
like this would allow us to use more
advanced features of schema and write rules for what a name
can be (e.g., how long, what kind of
characters are allowed) in one place and reuse that “type” where
needed.
Control directives like sequence and choice
allow us to define the structure of the
child elements allowed and attributes like minOccurs and maxOccurs
let us specify cardinality (how many
instances). The sequence directive
says that the enclosed elements should appear in the specified order (if
they are required). The choice
directive allows us to specify alternative child elements like food
or foodRecipe. We declared the legal values for
our animalClass attribute using a
restriction declaration and enumeration tags.
Although we’ve not really exercised it here, the type attribute of our
elements touches on the standardization of types in XML Schema. All of
our “text” elements specify a type xs:string, which is a standard XML Schema
string type (kind of equivalent to PCDATA in our DTD). There are many
other standard types covering things such as dates, times, periods,
numbers, and even URLs. These are called simple types
(though some of them are not so simple) because they
are standardized or “built-in.” Table 24-5 lists W3C Schema simple types
and their corresponding Java types. The correspondence will become
useful later when we talk about JAXB and automated binding of XML to
Java classes.
Table 24-5. W3C Schema simple types
Schema element type | Java type | Example |
---|---|---|
For example, we have a floating-point weight element like this in our animal:
<weight>400.5</weight>
We can now validate it in our schema by inserting the following entry at the appropriate place:
<xs:element name="weight" type="xs:double"/>
In addition to enforcing that the content of elements matches these simple types, XML Schema can give us much more control over the text and values of elements in our document using simple rules and patterns analogous to regular expressions.
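As a rough illustration of both ideas, the sketch below validates a tiny animal document against an in-memory schema. The weight element uses the built-in xs:double type, and the name element is restricted with a pattern and a maximum length. The schema, the particular pattern, the length limit, and the class name are all assumptions made for this example; they are not part of our zoo inventory schema.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

public class SimpleTypeSketch {
    // Hypothetical schema: name must be a capitalized word of at most
    // 20 characters; weight must be a valid xs:double.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + " <xs:element name='animal'>"
      + "  <xs:complexType><xs:sequence>"
      + "   <xs:element name='name'>"
      + "    <xs:simpleType>"
      + "     <xs:restriction base='xs:string'>"
      + "      <xs:pattern value='[A-Z][a-z]*'/>"
      + "      <xs:maxLength value='20'/>"
      + "     </xs:restriction>"
      + "    </xs:simpleType>"
      + "   </xs:element>"
      + "   <xs:element name='weight' type='xs:double'/>"
      + "  </xs:sequence></xs:complexType>"
      + " </xs:element>"
      + "</xs:schema>";

    // Returns true if the XML text validates against XSD. With no error
    // handler registered, the first validity error is thrown as a SAXException.
    static boolean isValid(String xml) throws Exception {
        Schema schema = SchemaFactory
            .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(XSD)));
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid(
            "<animal><name>Tarzan</name><weight>400.5</weight></animal>"));
        System.out.println(isValid(
            "<animal><name>Tarzan</name><weight>heavy</weight></animal>"));
        System.out.println(isValid(
            "<animal><name>tarzan</name><weight>400.5</weight></animal>"));
    }
}
```

Only the first document should validate: the second has a non-numeric weight and the third has a name that does not match the pattern.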
In addition to the predefined simple types listed in Table 24-5, we can
define our own complex types in our schema. Complex types are
element types that have internal structure and possibly child
elements. Our inventory, animal, and foodRecipe
elements are all complex types
and their content must be declared with the complexType tag in
our schema. Complex type definitions can be reused, similar to the way
that element definitions can be reused in our schema; that is, we can
break out a complex type definition and give it a name. We can then
refer to that type by name in the type attributes of
other elements. Because all of our complex types were only used once
in their corresponding elements, we didn’t give them names. They were
considered anonymous type definitions, declared
and used in the same spot. For example, we could have separated our
animal’s type from its element
declaration, like so:
<xs:element name="inventory">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="animal" maxOccurs="unbounded" type="AnimalType"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
<xs:complexType name="AnimalType">
  <xs:sequence>
    <xs:element ref="name"/>
    <xs:element name="species" type="xs:string"/>
    <xs:element name="habitat" type="xs:string"/>
    ...
Declaring the AnimalType
separately from the instance of the animal
element declaration would allow us to
have other, differently named elements with the same structure. For
example, our inventory element may
hold another element, mainAttraction, which is a type of animal
with a different tag name.
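A minimal sketch of that idea follows. It assumes a simplified AnimalType with only name and species children, and an inventory that holds one mainAttraction followed by any number of animal elements; both choices, along with the class name, are illustrative and not taken from our real schema. The two differently named elements share a single named type.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

public class NamedTypeSketch {
    // Hypothetical schema: mainAttraction and animal both use AnimalType.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + " <xs:element name='inventory'>"
      + "  <xs:complexType><xs:sequence>"
      + "   <xs:element name='mainAttraction' type='AnimalType'/>"
      + "   <xs:element name='animal' type='AnimalType'"
      + "               minOccurs='0' maxOccurs='unbounded'/>"
      + "  </xs:sequence></xs:complexType>"
      + " </xs:element>"
      + " <xs:complexType name='AnimalType'>"
      + "  <xs:sequence>"
      + "   <xs:element name='name' type='xs:string'/>"
      + "   <xs:element name='species' type='xs:string'/>"
      + "  </xs:sequence>"
      + " </xs:complexType>"
      + "</xs:schema>";

    // Returns true if the XML text validates against XSD.
    static boolean isValid(String xml) throws Exception {
        Schema schema = SchemaFactory
            .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(XSD)));
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String good =
            "<inventory>"
          + "<mainAttraction><name>Tarzan</name><species>Gorilla</species></mainAttraction>"
          + "<animal><name>Cheetah</name><species>Chimpanzee</species></animal>"
          + "</inventory>";
        // Same structure rule applies to mainAttraction: species is required.
        String bad =
            "<inventory>"
          + "<mainAttraction><name>Tarzan</name></mainAttraction>"
          + "</inventory>";
        System.out.println(isValid(good));
        System.out.println(isValid(bad));
    }
}
```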
There’s a lot more to say about W3C XML Schema, and schemas can get quite a bit more complex than our simple example. However, you can do a lot with the few pieces we’ve shown. Some tools are available to help you get started. We’ll talk about one called Trang in a moment. For more information about XML Schema, see the W3C’s site or XML Schema by Eric van der Vlist (O’Reilly). In the next section, we’ll show how to validate a file or DOM model against the XML Schema we’ve just created, using the new validation API.
Many tools can help you write XML Schema. One helpful tool is called Trang. It is part of an alternative schema language project called RELAX NG (which we mention later in this chapter), but Trang is very useful in and of itself. It is an open source tool that can not only convert between DTDs and XML Schema, but also create a rough DTD or XML Schema by reading an “example” XML document. This is a great way to sketch out a basic, starting schema for your documents.
To use our example’s XML schema, we need to exercise the
new javax.xml.validation
API. As we said earlier, the validation API is an alternative to the
simple, parser-based validation supported through the setValidating() method
of the parser factories. To use the validation package, we create an
instance of a SchemaFactory,
specifying the schema language. We can then validate a DOM or stream
source against the schema.
The following example, Validate, is in the form of a simple
command-line utility that you can use to test out your XML and schemas.
Just give it the XML filename and an XML Schema file
(.xsd file) as arguments:
import javax.xml.XMLConstants;
import javax.xml.validation.*;
import org.xml.sax.*;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;

public class Validate
{
    public static void main( String [] args ) throws Exception
    {
        if ( args.length != 2 ) {
            System.err.println("usage: Validate xmlfile.xml xsdfile.xsd");
            System.exit(1);
        }
        String xmlfile = args[0], xsdfile = args[1];

        // Create a factory for the W3C XML Schema language and load the schema
        SchemaFactory factory =
            SchemaFactory.newInstance( XMLConstants.W3C_XML_SCHEMA_NS_URI );
        Schema schema = factory.newSchema( new StreamSource( xsdfile ) );
        Validator validator = schema.newValidator();

        // Print every problem the validator reports
        ErrorHandler errHandler = new ErrorHandler() {
            public void error( SAXParseException e ) { System.out.println(e); }
            public void fatalError( SAXParseException e ) { System.out.println(e); }
            public void warning( SAXParseException e ) { System.out.println(e); }
        };
        validator.setErrorHandler( errHandler );

        try {
            // Validate the XML file named on the command line
            validator.validate( new SAXSource( new InputSource( xmlfile ) ) );
        } catch ( SAXException e ) {
            // Invalid Document, no error handler
        }
    }
}
The schema types supported initially are listed as constants in
the XMLConstants class. Right now,
only W3C XML Schema is implemented and there is also another intriguing
type in there that we’ll mention later. Our validation example follows
the pattern we’ve seen before, creating a factory, then a Schema
instance. The Schema represents the grammar and can
create Validator instances
that do the work of checking the document structure. Here, we’ve called
the validate() method on a SAXSource, which comes
from our file, but we could just as well have used a DOMSource
to check an in-memory DOM representation:
validator.validate( new DOMSource( document ) );
Any errors encountered will cause the validate method to throw a
SAXException, but this
is just a coarse means of detecting errors. More generally, and as we’ve
shown in this example, we’d want to register an ErrorHandler
object with the validator. The error handler can be told about
many errors in the document and convey more information. When the error
handler is present, the exceptions are given to it and not thrown from
the validate method.
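The sketch below combines the two points above: it validates an in-memory DOM through a DOMSource and registers an ErrorHandler that collects every reported problem instead of stopping at the first one. The tiny schema, the sample documents, and the class name are assumptions for illustration. One detail worth noting is that the DOM is built with a namespace-aware parser, which schema validation of a DOM tree generally requires.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class CollectErrorsSketch {
    // Hypothetical schema: an inventory of animals, each with a numeric weight.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + " <xs:element name='inventory'>"
      + "  <xs:complexType><xs:sequence>"
      + "   <xs:element name='animal' maxOccurs='unbounded'>"
      + "    <xs:complexType><xs:sequence>"
      + "     <xs:element name='weight' type='xs:double'/>"
      + "    </xs:sequence></xs:complexType>"
      + "   </xs:element>"
      + "  </xs:sequence></xs:complexType>"
      + " </xs:element>"
      + "</xs:schema>";

    // Build a DOM from the text, validate it, and return all reported problems.
    static List<String> validate(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // needed for schema validation of a DOM
        Document document =
            dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        Validator validator = SchemaFactory
            .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(XSD)))
            .newValidator();

        List<String> problems = new ArrayList<>();
        validator.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { problems.add(e.getMessage()); }
            public void error(SAXParseException e) { problems.add(e.getMessage()); }
            public void fatalError(SAXParseException e) throws SAXParseException {
                throw e; // unrecoverable: let the caller see it
            }
        });
        validator.validate(new DOMSource(document));
        return problems;
    }

    public static void main(String[] args) throws Exception {
        String bad = "<inventory>"
                   + "<animal><weight>heavy</weight></animal>"
                   + "<animal><weight>light</weight></animal>"
                   + "</inventory>";
        // Both bad weights are reported because the handler does not throw.
        for (String problem : validate(bad)) {
            System.out.println(problem);
        }
    }
}
```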
The errors generated by these parsers can be a bit cryptic. In some cases, the errors may not be able to report line numbers because the validation is not necessarily being done against a stream.
In addition to DTDs and W3C XML Schema, several other
popular schema languages are being used today. One interesting
alternative that is tantalizingly referenced in the XMLConstants class is
called RELAX NG. This schema language offers the most widely used
features of XML Schema in a more human-readable format. In fact, it
offers both a very compact, non-XML syntax and a regular XML-based
syntax. RELAX NG doesn’t offer the same text pattern and value
validation that W3C XML Schema does. Instead, these aspects of
validation are left to other tools (many people consider this to be
“business logic,” more appropriately implemented outside of the schema
anyway). If you are interested in exploring other schema languages, be
sure to check out RELAX NG and its useful schema conversion utility,
Trang.