DTDs
The original XML document model is the Document Type Definition (DTD). DTDs actually predate XML; they are a reduced hand-me-down from SGML with the core syntax almost completely intact. The following describes how a DTD defines a document type.
A DTD declares a set of allowed elements. You cannot use any element names other than those in this set. Think of this as the “vocabulary” of the language.
A DTD defines a content model for each element. The content model is a pattern that tells what elements or data can go inside an element, in what order, in what number, and whether they are required or optional. Think of this as the “grammar” of the language.
A DTD declares a set of allowed attributes for each element. Each attribute declaration defines the name, datatype, default values (if any), and behavior (e.g., if it is required or optional) of the attribute.
A DTD provides a variety of mechanisms to make managing the model easier, for example, the use of parameter entities and the ability to import pieces of the model from an external file.
Document Prolog
According to the XML Recommendation, all external parsed entities (including DTDs) should begin with a text declaration. It looks like an XML declaration except that it explicitly excludes the standalone property. If you need to specify a character set other than the default UTF-8 (see Chapter 9 for more about character sets), or to change the XML version number from the default 1.0, this is where you would do it.
Tip
If you specify a character set in the DTD, it won’t automatically carry over into XML documents that use the DTD. XML documents have to specify their own encodings in their document prologs.
After the text declaration, the resemblance to normal document prologs ends. External parsed entities, including DTDs, must not contain a document type declaration.
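As a rough sketch of the points above, an external DTD file stored in Latin-1 might open as follows. The filename censusml.dtd and the ISO-8859-1 encoding are assumptions chosen for illustration; the text declaration on the first line is the only part that matters here.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- censusml.dtd (hypothetical filename): an external DTD subset -->
<!-- The first line is a text declaration: it has no standalone property, -->
<!-- and no document type declaration follows it. -->
<!ELEMENT census-record (date, address, person+)>
```

A document that uses this DTD would still declare its own encoding in its own prolog.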
Declarations
A DTD is a set of rules or declarations. Each declaration adds a new element, set of attributes, entity, or notation to the language you are describing. DTDs can be combined using parameter entities, a technique called modularization. You can also add declarations inside the internal subset of the document.
The order of the declarations is important in two situations. First, if there are redundant entity declarations, the first one that appears takes precedence and all others are ignored.[2] This is important to know if you are going to override declarations, either in the internal subset or by cascading DTDs. Second, if parameter entities are used in declarations, they must be declared before they are used as references.
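Both ordering rules can be seen in a small sketch. The names memo.dtd, text.mix, and company are hypothetical and exist only for this illustration. First, the external subset, where the parameter entity is declared before the declaration that references it:

```xml
<!-- memo.dtd (hypothetical external subset) -->
<!ENTITY % text.mix "#PCDATA | emphasis">   <!-- declared before use -->
<!ELEMENT memo     (%text.mix;)*>
<!ELEMENT emphasis (#PCDATA)>
<!ENTITY company "Generic Corp.">           <!-- fallback value -->
```

Then a document whose internal subset declares the same general entity:

```xml
<?xml version="1.0"?>
<!DOCTYPE memo SYSTEM "memo.dtd" [
  <!-- The internal subset is read first, so this declaration wins -->
  <!ENTITY company "Acme Widgets">
]>
<memo>Welcome to &company;.</memo>
```

Because the first declaration takes precedence, the reference in the memo should expand to the value given in the internal subset, and the later declaration in memo.dtd is ignored.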
Declaration syntax is flexible when it comes to whitespace. You can add extra space anywhere except in the string of characters at the beginning that identifies the declaration type.
For example, these are all acceptable:
<!ELEMENT

   thingie

   ANY>
<!ELEMENT thingie ANY>
<!ELEMENT thingie       (    foo  |
                             bar  |
                             zap     )*>
An Example
Imagine a scenario where you are collecting information from a group of people. The data you receive will be fed to a program that will process it and store it in a database. You need a quick way to determine whether all the required information is there before you can accept a submission. For this, we will use a DTD.
The information in this example will be census data. Your staff is roaming around the neighborhood interviewing families and entering data on their laptop computers. They are using an XML editor configured with a DTD that you’ve created to model your language, CensusML. Later, they will upload all the CensusML documents to the central repository to be processed overnight.
Example 4-1 shows how a typical valid CensusML document should look, minus the document prolog. A document represents an interview with one family. It contains a date, an address, and a list of people residing there. For each person, we are interested in taking their full name, age, employment status, and gender. We also use identification numbers for people, to ensure that we don’t accidentally enter in somebody more than once.
<census-record taker="3163">
  <date><year>2003</year><month>10</month><day>11</day></date>
  <address>
    <street>471 Skipstone Lane <unit>4-A</unit></street>
    <city>Emerald City</city>
    <county>Greenhill</county>
    <country>Oz</country>
    <postalcode>885JKL</postalcode>
  </address>
  <person employed="fulltime" pid="P270405">
    <name>
      <first>Meeble</first>
      <last>Bigbug</last>
      <junior/>
    </name>
    <age>39</age>
    <gender>male</gender>
  </person>
  <person employed="parttime" pid="P273882">
    <name>
      <first>Mable</first>
      <last>Bigbug</last>
    </name>
    <age>36</age>
    <gender>female</gender>
  </person>
  <person pid="P472891">
    <name>
      <first>Marble</first>
      <last>Bigbug</last>
    </name>
    <age>11</age>
    <gender>male</gender>
  </person>
</census-record>
Let’s start putting together a DTD. The first declaration is for the document element:
<!ELEMENT census-record (date, address, person+)>
This establishes the first rules for the CensusML language: (1) there is an element named census-record and (2) it must contain one date element, one address element, and at least one person element, in that order. If you leave any of these elements out, or put them in a different order, the document will be invalid.
Note that the declaration doesn’t actually specify that the census-record must be used as the document element. In fact, a DTD can’t single out any element to be the root of a document. You might view this as a bad thing, since you can’t stop someone from submitting an incomplete document containing only a person element and nothing else. On the other hand, you could see it as a feature, where DTDs can contain more than one model for a document. For example, DocBook relies on this to support many different models for documents; a book would use the book element as its root, while an article would use the article element. In any case, be aware of this loophole.
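To make the loophole concrete, here is a sketch of a document that a validating parser should accept once the rest of the CensusML DTD (developed below) is in place. The filename censusml.dtd is an assumption. The only root-related check is that the name in the document type declaration matches the document element, and the document's author chooses that name.

```xml
<?xml version="1.0"?>
<!DOCTYPE person SYSTEM "censusml.dtd">
<person pid="P270405">
  <name><first>Meeble</first><last>Bigbug</last></name>
  <age>39</age>
  <gender>male</gender>
</person>
```

If only complete census records are acceptable, the receiving program has to check the name of the document element itself.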
Now we should declare the attributes for this element. There is only one, taker, identifying the census taker who authored this document. Its type is CDATA (character data). We will make it required, because it’s important to know who is submitting the data just to make sure no mischievous people submit fraudulent records. Here is the attribute list for census-record:
<!ATTLIST census-record taker CDATA #REQUIRED>
Next declare the date element. The order of element declarations doesn’t really matter. All the declarations are read into the parser’s memory before any validation takes place, so all that is necessary is that every element is accounted for. But I like things organized and in approximately the order in which they appear in a document, so here’s the next set of declarations:
<!ELEMENT date  (year, month, day)>
<!ELEMENT year  (#PCDATA)>
<!ELEMENT month (#PCDATA)>
<!ELEMENT day   (#PCDATA)>
The #PCDATA literal represents character data. Specifically, it matches zero or more characters. Any element with a content model of (#PCDATA) can contain character data but not elements. So the elements year, month, and day are what you might call data fields. The date element, in contrast, must contain elements, but not character data.[3]
Now for the address bit. address is a container of elements just like date. For the most part, its subelements are plain data fields (their content is character data only), but one element has mixed content: street. Here are the declarations:
<!ELEMENT address    (street, city, county, country, postalcode)>
<!ELEMENT street     (#PCDATA | unit)*>
<!ELEMENT city       (#PCDATA)>
<!ELEMENT county     (#PCDATA)>
<!ELEMENT country    (#PCDATA)>
<!ELEMENT postalcode (#PCDATA)>
<!ELEMENT unit       (#PCDATA)>
The declaration for street follows the pattern used by all mixed-content elements. The #PCDATA must come first followed by all the allowed subelements separated by vertical bars (|). The asterisk (*) here is required. It means that there can be zero or more of whatever comes before it. The upshot is that character data is optional, along with all the elements that can be interspersed within it.
Alas, there is no way to require that an element with mixed content contains character data. The census taker could just leave the street element blank and the validating parser would be happy with that. Changing that asterisk (*) to a plus (+) to require some character data is not allowed. To make validation simple and fast, DTDs never concern themselves with the actual details of character data.
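The following instances illustrate how permissive the mixed-content model (#PCDATA | unit)* is. The sample values other than the first are invented for illustration; each line should validate on its own as a street element.

```xml
<!-- text followed by a subelement, as in Example 4-1 -->
<street>471 Skipstone Lane <unit>4-A</unit></street>

<!-- text only -->
<street>12 Main Street</street>

<!-- subelements only, repeated, with no text at all -->
<street><unit>B</unit><unit>C</unit></street>

<!-- completely empty: still valid, which is the weakness described above -->
<street></street>
```

Catching the last case is a job for the program that processes the data, since the DTD cannot express it.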
Our final task is to declare the elements and attributes making up a person. Here is a crack at the element declarations:
<!ELEMENT person (name, age, gender)>
<!ELEMENT name   (first, last, (junior | senior)?)>
<!ELEMENT age    (#PCDATA)>
<!ELEMENT gender (#PCDATA)>
<!ELEMENT first  (#PCDATA)>
<!ELEMENT last   (#PCDATA)>
<!ELEMENT junior EMPTY>
<!ELEMENT senior EMPTY>
<!ATTLIST person
    pid       ID                     #REQUIRED
    employed  (fulltime|parttime)    #IMPLIED>
The content model is a little more complex for this container. The first and last names are required, but there is an option to follow these with a qualifier (“Junior” or “Senior”). The qualifiers are declared as empty elements here using the keyword EMPTY and the question mark makes them optional, as not everyone is a junior or senior. Perhaps it would be just as easy to make an attribute called qualifier with values junior or senior, but I decided to do it this way to show you how to declare empty elements. Also, using an element makes the markup less cluttered, and we already have two attributes in the container element.
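A few instances show what the content model (first, last, (junior | senior)?) accepts and rejects. The rejected forms are written inside comments so that the fragment stays readable as markup.

```xml
<!-- Valid: the qualifier is optional, and at most one may appear -->
<name><first>Meeble</first><last>Bigbug</last></name>
<name><first>Meeble</first><last>Bigbug</last><junior/></name>
<name><first>Meeble</first><last>Bigbug</last><senior/></name>

<!-- Invalid: both qualifiers at once
     <name><first>Meeble</first><last>Bigbug</last><junior/><senior/></name> -->

<!-- Invalid: first and last in the wrong order
     <name><last>Bigbug</last><first>Meeble</first></name> -->
```

For comparison, the attribute-based alternative mentioned above might be declared like this. It is only a sketch of the road not taken, not part of CensusML:

```xml
<!ATTLIST name qualifier (junior | senior) #IMPLIED>
```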
The first attribute declared is a required pid, a person identification string. Its type is ID, which to validating parsers means that it is a unique identifier within the scope of the document. No other element can have an ID-type attribute with that value. This means that if the census taker accidentally puts in a person twice, the parser will catch the error and report the document invalid. The parser can only check within the scope of the document, however, so there is nothing to stop a census taker from entering the same person in another document.
ID-type attributes have another limitation. There is one identifier-space for all of them, so even if you want to use them in different ways, such as having an identifier for the address and another for people, you can’t use the same string in both element types. A solution to this might be to prefix the identifier string with a code like “HOME-38225” for address and “PID-489294” for person, effectively creating your own separate identifier spaces. Note that the values of ID-type attributes must always begin with a letter or underscore, like XML element and attribute names.
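The prefixing idea could look like the sketch below. The aid attribute on address is hypothetical; CensusML as developed in this chapter does not declare it, so it would need one extra declaration:

```xml
<!-- Hypothetical addition to the DTD: an ID-type attribute for address -->
<!ATTLIST address aid ID #IMPLIED>
```

With that in place, the prefixes keep the two kinds of identifier from colliding in the single identifier space:

```xml
<census-record taker="3163">
  <date><year>2003</year><month>10</month><day>11</day></date>
  <address aid="HOME-38225">
    <street>471 Skipstone Lane</street>
    <city>Emerald City</city>
    <county>Greenhill</county>
    <country>Oz</country>
    <postalcode>885JKL</postalcode>
  </address>
  <person pid="PID-489294">
    <name><first>Meeble</first><last>Bigbug</last></name>
    <age>39</age>
    <gender>male</gender>
  </person>
</census-record>
```

The prefixes also satisfy the rule that an ID value cannot start with a digit, so a bare number such as 38225 would be rejected.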
The other attribute, employed, is optional as denoted by the #IMPLIED keyword. It’s also an enumerated type, meaning that there is a set of allowed values (fulltime and parttime). Setting the attribute to anything else would result in a validation error.
Example 4-2 shows the complete DTD.
<!-- Census Markup Language
     (use <census-record> as the document element)
-->
<!ELEMENT census-record (date, address, person+)>
<!ATTLIST census-record
    taker     CDATA     #REQUIRED>

<!-- date the info was collected -->
<!ELEMENT date  (year, month, day)>
<!ELEMENT year  (#PCDATA)>
<!ELEMENT month (#PCDATA)>
<!ELEMENT day   (#PCDATA)>

<!-- address information -->
<!ELEMENT address    (street, city, county, country, postalcode)>
<!ELEMENT street     (#PCDATA | unit)*>
<!ELEMENT city       (#PCDATA)>
<!ELEMENT county     (#PCDATA)>
<!ELEMENT country    (#PCDATA)>
<!ELEMENT postalcode (#PCDATA)>
<!ELEMENT unit       (#PCDATA)>

<!-- person information -->
<!ELEMENT person (name, age, gender)>
<!ELEMENT name   (first, last, (junior | senior)?)>
<!ELEMENT age    (#PCDATA)>
<!ELEMENT gender (#PCDATA)>
<!ELEMENT first  (#PCDATA)>
<!ELEMENT last   (#PCDATA)>
<!ELEMENT junior EMPTY>
<!ELEMENT senior EMPTY>
<!ATTLIST person
    pid       ID                     #REQUIRED
    employed  (fulltime|parttime)    #IMPLIED>
Tips for Designing and Customizing DTDs
DTD design and construction is part science and part art form. The basic concepts are easy enough, but managing a large DTD—maintaining hundreds of element and attribute declarations while keeping them readable and bug-free—can be a challenge. This section offers a collection of hints and best practices that you may find useful. The next section shows a concrete example that uses these practices.
Keeping it organized
DTDs are notoriously hard to read, but good organization always helps. A few extra minutes spent tidying up and writing comments can save you hours of scrutinizing later. Often a DTD is its own documentation, so if you expect others to use it, clean code is doubly important.
- Organizing declarations by function
Keep declarations separated into sections by their purpose. In small DTDs, this helps you navigate the file. In larger DTDs, you might even want to break the declarations into separate modules. Some categories to group by are blocks, inlines, hierarchical elements, parts of tables, lists, etc. In Example 4-4, the declarations are divided by function (block, inline, and hierarchical).
- Whitespace
Pad your declarations with lots of whitespace. Content models and attribute lists suffer from dense syntax, so spacing out the parts, even placing them on separate lines, helps make them more understandable. Indent lines inside declarations to make the delimiters more clear. Between logical divisions, use extra space and perhaps a comment with a row of dark characters to add separation. When you quickly scroll through the file, you will find it is much easier to navigate.
- Comments
Use comments liberally—they are signposts in a wilderness of declarations. First, place a comment at the top of each file that explains the purpose of the DTD or module, gives the version number, and provides contact information. If it is a customized frontend to a public DTD, be sure to mention the original that it is based on, give credit to the authors, and explain the changes that you made. Next, label each section and subsection of the DTD.
Anywhere a comment might help to clarify the use of the DTD or explain your decisions, add one. As you modify the DTD, add new comments describing your changes. Comments are part of documentation, and unclear or outdated documentation can be worse than useless.
- Version tracking
As with software, your DTD is likely to be updated as your requirements change. You should keep track of versions by numbering them; to avoid confusion, it’s important to change the version number whenever you make a change to the DTD. By convention, the first complete public release is 1.0. After that, small changes earn decimal increments: 1.1, 1.2, etc. Major changes increment by whole numbers: 2.0, 3.0, etc. Document the changes from version to version. Revision control systems are available to automate this process. On Unix-based systems, the RCS and CVS packages have both been the trusted friends of developers for years.
- Parameter entities
Parameter entities can hold recurring parts of declarations and allow you to edit them in one place. In the external subset, they can be used in element-type declarations to hold element groups and content models, or in attribute list declarations to hold attribute definitions. The internal subset is a little stricter; parameter entities can hold only complete declarations, not fragments.
For example, assume you want every element to have an optional ID attribute for linking and an optional class attribute to assign specific role information. Parameter entities, which apply only in DTDs, look much like ordinary general entities, but have an extra % in the declaration. You can declare a parameter entity to hold common attributes like this:
<!ENTITY % common.atts "
    id     ID     #IMPLIED
    class  CDATA  #IMPLIED"
>
That entity can then be used in attribute list declarations:
<!ATTLIST foo %common.atts;>
<!ATTLIST bar %common.atts;
    extra  CDATA  #FIXED "blah"
>
Note that parameter entity references start with % rather than &.
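The organizational advice in this section can be pulled together in one small module. Everything in this sketch is hypothetical (the filename inventory.mod, the contact address, the element and attribute names); it only aims to show a header comment with a version number, section dividers, padded declarations, and a shared parameter entity working together.

```xml
<!--
    inventory.mod (hypothetical module), version 1.1
    Purpose:  declarations for stock items
    Contact:  dtd-maintainer@example.com
    Changes:  1.1 added the optional unit attribute to quantity
-->

<!-- ============================================================ -->
<!--                     Parameter entities                       -->
<!-- ============================================================ -->

<!ENTITY % common.atts "
    id     ID     #IMPLIED
    class  CDATA  #IMPLIED"
>

<!-- ============================================================ -->
<!--                     Hierarchical elements                    -->
<!-- ============================================================ -->

<!-- one kind of item held in stock -->
<!ELEMENT item
    ( name,
      quantity,
      note? )
>
<!ATTLIST item
    %common.atts;
    sku    NMTOKEN  #REQUIRED
>

<!-- ============================================================ -->
<!--                     Data fields                              -->
<!-- ============================================================ -->

<!ELEMENT name      (#PCDATA)>
<!ELEMENT quantity  (#PCDATA)>
<!ATTLIST quantity
    unit   CDATA    #IMPLIED
>
<!ELEMENT note      (#PCDATA)>
```

Because the attribute list uses a parameter entity reference inside a declaration, this module is meant to be loaded as part of an external subset, not pasted into an internal subset.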
Attributes versus elements
Making a DTD from scratch is not easy. You have to break your information down into its conceptual atoms and package it as a hierarchical structure, but it’s not always clear how to divide the information. The book model is easy, because it breaks down readily into hierarchical containers such as chapters, sections, and paragraphs. Less obvious are the models for equations, molecules, and databases. For such applications, it takes a supple mind to chop up documents into the optimal mix of elements and attributes. These tips are principles that can help you design DTDs:
Choose names that make sense. If your document is composed exclusively of elements like thing, object, and chunk, it’s going to be nearly impossible to figure out what’s what. Names should closely match the logical purpose of an element. It’s better to create specific elements for different tasks than to overload a few elements to handle many different situations. For example, the DIV and SPAN HTML elements aren’t ideal because they serve many different roles.
Hierarchy adds information. A newspaper has articles that contain paragraphs and heads. Containers create boundaries to make it easier to write stylesheets and processing applications. And they have an implied ownership that provides convenient handles and navigation aids for processors. Containers add depth, another dimension to increase the amount of structure.
Strive for a tree structure that resembles a wide, bushy shrub. If you go too deep, the markup begins to overwhelm the content and it becomes harder to edit a document; too shallow and the information content is diluted. Think of documents and their parts as nested boxes. A big box filled with a million tiny boxes is much harder to work with than a box with a few medium boxes, and smaller boxes inside those, and so on.
Know when to use elements over attributes. An element holds content that is part of your document. An attribute modifies the behavior of an element. The trick is to find a balance between using general elements with attributes to specify purpose and creating an element for every single contingency.
Modularization
There are advantages to splitting a monolithic DTD into smaller components, or modules. The first benefit is that a modularized DTD can be easier to maintain, for reasons of organization mentioned earlier and because parts can be edited separately or “turned off” for debugging purposes. Also, the DTD becomes configurable. Modules in separate files can be swapped with others as easily as redefining a single parameter entity. Even within the same file, they can be marked for inclusion or exclusion.
XML provides two ways to modularize your DTD. The first is to store parts in separate files, then import them with external parameter entities. The second is to use a syntactic device called a conditional section. Both are powerful ways to make a DTD more flexible.
Importing modules from external sources
A DTD does not have to be stored in a single file. In fact, it often makes sense to store it in multiple files. You may wish to borrow from someone else, importing their DTD into your own as a subset. Or you may just want to make the DTD a little neater by separating pieces into different files.
To import whole DTDs or parts of DTDs, use an external parameter entity. Here is an example of a complete DTD that imports its pieces from various modules:
<!ELEMENT catalog (title, metadata, front, entries+)>

<!ENTITY % basic.stuff SYSTEM "basics.mod">
%basic.stuff;

<!ENTITY % front.matter SYSTEM "front.mod">
%front.matter;

<!ENTITY % metadata PUBLIC "-//Standards Stuff//DTD Metadata v3.2//EN"
    "http://www.standards-stuff.org/dtds/metadata.dtd">
%metadata;
This DTD has two local components, which are specified by system identifiers. Each component has a .mod filename extension, which is a traditional way to show that a file contains declarations but should not be used as a DTD on its own. The last component is a DTD that can stand on its own; in fact, in this example, it’s a public resource.
There is one potential problem with importing DTD text. An external parameter entity imports all the text in a file, not just a part of it. You get all the declarations, not just a few select ones. Worse, there is no concept of local scope, in which declarations in the local DTD automatically override those in the imported file. The declarations are assembled into one logical entity, and any information about what was imported from where is lost before the DTD is parsed.
There are a few ways to get around this problem. You can override entity declarations by redeclaring them or, to be more precise, predeclaring them. In other words, if an entity is declared more than once, the first declaration will take precedence. So you can override any entity declaration with a declaration in the internal subset of your document, since the internal subset is read before the external subset.
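The same predeclaring trick works inside a DTD that imports a module, as long as the overriding declaration appears before the import. In this sketch the parameter entity para.content and the contents of basics.mod are assumptions made up for illustration:

```xml
<!-- catalog.dtd: a local driver file (all names here are hypothetical) -->

<!-- Predeclare the entity before the import so that this binding wins -->
<!ENTITY % para.content "(#PCDATA | emphasis)*">

<!ENTITY % basic.stuff SYSTEM "basics.mod">
%basic.stuff;

<!-- basics.mod is assumed to contain something like:
       <!ENTITY % para.content "(#PCDATA)">
       <!ELEMENT para %para.content;>
       <!ELEMENT emphasis (#PCDATA)>
     Its own declaration of para.content is now ignored, so para
     picks up the mixed-content model declared above. -->
```

This only works when the module's author routed the content model through a parameter entity in the first place, which is one reason modular DTDs use so many of them.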
Overriding an element declaration is more difficult. It is a validity error to declare an element more than once. (You can make multiple ATTLIST declarations for the same element; they are merged, and if the same attribute is declared more than once, the first declaration is accepted as the right one.) So, the question is, how can you override a declaration such as this:
<!ELEMENT polyhedron (side+, angle+)>
with a declaration of your own like this:
<!ELEMENT polyhedron (side, side, side+, angle, angle, angle+)>
Overriding element and attribute declarations is not possible with what you know so far. I need to introduce you to a new syntactic construct called the conditional section.
Conditional sections
A conditional section is a special form of markup used in a DTD to mark a region of text for inclusion or exclusion in the DTD.[4] If you anticipate that a piece of your DTD may someday be an unwanted option, you can make it a conditional section and let the end user decide whether to keep it or not. Note that conditional sections can be used only in external subsets, not internal subsets.
Conditional sections look similar to CDATA sections. They use the square bracket delimiters, but the CDATA keyword is replaced with either INCLUDE or IGNORE. The syntax is like this:
<![switch[DTD text]]>
where switch is like an on/off switch, activating the DTD text if its value is INCLUDE, or marking it inactive if it’s set to IGNORE.
For example:
<![INCLUDE[
<!-- these declarations will be included -->
<!ELEMENT foo (bar, caz, bub?)>
<!ATTLIST foo crud CDATA #IMPLIED>
]]>
<![IGNORE[
<!-- these declarations will be ignored -->
<!ELEMENT blah (#PCDATA)>
<!ELEMENT glop (flub | zuc)>
]]>
Using the hardcoded literals INCLUDE and IGNORE isn’t all that useful, since you have to edit each conditional section manually to flip the switch. Usually, the switch is a parameter entity, which can be defined anywhere:
<!ENTITY % optional.stuff "INCLUDE">
<![%optional.stuff;[
<!-- these declarations may or may not be included -->
<!ELEMENT foo (bar, caz, bub?)>
<!ATTLIST foo crud CDATA #IMPLIED>
]]>
Because the parameter entity optional.stuff is defined with the keyword INCLUDE, the declarations in the marked section will be used. If optional.stuff had been defined to be IGNORE, the declarations would have been ignored in the document.
This technique is especially powerful when you declare the entity inside a document subset. In the next example, our DTD declares a general entity that is called disclaimer. The actual value of the entity depends on whether use-disclaimer has been set to INCLUDE:
<![%use-disclaimer;[
  <!ENTITY disclaimer "<p>This is Beta software. We can't promise it is free of bugs.</p>">
]]>
<!ENTITY disclaimer "">
In each document, it’s a simple step to choose whether the disclaimer appears by declaring the switching entity in the internal subset:
<?xml version="1.0"?>
<!DOCTYPE manual SYSTEM "manual.dtd" [
  <!ENTITY % use-disclaimer "IGNORE">
]>
<manual>
  <title>User Guide for Techno-Wuzzy</title>
  &disclaimer;
  ...
In this example, the entity use-disclaimer is set to IGNORE, so the disclaimer is declared as an empty string and the document’s text will not contain a disclaimer. This is a simple example of customizing a DTD using conditional sections and parameter entities.
Now, returning to our previous problem of overriding element or attribute declarations, here is how to do it with conditional sections. First, the DTD must be written to allow parameter entity switching:
<!ENTITY % default.polyhedron "INCLUDE">
<![%default.polyhedron;[
<!ELEMENT polyhedron (side+, angle+)>
]]>
Now, in your document, you declare this DTD as your external subset, then redeclare the parameter entity default.polyhedron in the internal subset:
<!DOCTYPE picture SYSTEM "shapes.dtd" [
  <!ENTITY % default.polyhedron "IGNORE">
  <!ELEMENT polyhedron (side, side, side+, angle, angle, angle+)>
]>
Since the internal subset is read before the external subset, the parameter entity declaration here takes precedence over the one in the DTD. The conditional section in the DTD will get a value of IGNORE, masking the external element declaration for polyhedron. The element declaration in the internal subset is valid and used by the parser.
Conditional sections can be nested, but outer sections override inner ones. So if the outer section is set to IGNORE, its contents (including any conditional sections inside it) are completely turned off regardless of their values. For example:
<![INCLUDE[
  <!-- text in here will be included -->
  <![IGNORE[
    <!-- text in here will be ignored -->
  ]]>
]]>
<![IGNORE[
  <!-- text in here will be ignored -->
  <![INCLUDE[
    <!-- Warning: this stuff will be ignored too! -->
  ]]>
]]>
Public DTDs often make heavy use of conditional sections to allow the maximum level of customization. For example, the DocBook XML DTD Version 1.0 includes the following:
<!ENTITY % screenshot.content.module "INCLUDE">
<![%screenshot.content.module;[
<!ENTITY % screenshot.module "INCLUDE">
<![%screenshot.module;[
<!ENTITY % local.screenshot.attrib "">
<!ENTITY % screenshot.role.attrib "%role.attrib;">
<!ELEMENT screenshot (screeninfo?, (graphic|graphicco))>
<!ATTLIST screenshot
    %common.attrib;
    %screenshot.role.attrib;
    %local.screenshot.attrib;
>
<!--end of screenshot.module-->]]>
<!ENTITY % screeninfo.module "INCLUDE">
<![%screeninfo.module;[
<!ENTITY % local.screeninfo.attrib "">
<!ENTITY % screeninfo.role.attrib "%role.attrib;">
<!ELEMENT screeninfo (%para.char.mix;)*>
<!ATTLIST screeninfo
    %common.attrib;
    %screeninfo.role.attrib;
    %local.screeninfo.attrib;
>
<!--end of screeninfo.module-->]]>
<!--end of screenshot.content.module-->]]>
The outermost conditional section surrounds declarations for screenshot and also screeninfo, which occurs inside it. You can completely eliminate both screenshot and screeninfo by setting screenshot.content.module to IGNORE in your local DTD before the file is loaded. Alternatively, you can turn off only the section around the screeninfo declarations, perhaps to declare your own version of screeninfo. (Turning off the declarations for an element in the imported file avoids warnings from your parser about redundant declarations.) Notice that there are parameter entities to assign various kinds of content and attribute definitions, such as %common.attrib;. There are also hooks for inserting attributes of your own, such as %local.screenshot.attrib;.
Skillful use of conditional sections can make a DTD extremely flexible, although it may become harder to read. You should use them sparingly in your personal DTDs and try to design them to fit your needs from the beginning. Later, if the DTD becomes a public resource, it will make sense to add conditional sections to allow end user customization.
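A local customization layer that uses one of these switches might look like the sketch below. The filename mydocbook.dtd, the system identifier for the stock DTD, and the simplified replacement content model are all assumptions for illustration; only the parameter entity name screeninfo.module comes from the excerpt above.

```xml
<!-- mydocbook.dtd: a hypothetical customization layer -->

<!-- Switch off the stock screeninfo declarations. This must come
     before the import so that it is the first declaration seen. -->
<!ENTITY % screeninfo.module "IGNORE">

<!-- Supply a replacement (this simplified model is an assumption) -->
<!ELEMENT screeninfo (#PCDATA)>

<!-- Import the stock DTD; the system identifier is an assumed path -->
<!ENTITY % docbook SYSTEM "docbook.dtd">
%docbook;
```

Documents would then name mydocbook.dtd in their document type declarations instead of the stock DTD. Note that switching off the module also removes the stock attribute list for screeninfo, so any attributes you still want have to be redeclared in the layer.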
Using the internal subset
Recall from Section 4.2.2 earlier in this chapter that the internal subset is the part of an XML document that can contain entity declarations. Actually, it’s more powerful than that: you can put any declarations that would appear in a DTD into the internal subset. The only things that are restricted are conditional sections (can’t use them) and parameter entities (they can hold only complete declarations, not fragments). This is useful for overriding or turning on or off parts of the DTD. Here’s the general form:
<!DOCTYPE root-element SYSTEM "URI" [ declarations ]>
When a parser reads the DTD, it reads the internal subset first, then the external subset. This is important because the first declaration of an entity takes precedence over all other declarations of that entity. So you can override entity declarations in the DTD by declaring them in the internal subset. New elements and attributes can be declared in the internal subset, but you may not override existing declarations in the DTD. Recall that the mechanism for redefining an element or attribute is to use a parameter entity to turn off a conditional section containing the DTD’s declaration.
This example shows some correct uses of the internal subset:
<!DOCTYPE inventory SYSTEM "InventoryReport.dtd" [

  <!-- add a new "category" attribute to the item element -->
  <!ATTLIST item category (screw | bolt | nut) #REQUIRED>

  <!-- redefine the general entity companyname -->
  <!ENTITY companyname "Crunchy Biscuits Inc.">

  <!-- redefine the <price> element by redefining the price.module
       parameter entity -->
  <!ELEMENT price (currency, amount)>
  <!ENTITY % price.module "IGNORE">

  <!-- use a different module for figures than what the DTD uses -->
  <!ENTITY % figs SYSTEM "myfigs.mod">
]>
The attribute list declaration in this internal subset adds the attribute category to the set of attributes for item. As long as the DTD doesn’t also declare a category attribute for item, this is okay.
The element declaration here clashes with a declaration already in the DTD. However, the next line switches off a conditional section by declaring the parameter entity price.module to be IGNORE. So the DTD’s declaration will be hidden from the parser.
The last declaration overrides an external parameter entity in the DTD that imports a module, causing it to load the file myfigs.mod instead.
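These overrides work only if InventoryReport.dtd was written to allow them. The source does not show that file, so the following is a guess at the relevant parts, with invented default values and an invented module name (figures.mod):

```xml
<!-- Sketch of the parts of InventoryReport.dtd that the overrides rely on -->

<!-- default value, ignored when the internal subset declares it first -->
<!ENTITY companyname "Generic Company">

<!-- the price declaration sits inside a switchable conditional section -->
<!ENTITY % price.module "INCLUDE">
<![%price.module;[
<!ELEMENT price (#PCDATA)>
]]>

<!-- the figure module is imported through a parameter entity -->
<!ENTITY % figs SYSTEM "figures.mod">
%figs;
```

If the price declaration were not wrapped in a conditional section, the internal subset's declaration of price would collide with it and the document would be invalid.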
SimpleDoc: A Narrative Example
In Section 4.2.3 we developed a simple DTD for a data markup language. Narrative applications tend to be a little more complex, since there is more to human languages than simple data structures. Let’s experiment now with a DTD for a more complex, narrative application.
Inspired by DocBook, I’ve created a small, narrative application called SimpleDoc. It’s much smaller and doesn’t attempt to do even a fraction of what DocBook can do, but it touches on all the major concepts and so is suitable for pedagogical purposes. Specifically, the goal of SimpleDoc is to mark up small, simple documents such as the one in Example 4-3.
<?xml version="1.0"?>
<!DOCTYPE doc SYSTEM "simpledoc.dtd">
<doc>
  <title>Organism or Machine?</title>
  <section id="diner">
    <title>Sam's Diner</title>
    <para>A huge truck passed by, eating up four whole lanes with its
    girth. The whole back section was a glitzy passenger compartment
    trimmed in chrome and neon. The roof sprouted a giant image of a
    hamburger with flashing lights and the words, "Sam's Scruvi Soul
    Snax Shac". As it sped past at foolhardy speed, I saw a bevy of
    cars roped to the back, swerving back and forth.</para>
    <para>Included among these were:</para>
    <list>
      <listitem><para>a diesel-powered unicycle,</para></listitem>
      <listitem><para>a stretch limousine about 50 yards
      long,</para></listitem>
      <listitem><para>and the cutest little pod-cars shaped like
      spheres, with caterpillar tracks on the
      bottoms.</para></listitem>
    </list>
    <para>I made to intercept the truck, to hitch up my vehicle and
    climb aboard.</para>
    <note>
      <para>If you want to chain up your car to a moving truck, you
      had better know what you are doing.</para>
    </note>
  </section>
</doc>
Example 4-4 is the SimpleDoc DTD.
<!-- SimpleDoc DTD -->

<!--
===========================================================================
                           Parameter Entities
===========================================================================
-->

<!-- Attributes used in all elements -->
<!ENTITY % common.atts "
    id          ID                    #IMPLIED
    class       CDATA                 #IMPLIED
    xml:space   (default | preserve)  'default'
">

<!-- Block and complex elements -->
<!ENTITY % block.group "
      author
    | blockquote
    | codelisting
    | example
    | figure
    | graphic
    | list
    | note
    | para
    | remark
">

<!-- Inline elements -->
<!ENTITY % inline.group "
      acronym
    | citation
    | command
    | date
    | emphasis
    | filename
    | firstterm
    | literal
    | quote
    | ulink
    | xref
">

<!--
===========================================================================
                          Hierarchical Elements
===========================================================================
-->

<!-- The document element -->
<!ELEMENT doc (title, (%block.group;)*, section+)>
<!ATTLIST doc %common.atts;>

<!-- Section to break up the document -->
<!ELEMENT section (title, (%block.group;)*, section*)>
<!ATTLIST section %common.atts;>

<!--
===========================================================================
                             Block Elements
===========================================================================
-->

<!-- place to put the author's name -->
<!ELEMENT author (#PCDATA)>
<!ATTLIST author %common.atts;>

<!-- region of quoted text -->
<!ELEMENT blockquote (para+)>
<!ATTLIST blockquote %common.atts;>

<!-- formal codelisting (adds title) -->
<!ELEMENT example (title, codelisting)>
<!ATTLIST example %common.atts;>

<!-- formal picture (adds title) -->
<!ELEMENT figure (title, graphic)>
<!ATTLIST figure %common.atts;>

<!-- out-of-flow note -->
<!ELEMENT footnote (para+)>
<!ATTLIST footnote %common.atts;>

<!-- picture -->
<!ELEMENT graphic EMPTY>
<!ATTLIST graphic
    fileref     CDATA                 #REQUIRED
    %common.atts;
>

<!-- sequence of items -->
<!ELEMENT list (term?, listitem)+>
<!ATTLIST list
    type        (numbered|bulleted|definition)  "numbered"
    %common.atts;
>

<!-- component of a list -->
<!ELEMENT listitem (%block.group;)+>
<!ATTLIST listitem %common.atts;>

<!-- in-flow note -->
<!ELEMENT note (para+)>
<!ATTLIST note %common.atts;>

<!-- basic paragraph -->
<!ELEMENT para (#PCDATA | %inline.group; | footnote)*>
<!ATTLIST para %common.atts;>

<!-- code listing -->
<!ELEMENT codelisting (#PCDATA | %inline.group;)*>
<!ATTLIST codelisting
    xml:space   (preserve)            #FIXED 'preserve'
    %common.atts;
>

<!-- visible comment -->
<!ELEMENT remark (#PCDATA | %inline.group;)*>
<!ATTLIST remark %common.atts;>

<!-- document or section label -->
<!ELEMENT title (#PCDATA | %inline.group;)*>
<!ATTLIST title %common.atts;>

<!-- term in a definition list -->
<!ELEMENT term (#PCDATA | %inline.group;)*>
<!ATTLIST term %common.atts;>

<!--
===========================================================================
                             Inline Elements
===========================================================================
-->

<!ENTITY % inline.content "(#PCDATA)">

<!ELEMENT acronym %inline.content;>
<!ATTLIST acronym %common.atts;>

<!ELEMENT citation %inline.content;>
<!ATTLIST citation %common.atts;>

<!ELEMENT command %inline.content;>
<!ATTLIST command %common.atts;>

<!ELEMENT date %inline.content;>
<!ATTLIST date %common.atts;>

<!ELEMENT emphasis %inline.content;>
<!ATTLIST emphasis %common.atts;>

<!ELEMENT filename %inline.content;>
<!ATTLIST filename %common.atts;>

<!ELEMENT firstterm %inline.content;>
<!ATTLIST firstterm %common.atts;>

<!ELEMENT literal %inline.content;>
<!ATTLIST literal %common.atts;>

<!ELEMENT quote %inline.content;>
<!ATTLIST quote %common.atts;>

<!ELEMENT ulink %inline.content;>
<!ATTLIST ulink
    href        CDATA                 #REQUIRED
    %common.atts;
>

<!ELEMENT xref EMPTY>
<!ATTLIST xref
    linkend     IDREF                 #REQUIRED
    %common.atts;
>

<!--
===========================================================================
                             Useful Entities
===========================================================================
-->

<!ENTITY % isolat1
    PUBLIC "ISO 8879:1986//ENTITIES Added Latin 1//EN//XML"
    "isolat1.ent"
>
%isolat1;

<!ENTITY % isolat2
    PUBLIC "ISO 8879:1986//ENTITIES Added Latin 2//EN//XML"
    "isolat2.ent"
>
%isolat2;

<!ENTITY % isomath
    PUBLIC "ISO 8879:1986//ENTITIES Added Math Symbols: Ordinary//EN//XML"
    "isoamso.ent"
>
%isomath;

<!ENTITY % isodia
    PUBLIC "ISO 8879:1986//ENTITIES Diacritical Marks//EN//XML"
    "isodia.ent"
>
%isodia;

<!ENTITY % isogreek
    PUBLIC "ISO 8879:1986//ENTITIES Greek Symbols//EN//XML"
    "isogrk3.ent"
>
%isogreek;
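To see how the content models above constrain a real document, here is a minimal sketch of an instance that follows the SimpleDoc structure. The system identifier simpledoc.dtd, the id values, and the text content are all assumptions made for illustration; a validating parser would also need the ISO entity files named in the DTD to be resolvable.

```xml
<?xml version="1.0"?>
<!-- "simpledoc.dtd" is an assumed filename for the DTD listed above -->
<!DOCTYPE doc SYSTEM "simpledoc.dtd">
<doc id="sample-doc">
  <!-- doc requires a title first, then any number of block elements -->
  <title>A Short Example</title>
  <author>Pat Example</author>
  <para>This paragraph mixes text with <emphasis>inline</emphasis>
    elements such as <ulink href="http://www.example.com/">a link</ulink>.</para>
  <!-- at least one section must close the document -->
  <section id="sec-lists">
    <title>Lists</title>
    <!-- list is one or more (term?, listitem) groups -->
    <list type="bulleted">
      <listitem><para>First item</para></listitem>
      <listitem><para>Second item</para></listitem>
    </list>
    <!-- example adds a title to a codelisting -->
    <example>
      <title>A code listing</title>
      <codelisting>print "hello"</codelisting>
    </example>
  </section>
</doc>
```

Moving the author element after the section, or omitting the section entirely, would make the document invalid, because the content model for doc fixes both the order and the required parts.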
[2] Entity declarations and attribute declarations can appear redundantly without triggering a validity error; in both cases the first declaration is binding and later ones are ignored. If an element type is declared more than once, however, the DTD (and any document that uses it) is invalid.
[3] Whitespace is allowed to make the markup more readable, but it is ignored for the purpose of validation.
[4] In SGML, you can use conditional sections in documents as well as in DTDs. XML restricts their use to DTDs only. I personally miss them because I think they are a very powerful way to conditionally alter documents.