DTDs

The original XML document model is the Document Type Definition (DTD). DTDs actually predate XML; they are a reduced hand-me-down from SGML with the core syntax almost completely intact. The following describes how a DTD defines a document type.

  • A DTD declares a set of allowed elements. You cannot use any element names other than those in this set. Think of this as the “vocabulary” of the language.

  • A DTD defines a content model for each element. The content model is a pattern that tells what elements or data can go inside an element, in what order, in what number, and whether they are required or optional. Think of this as the “grammar” of the language.

  • A DTD declares a set of allowed attributes for each element. Each attribute declaration defines the name, datatype, default values (if any), and behavior (e.g., if it is required or optional) of the attribute.

  • A DTD provides a variety of mechanisms to make managing the model easier, for example, the use of parameter entities and the ability to import pieces of the model from an external file.

Document Prolog

According to the XML Recommendation, all external parsed entities (including DTDs) should begin with a text declaration. It looks like an XML declaration except that it explicitly excludes the standalone property. If you need to specify a character set other than the default UTF-8 (see Chapter 9 for more about character sets), or to change the XML version number from the default 1.0, this is where you would do it.

Tip

If you specify a character set in the DTD, it won’t automatically carry over into XML documents that use the DTD. XML documents have to specify their own encodings in their document prologs.

After the text declaration, the resemblance to normal document prologs ends. External parsed entities, including DTDs, must not contain a document type declaration.
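To make the text declaration concrete, here is a minimal sketch of how an external DTD file might begin. The encoding name and the note element are assumptions chosen purely for illustration; the point is that the declaration resembles an XML declaration but has no standalone property.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Text declaration above: version and encoding only, never standalone -->

<!-- Hypothetical declaration, just to show that markup declarations follow -->
<!ELEMENT note (#PCDATA)>
```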

Declarations

A DTD is a set of rules or declarations. Each declaration adds a new element, set of attributes, entity, or notation to the language you are describing. DTDs can be combined using parameter entities, a technique called modularization. You can also add declarations inside the internal subset of the document.

The order of the declarations is important in two situations. First, if there are redundant entity declarations, the first one that appears takes precedence and all others are ignored.[2] This is important to know if you are going to override declarations, either in the internal subset or by cascading DTDs. Second, if parameter entities are used in declarations, they must be declared before they are used as references.
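Both ordering rules can be sketched in a few lines. The entity and element names below are hypothetical, and the fragment is assumed to live in an external DTD file, since a parameter entity reference inside a declaration is not allowed in the internal subset.

```xml
<!-- Rule 1: the first declaration of an entity wins; the second is ignored -->
<!ENTITY company "Acme Widgets">
<!ENTITY company "Ignored Name Ltd.">

<!-- Rule 2: a parameter entity must be declared before it is referenced -->
<!ENTITY % name.parts "first, last">
<!ELEMENT name (%name.parts;)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>
```

Swapping the last parameter entity declaration with the element declaration that uses it would make the reference undefined at the point where the parser reads it.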

Declaration syntax is flexible when it comes to whitespace. You can add extra space anywhere except in the string of characters at the beginning that identifies the declaration type.

For example, these are all acceptable:

<!ELEMENT         thingie      ANY>
<!ELEMENT
  thingie
  ANY>
<!ELEMENT thingie (          foo      |
                             bar      |
                             zap      )*>

An Example

Imagine a scenario where you are collecting information from a group of people. The data you receive will be fed to a program that will process it and store it in a database. You need a quick way to determine whether all the required information is there before you can accept a submission. For this, we will use a DTD.

The information in this example will be census data. Your staff is roaming around the neighborhood interviewing families and entering data on their laptop computers. They are using an XML editor configured with a DTD that you’ve created to model your language, CensusML. Later, they will upload all the CensusML documents to the central repository to be processed overnight.

Example 4-1 shows how a typical valid CensusML document should look, minus the document prolog. A document represents an interview with one family. It contains a date, an address, and a list of people residing there. For each person, we are interested in taking their full name, age, employment status, and gender. We also use identification numbers for people, to ensure that we don’t accidentally enter in somebody more than once.

Example 4-1. A typical CensusML document
<census-record taker="3163">
  <date><year>2003</year><month>10</month><day>11</day></date>
  <address>
    <street>471 Skipstone Lane <unit>4-A</unit></street>
    <city>Emerald City</city>
    <county>Greenhill</county>
    <country>Oz</country>
    <postalcode>885JKL</postalcode>
  </address>
  <person employed="fulltime" pid="P270405">
    <name>
      <first>Meeble</first>
      <last>Bigbug</last>
      <junior/>
    </name>
    <age>39</age>
    <gender>male</gender>
  </person>
  <person employed="parttime" pid="P273882">
    <name>
      <first>Mable</first>
      <last>Bigbug</last>
    </name>
    <age>36</age>
    <gender>female</gender>
  </person>
  <person pid="P472891">
    <name>
      <first>Marble</first>
      <last>Bigbug</last>
    </name>
    <age>11</age>
    <gender>male</gender>
  </person>
</census-record>
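Example 4-1 leaves out the document prolog. As a sketch, and assuming the DTD is saved next to the document under the hypothetical file name censusml.dtd, the top of a complete file might look like this:

```xml
<?xml version="1.0"?>
<!DOCTYPE census-record SYSTEM "censusml.dtd">
<census-record taker="3163">
  <!-- Placeholder: the date, address, and person elements from
       Example 4-1 go here; without them the record is not valid -->
</census-record>
```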

Let’s start putting together a DTD. The first declaration is for the document element:

<!ELEMENT census-record
(date, address, person+)>

This establishes the first rules for the CensusML language: (1) there is an element named census-record and (2) it must contain one date element, one address element, and at least one person element. If you leave any of these elements out, or put them in a different order, the document will be invalid.

Note that the declaration doesn’t actually specify that the census-record must be used as the document element. In fact, a DTD can’t single out any element to be the root of a document. You might view this as a bad thing, since you can’t stop someone from submitting an incomplete document containing only a person element and nothing else. On the other hand, you could see it as a feature, where DTDs can contain more than one model for a document. For example, DocBook relies on this to support many different models for documents; a book would use the book element as its root, while an article would use the article element. In any case, be aware of this loophole.
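The loophole is easy to demonstrate. Assuming the finished DTD (shown later as Example 4-2) is saved under the hypothetical name censusml.dtd, a document like the following names person as its root element and should still validate, even though it is not a complete census record:

```xml
<?xml version="1.0"?>
<!DOCTYPE person SYSTEM "censusml.dtd">
<!-- Valid against the DTD, but only a fragment of a census record -->
<person pid="P270405">
  <name>
    <first>Meeble</first>
    <last>Bigbug</last>
  </name>
  <age>39</age>
  <gender>male</gender>
</person>
```

If this matters, the receiving program has to check the name of the root element itself.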

Now we should declare the attributes for this element. There is only one, taker, identifying the census taker who authored this document. Its type is CDATA (character data). We will make it required, because it’s important to know who is submitting the data just to make sure no mischievous people submit fraudulent records. Here is the attribute list for census-record:

<!ATTLIST census-record
  taker   CDATA   #REQUIRED>

Next, declare the date element. The order of element declarations doesn’t really matter. All the declarations are read into the parser’s memory before any validation takes place, so all that is necessary is that every element is accounted for. But I like things organized and in roughly the order in which the elements appear in a document, so here’s the next set of declarations:

<!ELEMENT date (year, month, day)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT month (#PCDATA)>
<!ELEMENT day (#PCDATA)>

The #PCDATA keyword represents parsed character data, and in a content model it must be enclosed in parentheses. Specifically, it matches zero or more characters. Any element with the content model (#PCDATA) can contain character data but not elements. So the elements year, month, and day are what you might call data fields. The date element, in contrast, must contain elements, but not character data.[3]

Now for the address bit. address is a container of elements just like date. For the most part, its subelements are plain data fields (their content is character data only), but one element has mixed content: street. Here are the declarations:

<!ELEMENT address 
  (street, city, county, country, postalcode)>
<!ELEMENT street (#PCDATA | unit)*>
<!ELEMENT city (#PCDATA)>
<!ELEMENT county (#PCDATA)>
<!ELEMENT country (#PCDATA)>
<!ELEMENT postalcode (#PCDATA)>
<!ELEMENT unit (#PCDATA)>

The declaration for street follows the pattern used by all mixed-content elements. The #PCDATA must come first followed by all the allowed subelements separated by vertical bars (|). The asterisk (*) here is required. It means that there can be zero or more of whatever comes before it. The upshot is that character data is optional, along with all the elements that can be interspersed within it.

Alas, there is no way to require that an element with mixed content contains character data. The census taker could just leave the street element blank and the validating parser would be happy with that. Changing that asterisk (*) to a plus (+) to require some character data is not allowed. To make validation simple and fast, DTDs never concern themselves with the actual details of character data.
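The following sketch summarizes what the mixed-content rules allow and what they do not. The instance fragments inside the last comment are illustrative only.

```xml
<!-- Allowed: the standard mixed-content form -->
<!ELEMENT street (#PCDATA | unit)*>
<!ELEMENT unit (#PCDATA)>

<!-- Not allowed: swapping the asterisk for a plus is a syntax error
<!ELEMENT street (#PCDATA | unit)+>
-->

<!-- All three of these instances are valid under the first declaration:
<street>471 Skipstone Lane <unit>4-A</unit></street>
<street><unit>4-A</unit></street>
<street/>
-->
```

If an empty street must be rejected, that check has to happen in the program that processes the data, not in the DTD.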

Our final task is to declare the elements and attributes making up a person. Here is a crack at the element declarations:

<!ELEMENT person (name, age, gender)>
<!ELEMENT name (first, last, (junior | senior)?)>
<!ELEMENT age (#PCDATA)>
<!ELEMENT gender (#PCDATA)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT junior EMPTY>
<!ELEMENT senior EMPTY>
<!ATTLIST person
    pid       ID                   #REQUIRED
    employed  (fulltime|parttime)  #IMPLIED>

The content model is a little more complex for this container. The first and last names are required, but there is an option to follow these with a qualifier (“Junior” or “Senior”). The qualifiers are declared as empty elements here using the keyword EMPTY and the question mark makes them optional, as not everyone is a junior or senior. Perhaps it would be just as easy to make an attribute called qualifier with values junior or senior, but I decided to do it this way to show you how to declare empty elements. Also, using an element makes the markup less cluttered, and we already have two attributes in the container element.
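For comparison, here is a sketch of the attribute-based alternative mentioned above. The attribute name qualifier comes from the text; the rest is an assumption about how one might write it, and it would replace the name, junior, and senior declarations rather than sit alongside them.

```xml
<!-- Alternative design: the qualifier as an optional enumerated attribute -->
<!ELEMENT name (first, last)>
<!ATTLIST name
    qualifier  (junior | senior)  #IMPLIED>

<!-- A document would then write, for example:
<name qualifier="junior">
  <first>Meeble</first>
  <last>Bigbug</last>
</name>
-->
```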

The first attribute declared is a required pid, a person identification string. Its type is ID, which to validating parsers means that it is a unique identifier within the scope of the document. No other element can have an ID-type attribute with that value. This means that if the census taker accidentally puts in a person twice, the parser will catch the error and report the document invalid. The parser can only check within the scope of the document, however, so there is nothing to stop a census taker from entering the same person in another document.

ID-type attributes have another limitation. There is one identifier-space for all of them, so even if you want to use them in different ways, such as having an identifier for the address and another for people, you can’t use the same string in both element types. A solution to this might be to prefix the identifier string with a code like “HOME-38225” for address and “PID-489294” for person, effectively creating your own separate identifier spaces. Note that the values of ID-type attributes must be valid XML names, so they must begin with a letter or underscore, just like element and attribute names; a value such as “38225” on its own would be rejected.
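As a sketch of the prefixing idea, suppose address were given an ID-type attribute of its own. The attribute name aid is hypothetical and is not part of the CensusML DTD developed in this section.

```xml
<!-- Hypothetical: give address its own optional ID-type attribute -->
<!ATTLIST address
    aid  ID  #IMPLIED>

<!-- In a document, the prefixes keep the two kinds of identifier apart:
<address aid="HOME-38225"> ... </address>
<person pid="PID-489294"> ... </person>
-->
```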

The other attribute, employed, is optional as denoted by the #IMPLIED keyword. It’s also an enumerated type, meaning that there is a set of allowed values (fulltime and parttime). Setting the attribute to anything else would result in a validation error.
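To see both attribute checks at work, consider this illustrative fragment of a census-record. The second person is made up to trigger the errors; a validating parser should report two problems with it.

```xml
<person pid="P270405" employed="fulltime">
  <name><first>Meeble</first><last>Bigbug</last></name>
  <age>39</age>
  <gender>male</gender>
</person>
<!-- Error 1: pid repeats an ID value already used in this document -->
<!-- Error 2: "retired" is not one of fulltime or parttime -->
<person pid="P270405" employed="retired">
  <name><first>Mable</first><last>Bigbug</last></name>
  <age>36</age>
  <gender>female</gender>
</person>
```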

Example 4-2 shows the complete DTD.

Example 4-2. The CensusML DTD
<!--
Census Markup Language
(use <census-record> as the document element)
-->
<!ELEMENT census-record (date, address, person+)>
<!ATTLIST census-record 
  taker   CDATA   #REQUIRED>

<!-- date the info was collected -->
<!ELEMENT date (year, month, day)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT month (#PCDATA)>
<!ELEMENT day (#PCDATA)>

<!-- address information -->
<!ELEMENT address 
  (street, city, county, country, postalcode)>
<!ELEMENT street (#PCDATA | unit)*>
<!ELEMENT city (#PCDATA)>
<!ELEMENT county (#PCDATA)>
<!ELEMENT country (#PCDATA)>
<!ELEMENT postalcode (#PCDATA)>
<!ELEMENT unit (#PCDATA)>

<!-- person information -->
<!ELEMENT person (name, age, gender)>
<!ELEMENT name (first, last, (junior | senior)?)>
<!ELEMENT age (#PCDATA)>
<!ELEMENT gender (#PCDATA)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT junior EMPTY>
<!ELEMENT senior EMPTY>
<!ATTLIST person
    pid       ID                   #REQUIRED
    employed  (fulltime|parttime)  #IMPLIED>

Tips for Designing and Customizing DTDs

DTD design and construction is part science and part art form. The basic concepts are easy enough, but managing a large DTD—maintaining hundreds of element and attribute declarations while keeping them readable and bug-free—can be a challenge. This section offers a collection of hints and best practices that you may find useful. The next section shows a concrete example that uses these practices.

Keeping it organized

DTDs are notoriously hard to read, but good organization always helps. A few extra minutes spent tidying up and writing comments can save you hours of scrutinizing later. Often a DTD is its own documentation, so if you expect others to use it, clean code is doubly important.

Organizing declarations by function

Keep declarations separated into sections by their purpose. In small DTDs, this helps you navigate the file. In larger DTDs, you might even want to break the declarations into separate modules. Some categories to group by are blocks, inlines, hierarchical elements, parts of tables, lists, etc. In Example 4-4, the declarations are divided by function (block, inline, and hierarchical).

Whitespace

Pad your declarations with lots of whitespace. Content models and attribute lists suffer from dense syntax, so spacing out the parts, even placing them on separate lines, helps make them more understandable. Indent lines inside declarations to make the delimiters more clear. Between logical divisions, use extra space and perhaps a comment with a row of dark characters to add separation. When you quickly scroll through the file, you will find it is much easier to navigate.
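As an illustration, here are two ways to write the same declaration from the CensusML DTD. They are alternatives, not a pair to be used together, since an element may be declared only once.

```xml
<!-- Dense: legal, but hard to scan -->
<!ELEMENT address (street,city,county,country,postalcode)>

<!-- Padded: one part per line, delimiters easy to spot -->
<!ELEMENT address
    ( street,
      city,
      county,
      country,
      postalcode )
>
```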

Comments

Use comments liberally—they are signposts in a wilderness of declarations. First, place a comment at the top of each file that explains the purpose of the DTD or module, gives the version number, and provides contact information. If it is a customized frontend to a public DTD, be sure to mention the original that it is based on, give credit to the authors, and explain the changes that you made. Next, label each section and subsection of the DTD.

Anywhere a comment might help to clarify the use of the DTD or explain your decisions, add one. As you modify the DTD, add new comments describing your changes. Comments are part of documentation, and unclear or outdated documentation can be worse than useless.
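A file header and a section label along these lines would cover the points above. Every detail in this sketch (version number, contact address, change note) is invented for illustration.

```xml
<!--
  CensusML DTD (hypothetical header, for illustration only)
  Purpose:  models one household interview per census-record
  Version:  1.1
  Contact:  dtd-maintainer@example.org
  Changes:  1.1 added the optional employed attribute on person
-->

<!-- ==================== Person information ==================== -->
```

Remember that a comment may not contain two hyphens in a row, which is why the divider uses equals signs.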

Version tracking

As with software, your DTD is likely to be updated as your requirements change. You should keep track of versions by numbering them; to avoid confusion, it’s important to change the version number whenever you make a change to the DTD. By convention, the first complete public release is 1.0. After that, small changes earn decimal increments: 1.1, 1.2, etc. Major changes increment by whole numbers: 2.0, 3.0, etc. Document the changes from version to version. Revision control systems are available to automate this process. On Unix-based systems, the RCS and CVS packages have both been the trusted friends of developers for years.

Parameter entities

Parameter entities can hold recurring parts of declarations and allow you to edit them in one place. In the external subset, they can be used in element-type declarations to hold element groups and content models, or in attribute list declarations to hold attribute definitions. The internal subset is a little stricter; parameter entities can hold only complete declarations, not fragments.

For example, assume you want every element to have an optional ID attribute for linking and an optional class attribute to assign specific role information. Parameter entities, which apply only in DTDs, look much like ordinary general entities, but have an extra % in the declaration. You can declare a parameter entity to hold common attributes like this:

<!ENTITY % common.atts "
  id        ID        #IMPLIED
  class     CDATA     #IMPLIED"
 >

That entity can then be used in attribute list declarations:

<!ATTLIST foo %common.atts;>
<!ATTLIST bar %common.atts;
   extra    CDATA     #FIXED "blah"
  >

Note that parameter entity references start with % rather than &.

Attributes versus elements

Making a DTD from scratch is not easy. You have to break your information down into its conceptual atoms and package it as a hierarchical structure, but it’s not always clear how to divide the information. The book model is easy, because it breaks down readily into hierarchical containers such as chapters, sections, and paragraphs. Less obvious are the models for equations, molecules, and databases. For such applications, it takes a supple mind to chop up documents into the optimal mix of elements and attributes. These tips are principles that can help you design DTDs:

  • Choose names that make sense. If your document is composed exclusively of elements like thing, object, and chunk, it’s going to be nearly impossible to figure out what’s what. Names should closely match the logical purpose of an element. It’s better to create specific elements for different tasks than to overload a few elements to handle many different situations. For example, the DIV and SPAN HTML elements aren’t ideal because they serve many different roles.

  • Hierarchy adds information. A newspaper has articles that contain paragraphs and heads. Containers create boundaries to make it easier to write stylesheets and processing applications. And they have an implied ownership that provides convenient handles and navigation aids for processors. Containers add depth, another dimension to increase the amount of structure.

    Strive for a tree structure that resembles a wide, bushy shrub. If you go too deep, the markup begins to overwhelm the content and it becomes harder to edit a document; too shallow and the information content is diluted. Think of documents and their parts as nested boxes. A big box filled with a million tiny boxes is much harder to work with than a box with a few medium boxes, and smaller boxes inside those, and so on.

  • Know when to use elements over attributes. An element holds content that is part of your document. An attribute modifies the behavior of an element. The trick is to find a balance between using general elements with attributes to specify purpose and creating an element for every single contingency.

Modularization

There are advantages to splitting a monolithic DTD into smaller components, or modules. The first benefit is that a modularized DTD can be easier to maintain, for reasons of organization mentioned earlier and because parts can be edited separately or “turned off” for debugging purposes. Also, the DTD becomes configurable. Modules in separate files can be swapped with others as easily as redefining a single parameter entity. Even within the same file, they can be marked for inclusion or exclusion.

XML provides two ways to modularize your DTD. The first is to store parts in separate files, then import them with external parameter entities. The second is to use a syntactic device called a conditional section. Both are powerful ways to make a DTD more flexible.

Importing modules from external sources

A DTD does not have to be stored in a single file. In fact, it often makes sense to store it in multiple files. You may wish to borrow from someone else, importing their DTD into your own as a subset. Or you may just want to make the DTD a little neater by separating pieces into different files.

To import whole DTDs or parts of DTDs, use an external parameter entity. Here is an example of a complete DTD that imports its pieces from various modules:

<!ELEMENT catalog (title, metadata, front, entries+)>
<!ENTITY % basic.stuff   SYSTEM "basics.mod">
%basic.stuff;
<!ENTITY % front.matter  SYSTEM "front.mod">
%front.matter;
<!ENTITY % metadata      PUBLIC "-//Standards Stuff//DTD Metadata 
  v3.2//EN" "http://www.standards-stuff.org/dtds/metadata.dtd">
%metadata;

This DTD has two local components, which are specified by system identifiers. Each component has a .mod filename extension, which is a traditional way to show that a file contains declarations but should not be used as a DTD on its own. The last component is a DTD that can stand on its own; in fact, in this example, it’s a public resource.

There is one potential problem with importing DTD text. An external parameter entity imports all the text in a file, not just a part of it. You get all the declarations, not just a few select ones. Worse, there is no concept of local scope, in which declarations in the local DTD automatically override those in the imported file. The declarations are assembled into one logical entity, and any information about what was imported from where is lost before the DTD is parsed.

There are a few ways to get around this problem. You can override entity declarations by redeclaring them or, to be more precise, predeclaring them. In other words, if an entity is declared more than once, the first declaration will take precedence. So you can override any entity declaration with a declaration in the internal subset of your document, since the internal subset is read before the external subset.

Overriding an element declaration is more difficult. It is a validity error to declare an element more than once. (You can make multiple ATTLIST declarations for the same element; the parser merges them, and if the same attribute is declared more than once, the first declaration is the one that counts.) So, the question is, how can you override a declaration such as this:

<!ELEMENT polyhedron (side+, angle+)>

with a declaration of your own like this:

<!ELEMENT polyhedron (side, side, side+, angle, angle, angle+)>

Overriding element and attribute declarations is not possible with what you know so far. I need to introduce a new syntactic construct called the conditional section.

Conditional sections

A conditional section is a special form of markup used in a DTD to mark a region of text for inclusion or exclusion in the DTD.[4] If you anticipate that a piece of your DTD may someday be an unwanted option, you can make it a conditional section and let the end user decide whether to keep it or not. Note that conditional sections can be used only in external subsets, not internal subsets.

Conditional sections look similar to CDATA sections. They use the square bracket delimiters, but the CDATA keyword is replaced with either INCLUDE or IGNORE. The syntax is like this:

<![switch[DTD text]]>

where switch is like an on/off switch, activating the DTD text if its value is INCLUDE, or marking it inactive if it’s set to IGNORE. For example:

<![INCLUDE[
<!-- these declarations will be included -->
<!ELEMENT foo (bar, caz, bub?)>
<!ATTLIST foo crud CDATA #IMPLIED>
]]>
<![IGNORE[
<!-- these declarations will be ignored -->
<!ELEMENT blah (#PCDATA)>
<!ELEMENT glop (flub|zuc)>
]]>

Using the hardcoded literals INCLUDE and IGNORE isn’t all that useful, since you have to edit each conditional section manually to flip the switch. Usually, the switch is a parameter entity, which can be defined anywhere:

<!ENTITY % optional.stuff "INCLUDE">
<![%optional.stuff;[
<!-- these declarations may or may not be included -->
<!ELEMENT foo (bar, caz, bub?)>
<!ATTLIST foo crud CDATA #IMPLIED>
]]>

Because the parameter entity optional.stuff is defined with the keyword INCLUDE, the declarations in the marked section will be used. If optional.stuff had been defined to be IGNORE, the declarations would have been ignored in the document.

This technique is especially powerful when you declare the switching entity in a document’s internal subset. In the next example, our DTD declares a general entity that is called disclaimer. The actual value of the entity depends on whether use-disclaimer has been set to INCLUDE:

<![%use-disclaimer;[
  <!ENTITY disclaimer "<p>This is Beta software. We can't promise it
  is free of bugs.</p>">
]]>
<!ENTITY disclaimer "">

In each document, it’s a simple step to declare the switching entity in the internal subset and so decide whether the disclaimer appears. Here it is switched off:

<?xml version="1.0"?>
<!DOCTYPE manual SYSTEM "manual.dtd" [
  <!ENTITY % use-disclaimer "IGNORE">
]>

<manual>
  <title>User Guide for Techno-Wuzzy</title>

  &disclaimer;
  ...

In this example, the entity use-disclaimer is set to IGNORE, so the disclaimer is declared as an empty string and the document’s text will not contain a disclaimer. This is a simple example of customizing a DTD using conditional sections and parameter entities.
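For contrast, here is a sketch of the opposite case, in which a document switches the disclaimer on. It assumes the same manual.dtd as above and that its content model accepts the elements shown.

```xml
<?xml version="1.0"?>
<!DOCTYPE manual SYSTEM "manual.dtd" [
  <!-- INCLUDE activates the conditional section in the DTD -->
  <!ENTITY % use-disclaimer "INCLUDE">
]>
<manual>
  <title>User Guide for Techno-Wuzzy</title>
  &disclaimer;
</manual>
```

With the section included, the declaration of disclaimer that holds the warning text is read first, so the later empty declaration in the DTD is ignored.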

Now, returning to our previous problem of overriding element or attribute declarations, here is how to do it with conditional sections. First, the DTD must be written to allow parameter entity switching:

<!ENTITY % default.polyhedron "INCLUDE">
<![%default.polyhedron;[
<!ELEMENT polyhedron (side+, angle+)>
]]>

Now, in your document, you declare this DTD as your external subset, then redeclare the parameter entity default.polyhedron in the internal subset:

<!DOCTYPE picture SYSTEM "shapes.dtd" [
  <!ENTITY % default.polyhedron "IGNORE">
  <!ELEMENT polyhedron (side, side, side+, angle, angle, angle+)>
]>

Since the internal subset is read before the external subset, the parameter entity declaration here takes precedence over the one in the DTD. The conditional section in the DTD will get a value of IGNORE, masking the external element declaration for polyhedron. The element declaration in the internal subset is valid and used by the parser.

Conditional sections can be nested, but outer sections override inner ones. So if the outer section is set to IGNORE, its contents (including any conditional sections inside it) are completely turned off regardless of their values. For example:

<![INCLUDE[
<!-- text in here will be included -->
  <![IGNORE[
  <!-- text in here will be ignored -->
  ]]>
]]>
<![IGNORE[
<!-- text in here will be ignored -->
  <![INCLUDE[
  <!-- Warning: this stuff will be ignored too! -->
  ]]>
]]>

Public DTDs often make heavy use of conditional sections to allow the maximum level of customization. For example, the DocBook XML DTD Version 1.0 includes the following:

<!ENTITY % screenshot.content.module "INCLUDE">
<![%screenshot.content.module;[
<!ENTITY % screenshot.module "INCLUDE">
<![%screenshot.module;[
<!ENTITY % local.screenshot.attrib "">
<!ENTITY % screenshot.role.attrib "%role.attrib;">
<!ELEMENT screenshot (screeninfo?, (graphic|graphicco))>
<!ATTLIST screenshot
                %common.attrib;
                %screenshot.role.attrib;
                %local.screenshot.attrib;
>
<!--end of screenshot.module-->]]>

<!ENTITY % screeninfo.module "INCLUDE">
<![%screeninfo.module;[
<!ENTITY % local.screeninfo.attrib "">
<!ENTITY % screeninfo.role.attrib "%role.attrib;">
<!ELEMENT screeninfo (%para.char.mix;)*>
<!ATTLIST screeninfo
                %common.attrib;
                %screeninfo.role.attrib;
                %local.screeninfo.attrib;
>
<!--end of screeninfo.module-->]]>
<!--end of screenshot.content.module-->]]>

The outermost conditional section surrounds declarations for screenshot and also screeninfo, which occurs inside it. You can completely eliminate both screenshot and screeninfo by setting screenshot.content.module to IGNORE in your local DTD before the file is loaded. Alternatively, you can turn off only the section around the screeninfo declarations, perhaps to declare your own version of screeninfo. (Turning off the declarations for an element in the imported file avoids warnings from your parser about redundant declarations.) Notice that there are parameter entities to assign various kinds of content and attribute definitions, such as %common.attrib;. There are also hooks for inserting attributes of your own, such as %local.screenshot.attrib;.
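A local customization layer that replaces screeninfo might be sketched as follows. The file names mydocbook.dtd and docbook.dtd are placeholders, and the replacement content model is an arbitrary choice for illustration; a real customization would use the identifiers shipped with the DocBook distribution.

```xml
<!-- Hypothetical customization layer: mydocbook.dtd -->

<!-- Switch off the stock screeninfo declarations before loading DocBook;
     this declaration is read first, so it takes precedence -->
<!ENTITY % screeninfo.module "IGNORE">

<!-- Load the stock DTD; the file name is an assumption -->
<!ENTITY % docbook SYSTEM "docbook.dtd">
%docbook;

<!-- Supply a replacement: here screeninfo holds plain text only -->
<!ELEMENT screeninfo (#PCDATA)>
```

Because the content model for screenshot still refers to screeninfo, the replacement declaration is needed; the stock attribute list is switched off along with the element declaration.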

Skillful use of conditional sections can make a DTD extremely flexible, although it may become harder to read. You should use them sparingly in your personal DTDs and try to design them to fit your needs from the beginning. Later, if the DTD becomes a public resource, it will make sense to add conditional sections to allow end user customization.

Using the internal subset

Recall from Section 4.2.2 earlier in this chapter that the internal subset is the part of an XML document that can contain entity declarations. Actually, it’s more powerful than that: you can put any declarations that would appear in a DTD into the internal subset. The only things that are restricted are conditional sections (can’t use them) and parameter entities (they can hold only complete declarations, not fragments). This is useful for overriding or turning on or off parts of the DTD. Here’s the general form:

<!DOCTYPE root-element 
                  URI [ declarations ]>

When a parser reads the DTD, it reads the internal subset first, then the external subset. This is important because the first declaration of an entity takes precedence over all other declarations of that entity. So you can override entity declarations in the DTD by declaring them in the internal subset. New elements and attributes can be declared in the internal subset, but you may not override existing declarations in the DTD. Recall that the mechanism for redefining an element or attribute is to use a parameter entity to turn off a conditional section containing the DTD’s declaration.

This example shows some correct uses of the internal subset:

<!DOCTYPE inventory SYSTEM "InventoryReport.dtd" [

<!-- add a new "category" attribute to the item element -->
<!ATTLIST item category (screw | bolt | nut) #REQUIRED>

<!-- redefine the general entity companyname -->
<!ENTITY companyname "Crunchy Biscuits Inc.">

<!-- redefine the <price> element by redefining the price.module 
     parameter entity -->
<!ELEMENT price (currency, amount)>
<!ENTITY % price.module "IGNORE">

<!-- use a different module for figures than what the DTD uses -->
<!ENTITY % figs SYSTEM "myfigs.mod">
]>

The attribute list declaration in this internal subset adds the attribute category to the set of attributes for item. As long as the DTD doesn’t also declare a category attribute for item, this is okay.

The element declaration here clashes with a declaration already in the DTD. However, the next line switches off a conditional section by declaring the parameter entity price.module to be IGNORE. So the DTD’s declaration will be hidden from the parser.

The last declaration overrides an external parameter entity in the DTD that imports a module, causing it to load the file myfigs.mod instead.

Warning

You’re only allowed to declare an element once in a DTD, so while you can override declarations for attributes, don’t declare an element in the internal subset if it’s already declared elsewhere.

SimpleDoc: A Narrative Example

In Section 4.2.3 we developed a simple DTD for a data markup language. Narrative applications tend to be a little more complex, since there is more to human languages than simple data structures. Let’s experiment now with a DTD for a more complex, narrative application.

Inspired by DocBook, I’ve created a small, narrative application called SimpleDoc. It’s much smaller and doesn’t attempt to do even a fraction of what DocBook can do, but it touches on all the major concepts and so is suitable for pedagogical purposes. Specifically, the goal of SimpleDoc is to mark up small, simple documents such as the one in Example 4-3.

Example 4-3. A sample SimpleDoc document
<?xml version="1.0"?>
<!DOCTYPE doc SYSTEM "simpledoc.dtd">
<doc>
  <title>Organism or Machine?</title>
  <section id="diner">
    <title>Sam's Diner</title>
    <para>A huge truck passed by, eating up four whole lanes with its girth.
The whole back section was a glitzy passenger compartment trimmed in
chrome and neon. The roof sprouted a giant image of a hamburger with
flashing lights and the words, "Sam's Scruvi Soul Snax Shac". As it
sped past at foolhardy speed, I saw a bevy of cars roped to the back,
swerving back and forth.</para>
    <para>Included among these were:</para>
    <list>
      <listitem><para>a diesel-powered unicycle,</para></listitem>
      <listitem><para>a stretch limousine about 50 yards
      long,</para></listitem>
      <listitem><para>and the cutest little pod-cars shaped like spheres,
      with caterpillar tracks on the bottoms.</para></listitem>
    </list>
    <para>I made to intercept the truck, to hitch up my vehicle and climb
aboard.</para>
    <note>
      <para>If you want to chain up your car to a moving truck, you had better
know what you are doing.</para>
    </note>
  </section>
</doc>

Example 4-4 is the SimpleDoc DTD.

Example 4-4. The SimpleDoc DTD
<!--
SimpleDoc DTD
-->

<!-- ===========================================================================
               Parameter Entities
     =========================================================================== -->

<!-- Attributes used in all elements -->
<!ENTITY % common.atts "
        id        ID        #IMPLIED
        class     CDATA     #IMPLIED
        xml:space (default | preserve) 'default'
">

<!-- Block and complex elements -->
<!ENTITY % block.group "
          author
        | blockquote
        | codelisting
        | example
        | figure
        | graphic
        | list
        | note
        | para
        | remark
">

<!-- Inline elements -->
<!ENTITY % inline.group "
          acronym
        | citation
        | command
        | date
        | emphasis
        | filename
        | firstterm
        | literal
        | quote
        | ulink
        | xref
">

<!-- ===========================================================================
               Hierarchical Elements
     =========================================================================== -->

<!-- The document element -->
<!ELEMENT doc (title, (%block.group;)*, section+)>
<!ATTLIST doc %common.atts;>

<!-- Section to break up the document -->
<!ELEMENT section (title, (%block.group;)*, section*)>
<!ATTLIST section %common.atts;>

<!-- ===========================================================================
                 Block Elements
     =========================================================================== -->

<!-- place to put the author's name -->
<!ELEMENT author (#PCDATA)>
<!ATTLIST author %common.atts;>

<!-- region of quoted text -->
<!ELEMENT blockquote (para+)>
<!ATTLIST blockquote %common.atts;>

<!-- formal codelisting (adds title) -->
<!ELEMENT example (title, codelisting)>
<!ATTLIST example %common.atts;>

<!-- formal picture (adds title) -->
<!ELEMENT figure (title, graphic)>
<!ATTLIST figure %common.atts;>

<!-- out-of-flow note -->
<!ELEMENT footnote (para+)>
<!ATTLIST footnote %common.atts;>

<!-- picture -->
<!ELEMENT graphic EMPTY>
<!ATTLIST graphic
        fileref   CDATA     #REQUIRED
        %common.atts;
>

<!-- sequence of items -->
<!ELEMENT list (term?, listitem)+>
<!ATTLIST list
    type      (numbered|bulleted|definition)      "numbered"
    %common.atts;
>

<!-- component of a list -->
<!ELEMENT listitem (%block.group;)+>
<!ATTLIST listitem %common.atts;>

<!-- in-flow note -->
<!ELEMENT note (para+)>
<!ATTLIST note %common.atts;>

<!-- basic paragraph -->
<!ELEMENT para (#PCDATA | %inline.group; | footnote)*>
<!ATTLIST para %common.atts;>

<!-- code listing -->
<!ELEMENT codelisting (#PCDATA | %inline.group;)*>
<!ATTLIST codelisting
    xml:space (preserve) #FIXED 'preserve'
    %common.atts;
>

<!-- visible comment -->
<!ELEMENT remark (#PCDATA | %inline.group;)*>
<!ATTLIST remark %common.atts;>

<!-- document or section label -->
<!ELEMENT title (#PCDATA | %inline.group;)*>
<!ATTLIST title %common.atts;>

<!-- term in a definition list -->
<!ELEMENT term (#PCDATA | %inline.group;)*>
<!ATTLIST term %common.atts;>

<!-- ===========================================================================
                 Inline Elements
     =========================================================================== -->

<!ENTITY % inline.content "(#PCDATA)">

<!ELEMENT acronym %inline.content;>
<!ATTLIST acronym %common.atts;>

<!ELEMENT citation %inline.content;>
<!ATTLIST citation %common.atts;>

<!ELEMENT command %inline.content;>
<!ATTLIST command %common.atts;>

<!ELEMENT date %inline.content;>
<!ATTLIST date %common.atts;>

<!ELEMENT emphasis %inline.content;>
<!ATTLIST emphasis %common.atts;>

<!ELEMENT filename %inline.content;>
<!ATTLIST filename %common.atts;>

<!ELEMENT firstterm %inline.content;>
<!ATTLIST firstterm %common.atts;>

<!ELEMENT literal %inline.content;>
<!ATTLIST literal %common.atts;>

<!ELEMENT quote %inline.content;>
<!ATTLIST quote %common.atts;>

<!ELEMENT ulink %inline.content;>
<!ATTLIST ulink
        href      CDATA   #REQUIRED
        %common.atts;
>

<!ELEMENT xref EMPTY>
<!ATTLIST xref 
        linkend   IDREF   #REQUIRED
        %common.atts;
>

<!-- ===========================================================================
                 Useful Entities
     =========================================================================== -->

<!ENTITY % isolat1
    PUBLIC "ISO 8879:1986//ENTITIES Added Latin 1//EN//XML"
    "isolat1.ent"
>
%isolat1;
<!ENTITY % isolat2
    PUBLIC "ISO 8879:1986//ENTITIES Added Latin 2//EN//XML"
    "isolat2.ent"
>
%isolat2;
<!ENTITY % isomath
    PUBLIC "ISO 8879:1986//ENTITIES Added Math Symbols: Ordinary//EN//XML"
    "isoamso.ent"
>
%isomath;
<!ENTITY % isodia
    PUBLIC "ISO 8879:1986//ENTITIES Diacritical Marks//EN//XML"
    "isodia.ent"
>
%isodia;
<!ENTITY % isogreek
    PUBLIC "ISO 8879:1986//ENTITIES Greek Symbols//EN//XML"
    "isogrk3.ent"
>
%isogreek;
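
To see how the list content model (term?, listitem)+ plays out, here is a sketch of a definition list marked up in SimpleDoc. The text is invented for illustration. It assumes the DTD in Example 4-4 is saved as simpledoc.dtd, as in Example 4-3, and that the isolat1.ent entity file it imports can be found by the parser, since the &eacute; reference depends on it:

```xml
<?xml version="1.0"?>
<!DOCTYPE doc SYSTEM "simpledoc.dtd">
<doc>
  <title>Roadside Glossary</title>
  <section id="glossary">
    <title>Terms</title>
    <!-- type overrides the default value, "numbered" -->
    <list type="definition">
      <!-- each optional term is followed by exactly one listitem -->
      <term>pod-car</term>
      <listitem><para>A small spherical vehicle with
      <emphasis>caterpillar tracks</emphasis> on the bottom.</para></listitem>
      <term>diner</term>
      <listitem><para>A rolling caf&eacute; towed behind a
      truck.</para></listitem>
    </list>
  </section>
</doc>
```

Because listitem is declared as (%block.group;)+, the text of each definition has to sit inside a block element such as para rather than directly in the listitem.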


[2] Entity declarations are the only kind of declaration that can appear redundantly without triggering a validity error. If an element type is declared more than once, it will render the DTD (and any documents that use it) invalid.

[3] Whitespace is allowed to make the markup more readable, but would be ignored for the purpose of validation.

[4] In SGML, you can use conditional sections in documents as well as in DTDs. XML restricts their use to DTDs only. I personally miss them because I think they are a very powerful way to conditionally alter documents.
