Checking Documents for Well-Formedness

Every XML document, without exception, must be well-formed. This means it must adhere to a number of rules, including the following:

  1. Every start-tag must have a matching end-tag.

  2. Elements may nest but may not overlap.

  3. There must be exactly one root element.

  4. Attribute values must be quoted.

  5. An element may not have two attributes with the same name.

  6. Comments and processing instructions may not appear inside tags.

  7. No unescaped < or & signs may occur in an element's character data or in an attribute value.
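
As a rough illustration of these rules (not an example from the original text), the following Python sketch feeds one deliberately malformed document per rule to the standard library's xml.etree.ElementTree parser, which is built on Expat. The helper name is_well_formed and all of the sample documents are assumptions invented for this sketch.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text, False on a parse error."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# One hypothetical malformed document per rule in the list above.
MALFORMED = {
    1: '<note>Hello',                        # start-tag with no end-tag
    2: '<b><i>bold italic</b></i>',          # overlapping elements
    3: '<a/><b/>',                           # two root elements
    4: '<note lang=en/>',                    # unquoted attribute value
    5: '<note id="1" id="2"/>',              # duplicate attribute name
    6: '<note <!-- draft --> >text</note>',  # comment inside a tag
    7: '<math>1 < 2 & 3</math>',             # unescaped < and &
}

# The same kinds of content, written so that every rule is satisfied.
WELL_FORMED = '<note lang="en"><b><i>AT&amp;T: 1 &lt; 2</i></b></note>'

if __name__ == "__main__":
    for rule, doc in MALFORMED.items():
        print(f"rule {rule}: well-formed? {is_well_formed(doc)}")
    print(f"corrected sample: well-formed? {is_well_formed(WELL_FORMED)}")
```

Each malformed sample should be rejected, while the corrected sample, with its quoted attribute value, properly nested elements, and escaped &lt; and &amp;, should parse cleanly.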

This is not an exhaustive list. There are many, many ways a document can be malformed. You’ll find a complete list in Chapter 21. Some of these involve constructs that we have not yet discussed, such as DTDs. Others are extremely unlikely to occur if you follow the examples in this chapter (for example, including whitespace between the opening < and the element name in a tag).

Whether the error is small or large, likely or unlikely, an XML parser reading a document is required to report it. It may or may not report multiple well-formedness errors it detects in the document. However, the parser is not allowed to try to fix the document and make a good-faith effort to provide what it thinks the author really meant. It can’t fill in missing quotes around attribute values, insert an omitted end-tag, or ignore a comment that’s inside a start-tag. The parser is required to return an error. The objective here is to avoid the bug-for-bug compatibility wars that ...
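
To see this report-but-don't-repair behavior in practice, here is a minimal sketch, again assuming Python's xml.etree.ElementTree rather than any particular parser discussed in the text. The check helper and both sample documents are hypothetical. This parser stops at the first error it finds and says where it occurred; it does not guess at the missing quotes or hand back a partial tree.

```python
import xml.etree.ElementTree as ET


def check(text):
    """Return None if text is well-formed, else (line, column, message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # ParseError.position holds the (line, column) of the error.
        line, column = err.position
        return line, column, str(err)
    return None


# Hypothetical samples: the first omits the quotes around an attribute value.
BAD = '<person>\n  <name first=Alan>Turing</name>\n</person>'
GOOD = '<person>\n  <name first="Alan">Turing</name>\n</person>'

if __name__ == "__main__":
    for label, doc in (("bad", BAD), ("good", GOOD)):
        result = check(doc)
        if result is None:
            print(f"{label}: well-formed")
        else:
            line, column, message = result
            print(f"{label}: error on line {line}: {message}")
```

The only way to get a document past the parser is to correct the source, here by quoting the attribute value.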
