<?xml version="1.0"?>
<product barcode="2394287410">
  <manufacturer>Verbatim</manufacturer>
  <name>DataLife MF 2HD</name>
  <quantity>10</quantity>
  <size>3.5"</size>
  <color>black</color>
  <description>floppy disks</description>
</product>

You would not expect to find a G_Clef element when reading a biology
document. Some of these rules can be precisely specified with a schema
written in any of several languages, including the W3C XML Schema
Language, RELAX NG, and DTDs. A document may contain a URL indicating
where the schema can be found. Some XML parsers will notice this and
compare the document to its schema as they read it to see if the
document satisfies the constraints specified there. Such a parser is
called a validating parser.

XML documents use tags such as <SKU>, <Record_ID>, and <author> that look superficially like
HTML tags. However, in HTML you're limited to about a hundred predefined
tags that describe web page formatting. In XML, you can create as many
tags as you need. Furthermore, these tags will mostly describe the type
of content they contain rather than formatting or layout information. In
XML you don't say that something is italicized or indented or bold, you
say that it's a book or a biography or a calendar.

<person> Alan Turing </person>

The generic MIME media type for an XML document is application/xml or
text/xml. However, specific XML applications may use more specific MIME
media types, such as application/mathml+xml, application/xslt+xml,
image/svg+xml, text/vnd.wap.wml, or even text/html (in very special
cases).

application/xml should be preferred to text/xml, although many web
servers come configured out of the box to use text/xml. text/xml uses
the ASCII character set as a default, which is incorrect for most XML
documents.

The example above consists of a single element named person. The
element is delimited by the start-tag
<person> and the
end-tag </person>. Everything between the
start-tag and the end-tag of the element (exclusive) is called the
element's content. The content of this element is the text:

Alan Turing

<person> and </person> are
markup. The string "Alan Turing" and its surrounding
whitespace are character data. The tag is the most common form of
markup in an XML document, but there are other kinds we'll discuss
later.

Start-tags begin with < and end-tags begin with
</. Both of these are followed
by the name of the element and are closed by >. However, unlike HTML tags, you are
allowed to make up new XML tags as you go along. To describe a
person, use <person> and
</person> tags. To describe
a calendar, use <calendar>
and </calendar> tags. The
names of the tags generally reflect the type of content inside the
element, not how that content will be formatted.

An element with no content can be written as a single empty-element
tag that begins with < but ends with
/>. For instance, in
XHTML, an XMLized reformulation of standard HTML,
the line-break and horizontal-rule elements are written as
<br /> and <hr /> instead of <br> and <hr>. These are exactly equivalent
to <br></br> and
<hr></hr>, however.
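
As a rough check of this equivalence, the following sketch parses both spellings of an empty element and compares the results. It uses Python's standard xml.etree.ElementTree module, which is a choice made here for illustration and not something the text prescribes. The helper names describe and is_well_formed are hypothetical.

```python
import xml.etree.ElementTree as ET


def describe(xml_text):
    """Summarize a root element as (name, text, number of children)."""
    root = ET.fromstring(xml_text)
    return (root.tag, root.text, len(root))


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The empty-element tag and the start-tag/end-tag pair parse identically.
short_form = describe("<br />")
long_form = describe("<br></br>")
print(short_form == long_form)

# A start-tag with no matching end-tag is rejected by the parser.
print(is_well_formed("<p>one<br>two</p>"))
```

An HTML parser would accept the last fragment, which is why the second check is a useful reminder when converting HTML to XHTML.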
Which form you use for empty elements is completely up to you.
However, what you cannot do in XML and XHTML (unlike HTML) is use
only the start-tag—for instance <br> or <hr>—without using the matching
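end-tag. That would be a well-formedness error.

For example, this person element has a born attribute with the value
1912-06-23 and a died attribute with the value 1954-06-07:

<person born="1912-06-23" died="1954-06-07"> Alan Turing </person>

The next element is the same as far as an XML parser is concerned. It
uses single quotes instead of double quotes, puts whitespace around the
equals signs, and reorders the attributes:

<person died = '1954-06-07' born = '1912-06-23' > Alan Turing </person>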
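
The two person elements above differ only in quote style, whitespace around the equals signs, and attribute order, none of which is significant to a parser. The sketch below checks this with Python's standard xml.etree.ElementTree module (an illustrative choice, not part of the original text); the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

double_quoted = ET.fromstring(
    '<person born="1912-06-23" died="1954-06-07"> Alan Turing </person>'
)
single_quoted = ET.fromstring(
    "<person died = '1954-06-07' born = '1912-06-23' > Alan Turing </person>"
)

# attrib is a plain dict, so the comparison ignores the order in which the
# attributes were written and the kind of quotes that delimited them.
same_attributes = double_quoted.attrib == single_quoted.attrib
same_content = double_quoted.text == single_quoted.text
print(same_attributes, same_content)
```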

<person>
  <name first="Alan" last="Turing"/>
  <profession value="computer scientist"/>
  <profession value="mathematician"/>
  <profession value="cryptographer"/>
</person>

The character data inside an element may not contain a raw unescaped
opening angle bracket (<).
This character is always interpreted as beginning a tag. If you need
to use this character in your text, you can escape it using the
entity reference &lt;, the numeric character reference
&#60;, or the hexadecimal numeric character reference
&#x3C;. When a parser reads the document, it replaces any &lt;,
&#60;, or &#x3C; references it finds with the
actual < character. However, it
will not confuse the references with the starts of tags. For
example:

<SCRIPT LANGUAGE="JavaScript">
if (location.host.toLowerCase( ).indexOf("ibiblio") &lt; 0) {
location.href="http://ibiblio.org/xml/";
}
</SCRIPT>

Character data may not contain a raw unescaped ampersand (&)
either. This is always interpreted as beginning an entity reference.
However, the ampersand may be escaped using the &amp; entity reference like this:

<company>W.L. Gore &amp; Associates</company>

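
To see these references being resolved, the following sketch parses a few small fragments with Python's standard xml.etree.ElementTree module and then escapes a raw string with xml.sax.saxutils.escape. Both modules are illustrative choices made here, and the element name t is a placeholder.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The parser hands the application a literal ampersand, not the reference.
company = ET.fromstring("<company>W.L. Gore &amp; Associates</company>")
print(company.text)

# The entity, decimal, and hexadecimal references all resolve to "<".
references = ("&lt;", "&#60;", "&#x3C;")
resolved = [ET.fromstring(f"<t>{ref}</t>").text for ref in references]
print(resolved)

# Going the other way, escape() rewrites &, <, and > as references so the
# string can be placed safely in element content.
escaped = escape("W.L. Gore & Associates")
print(escaped)
```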
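Entity references such as &amp; and character references such as
&#60; are markup. When an
application parses an XML document, it replaces this particular markup
with the actual character or characters the reference refers to.

XML predefines exactly five entity references:

&lt;    the less-than sign (<)
&amp;   the ampersand (&)
&gt;    the greater-than sign (>)
&quot;  the straight double quotation mark (")
&apos;  the apostrophe (')

Only &lt; and &amp; must be used instead of the
literal characters in element content. The others are optional. &quot;
and &apos; are useful inside attribute
values where a raw " or ' might be misconstrued as ending the
attribute value. For example, an apostrophe inside an attribute value
delimited by single quotes can be written as &apos;.

When a document includes samples of literal code, the < and &
characters in those samples must be
encoded as &lt; and &amp;. The more sections of literal code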
a document includes and the longer they are, the more tedious this
encoding becomes. Instead you can enclose each sample of literal code
in a CDATA section. A CDATA section is
set off by <![CDATA[ and ]]>.
Everything between the <![CDATA[
and the ]]> is treated as raw
character data. Less-than signs don't begin tags. Ampersands don't
start entity references. Everything is simply character data, not
markup.

<p>You can use a default <code>xmlns</code> attribute to avoid
having to add the svg prefix to all your elements:</p>
<pre><![CDATA[
<svg xmlns="http://www.w3.org/2000/svg"
     width="12cm" height="10cm">
  <ellipse rx="110" ry="130" />
  <rect x="4cm" y="1cm" width="3cm" height="6cm" />
</svg>
]]></pre>

The SVG source code is included directly in the XHTML file without
replacing each < with &lt;. The result will be a sample SVG
document, not an embedded SVG picture, as might happen if this example
were not placed inside a CDATA
section.

The only thing that cannot appear in a CDATA section is the CDATA
section end delimiter, ]]>.

CDATA sections exist for the
convenience of human authors, not for programs. Parsers are not
required to tell you whether a particular block of text came from a
CDATA section, from normal
character data, or from character data that contained entity
references such as &lt; and
&amp;. By the time you get
access to the data, these differences will have been washed away. No
code you write should depend on the difference between them.

Comments begin with <!-- and end with the first occurrence of -->. For example:

<!-- I need to verify and update these links when I get a chance. -->

The double hyphen -- must not
appear anywhere inside the comment until the closing -->. In particular, a three-hyphen close
like ---> is specifically
forbidden.

In HTML, the contents of the script element are sometimes enclosed in a
comment to protect them from display by a nonscript-aware browser. The
Apache web server parses comments in .shtml files to recognize server-side
includes. Unfortunately, these documents may not survive being passed
through various HTML editors and processors with their comments and
associated semantics intact. Worse yet, it's possible for an innocent
comment to be misconstrued as input to the application.

A processing instruction begins with <? and ends with ?>.
Immediately following the <? is
an XML name called the target, possibly the name of the application for which this
processing instruction is intended or possibly just an identifier for
this particular processing instruction. The rest of the processing
instruction contains text in a format appropriate for the applications
for which the instruction is intended.

For example, in HTML a robots META tag is used to tell search-engine and
other robots whether and how they should index a page. The following
processing instruction has been proposed as an equivalent for XML
documents:

<?robots index="yes" follow="no"?>

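
A parser reports only the target and the raw text of a processing instruction; making sense of that text is left to the application. The sketch below illustrates this with Python's standard xml.dom.minidom module. The root element page and the regular expression used to split the data are assumptions made for the example, not part of the proposal.

```python
import re
from xml.dom import minidom

doc = minidom.parseString('<?robots index="yes" follow="no"?><page/>')

# The processing instruction precedes the root element, so it is the
# first child of the document node.
instruction = doc.firstChild
target = instruction.target
data = instruction.data
print(target, data)

# Hypothetical application-level step: split the data into name/value
# pairs. XML itself attaches no meaning to this text.
pseudo_attributes = dict(re.findall(r'(\w+)\s*=\s*"([^"]*)"', data))
print(pseudo_attributes)
```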
The target of this processing instruction is robots. The syntax of this
particular processing instruction is two pseudo-attributes, one named
index and one named follow, whose values are either yes or no. The
semantics of this particular processing instruction are that if the
index attribute has the value yes, then search-engine robots should index
this page. If index has the value
no, then robots should not index
the page. Similarly, if follow has
the value yes, then links from this
document will be followed; if it has the value no, they will not be.

An XML document may begin with an XML declaration. This looks like a
processing instruction with the name xml and with version, standalone,
and encoding pseudo-attributes. Technically,
it's not a processing instruction, though; it's just the XML
declaration, nothing more, nothing less. Example 2-7 demonstrates.

<?xml version="1.0" encoding="ASCII" standalone="yes"?>
<person> Alan Turing </person>

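
As an illustration of how a parser surfaces the declaration, the sketch below reads the same document with Python's standard xml.dom.minidom module, which exposes the declared values as attributes of the document node. The choice of minidom is an assumption made here; other APIs may not report the declaration at all.

```python
from xml.dom import minidom

document = (
    '<?xml version="1.0" encoding="ASCII" standalone="yes"?>\n'
    "<person> Alan Turing </person>"
)
doc = minidom.parseString(document)

# minidom copies the pseudo-attributes of the XML declaration onto the
# document node; the declaration is not a child node of the document.
declaration = {
    "version": doc.version,
    "encoding": doc.encoding,
    "standalone": doc.standalone,
}
root_name = doc.documentElement.tagName
print(declaration, root_name)
```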
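If present, the XML declaration must be the first thing in the
document, because a parser uses the first five characters
(<?xml) to make some reasonable guesses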
about the encoding, such as whether the document uses a single-byte or
multibyte character set. The only thing that may precede the XML
declaration is an invisible Unicode byte-order mark. We'll discuss
this further in Chapter 5.

The version attribute should have the value 1.0. Under very
unusual circumstances, it may also have the value 1.1. Since
specifying version="1.1" limits
the document to the most recent versions of only a couple of
parsers, and since all XML 1.1 parsers must also support XML 1.0,
you don't want to casually set the version to 1.1.

No unescaped < or
& signs may occur in the
character data of an element or attribute.

Other well-formedness rules are more obscure (for example, whitespace
may not appear between the opening < and the element name in a tag).

In HTML, for instance, li elements should only be children of
ul or ol elements. Browsers may not know what to do
with them, or may act inconsistently, if li elements appear in the
middle of a blockquote or p element.

A DTD can make statements such as "A ul element only contains li
elements" or "Every employee element must have a
social_security_number attribute." Different
XML applications can use different DTDs to specify what they do and do
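not allow.

Element declarations have this basic form:

<!ELEMENT name content_specification>

The simplest content specification says that an element may contain
only parsed character data. It consists of the keyword #PCDATA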
inside parentheses. For example, this declaration says that a
phone_number element may contain
text but may not contain elements:

<!ELEMENT phone_number (#PCDATA)>

Such an element may also contain CDATA sections (which are always
parsed into pure text), comments, and processing instructions
(which don't really count in validation). It may contain entity
references only if those entity references resolve to plain text
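without any child elements.

A content specification can also require exactly one child element of
a given type. For example, this declaration says that a fax element
must contain exactly one phone_number
element:

<!ELEMENT fax (phone_number)>

A fax element may not
contain anything else except the phone_number element, and it may not
contain more or fewer than one of those.

Attributes are declared with ATTLIST
declarations. A single ATTLIST can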
declare multiple attributes for a single element type. However, if the
same attribute is repeated on multiple elements, then it must be
declared separately for each element where it appears. (Later in this
chapter you'll see how to use parameter entity references to make this
repetition less burdensome.)

For example, this ATTLIST declares
the source attribute of the
image element:

<!ATTLIST image source CDATA #REQUIRED>

It says that the image
element has an attribute named source. The value of the source
attribute is character data, and
instances of the image element in
the document are required to provide a value for the source attribute.

A single ATTLIST declaration
can declare multiple attributes for the same element. For example,
this ATTLIST declaration not only
declares the source attribute of
the image element, but also the
width, height, and alt attributes:

<!ATTLIST image source CDATA #REQUIRED
                width  CDATA #REQUIRED
                height CDATA #REQUIRED
                alt    CDATA #IMPLIED
>

This declaration says the source, width, and height attributes are
required. However, the
alt attribute is optional and may
be omitted from particular image
elements. All four attributes are declared to contain character data,
the most generic attribute type.

The same attributes could instead be declared with four separate
ATTLIST declarations, one
for each attribute. Whether to use one ATTLIST declaration per attribute is a
matter of personal preference, but most experienced DTD designers
prefer the multiple-attribute form. Given judicious application of
whitespace, it's no less legible than the alternative.

XML predefines five entities: &lt; (<), &amp; (&), &gt; (>),
&quot; ("), and &apos; ('). Additional entities are defined with an
ENTITY declaration
in the DTD. This gives the name of the entity, which must be an XML
name, and the replacement text of the entity. For example, this entity
declaration defines &super; as
an abbreviation for supercalifragilisticexpialidocious:

<!ENTITY super "supercalifragilisticexpialidocious">

Once that's done, you can use &super; anywhere you'd normally have to
type the entire word (and probably misspell it).

Entities can contain markup as well as text. For example, this
declaration defines &footer; as an
abbreviation for a standard web page footer that will be repeated on
many pages:

<!ENTITY footer '<hr size="1" noshade="true"/>
<font CLASS="footer">
<a href="index.html">O&apos;Reilly Home</a> |
<a href="sales/bookstores/">O&apos;Reilly Bookstores</a> |
<a href="order_new/">How to Order</a> |
<a href="oreilly/contact.html">O&apos;Reilly Contacts</a><br/>
<a href="http://international.oreilly.com/">International</a> |
<a href="oreilly/about.html">About O&apos;Reilly</a> |
<a href="affiliates.html">Affiliated Companies</a>
</font>
<p>
<font CLASS="copy">
Copyright 2004, O&apos;Reilly Media, Inc.<br/>
<a href="mailto:webmaster@oreilly.com">webmaster@oreilly.com</a>
</font>
</p>
'>

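
The sketch below shows internal general entities being expanded by a parser. It uses Python's standard xml.etree.ElementTree module (an illustrative choice) and declares the entities in an internal DTD subset so that each document is self-contained. The root element names word and page and the entity named rule are hypothetical stand-ins; rule plays the part of a much shorter footer.

```python
import xml.etree.ElementTree as ET

text_document = """<!DOCTYPE word [
  <!ENTITY super "supercalifragilisticexpialidocious">
]>
<word>&super;</word>"""

# The reference is replaced by the entity's replacement text.
expanded = ET.fromstring(text_document).text
print(expanded)

markup_document = """<!DOCTYPE page [
  <!ENTITY rule '<hr size="1"/>'>
]>
<page>&rule;</page>"""

# Replacement text that contains markup is parsed as markup, so the
# page element ends up with an hr child element.
page = ET.fromstring(markup_document)
child_names = [child.tag for child in page]
print(child_names)
```

Because the replacement text is parsed when the entity is referenced, it has to be well-formed itself, which is why tags such as br must be closed inside an entity value.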
An external parsed general entity is also declared with an ENTITY
declaration. However, instead of the actual replacement text, the
SYSTEM keyword and a URL to the replacement
text are given. For example:

<!ENTITY footer SYSTEM "http://www.oreilly.com/boilerplate/footer.xml">

A relative URL can be used instead:

<!ENTITY footer SYSTEM "/boilerplate/footer.xml">

When the general entity reference &footer; is seen in the character data
of an element, the parser may replace it with the document found at
http://www.oreilly.com/boilerplate/footer.xml.
References to external parsed entities are not allowed in attribute
values. Most of the time this shouldn't be too big a hassle because
attribute values tend to be small enough to be easily included in
internal entities.

Not all data is XML. Many plain text files pay no attention to
escaping < as &lt; or adhering to the other
constraints by which an XML document is limited. There are probably
even more JPEG photographs, GIF line art, QuickTime movies, MIDI sound
files, and so on. None of these are well-formed XML, yet all of them
are necessary components of many documents.

For example, this ENTITY
declaration associates the name turing_getting_off_bus with the JPEG image
at http://www.turing.org.uk/turing/pi1/busgroup.jpg:

<!ENTITY turing_getting_off_bus
         SYSTEM "http://www.turing.org.uk/turing/pi1/busgroup.jpg"
         NDATA jpeg>

The NDATA declaration
specifies the type of the data. Here the name jpeg is used. XML does not
recognize this
as meaning an image in a format defined by the Joint Photographic
Experts Group. Rather this is the name of a notation declared
elsewhere in the DTD using a NOTATION
declaration like this:

<!NOTATION jpeg SYSTEM "image/jpeg">

This example uses the MIME media type image/jpeg as the external identifier for
the notation. However, there is absolutely no standard or even a
suggestion for exactly what this identifier should be. Individual
applications must define their own requirements for the contents and
meaning of notations.

An unparsed entity is included in a document through an element with an
ENTITY type
attribute whose value is the name of an unparsed entity declared in
the DTD. You do not use an entity reference like &turing_getting_off_bus;