http://www.unidex.com/turing/utm.htm for one
universal Turing machine written in XSLT.
<?xml version="1.0"?>
<product barcode="2394287410">
  <manufacturer>Verbatim</manufacturer>
  <name>DataLife MF 2HD</name>
  <quantity>10</quantity>
  <size>3.5"</size>
  <color>black</color>
  <description>floppy disks</description>
</product>
G_Clef element when reading a biology document.
Some of these rules can be precisely specified with a schema written
in any of several languages, including the W3C XML Schema Language,
RELAX NG, and DTDs. A document may contain a URL indicating where the
schema can be found. Some XML parsers will notice this and compare
the document to its schema as they read it to see if the document
satisfies the constraints specified there. Such a parser is called a
validating parser.
<SKU>, <Record_ID>,
and <author> that look superficially like
HTML tags. However, in HTML you're limited to about
a hundred predefined tags that describe web page formatting. In XML,
you can create as many tags as you need. Furthermore, these tags will
mostly describe the type of content they contain rather than
formatting or layout information. In XML you don't
say that something is italicized or indented or bold, you say that
it's a book or a biography or a calendar.
<person> Alan Turing </person>
application/xml
or text/xml. However,
specific XML applications may use more specific
MIME media
types, such as application/mathml+xml,
application/xslt+xml,
image/svg+xml,
text/vnd.wap.wml, or even
text/html (in very special cases).
application/xml should
be preferred to text/xml, although many web
servers come configured out of the box to use
text/xml. text/xml uses the
ASCII character set as a default, which is incorrect for most XML
documents.
person. The element
is delimited by the
start-tag
<person> and
the end-tag
</person>.
Everything between the start-tag and the end-tag of the element
(exclusive) is called the element's content. The content of this element is the text:
Alan Turing
<person> and </person> are markup.
The string "Alan Turing" and its
surrounding whitespace are character data. The tag is the most common form of markup
in an XML document, but there are other kinds we'll
discuss later.
Start-tags begin with < and end-tags begin with
</. Both of these are followed by the name of
the element and are closed by >. However,
unlike HTML tags, you are allowed to make up new XML tags as you go
along. To describe a person, use <person>
and </person> tags. To describe a calendar,
use <calendar> and
</calendar> tags. The names of the tags
generally reflect the type of content inside the element, not how
that content will be formatted.
< but ends with
/>. For instance, in
XHTML, an XMLized reformulation of
standard HTML, the line-break and horizontal-rule elements are
written as <br /> and <hr /> instead of
<br> and <hr>.
These are exactly equivalent to
<br></br> and
<hr></hr>, however. Which form you use
for empty elements is completely up to you. However, what you cannot
do in XML and XHTML (unlike HTML) is use only the start-tag (for
instance, <br> or <hr>) without using the matching
end-tag. That would be a well-formedness error.
person element has a
born attribute with the value
1912-06-23 and a died attribute
with the value 1954-06-07:
<person born="1912-06-23" died="1954-06-07"> Alan Turing </person>
<person died = '1954-06-07' born = '1912-06-23' > Alan Turing </person>
<person>
  <name first="Alan" last="Turing"/>
  <profession value="computer scientist"/>
  <profession value="mathematician"/>
  <profession value="cryptographer"/>
</person>
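The two single-line person elements above differ only in attribute order, quote style, and whitespace inside the start-tag. As a minimal sketch, assuming Python's standard xml.etree.ElementTree module as the parser (the variable names are illustrative), the following checks that both spellings yield the same attributes and content:

```python
import xml.etree.ElementTree as ET

# The two spellings of the person element shown above.
first = ET.fromstring(
    '<person born="1912-06-23" died="1954-06-07"> Alan Turing </person>'
)
second = ET.fromstring(
    "<person died = '1954-06-07' born = '1912-06-23' > Alan Turing </person>"
)

# Element.attrib is a plain dict, so the comparison ignores attribute order.
same_attributes = first.attrib == second.attrib
same_content = first.text == second.text

print(same_attributes, same_content)
```

Because the parser exposes attributes as an unordered mapping, code that reads them should look them up by name rather than by position.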
<). This character is always interpreted as
beginning a tag. If you need to use this character in your text, you
can escape it using the entity reference &lt;, the numeric character
reference &#60;, or the hexadecimal numeric character reference &#x3C;.
When a parser reads the document, it replaces any
&lt;, &#60;, or
&#x3C; references it finds with the actual
< character. However, it will not confuse the
references with the starts of tags. For example:
<SCRIPT LANGUAGE="JavaScript">
  if (location.host.toLowerCase( ).indexOf("ibiblio") &lt; 0) {
    location.href="http://ibiblio.org/xml/";
  }
</SCRIPT>
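To see the replacement happen, here is a small sketch that assumes Python's standard xml.etree.ElementTree module as the parser. The expr element name is made up for the illustration. All three ways of writing the less-than sign come back from the parser as the same character:

```python
import xml.etree.ElementTree as ET

# Three ways of writing a less-than sign in character data:
# entity reference, decimal character reference, hexadecimal character reference.
doc = "<expr>a &lt; b, a &#60; b, a &#x3C; b</expr>"

text = ET.fromstring(doc).text
print(text)
```

The application sees only the resolved text; it cannot tell which of the three references the author typed.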
&) either. This is always interpreted as
beginning an entity reference. However, the ampersand may be escaped
using the &amp; entity reference like this:
<company>W.L. Gore &amp; Associates</company>
&#38;:
<company>W.L. Gore &#38; Associates</company>
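The same rule can be checked from code. This is only a sketch, assuming Python's standard library: xml.etree.ElementTree stands in for the parser, and xml.sax.saxutils.escape is one convenient way to escape text before placing it in a document. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_name = "W.L. Gore & Associates"

# An unescaped ampersand in character data is a well-formedness error,
# so the parser refuses the document.
try:
    ET.fromstring("<company>" + raw_name + "</company>")
    rejected = False
except ET.ParseError:
    rejected = True

# escape() replaces &, <, and > with the matching entity references.
escaped_name = escape(raw_name)
company = ET.fromstring("<company>" + escaped_name + "</company>")

print(rejected, escaped_name, company.text)
```

Escaping at the point where text enters the document, rather than by hand, avoids forgetting an ampersand in data that arrives from elsewhere.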
Entity references such as &amp; and character
references such as &#60; are markup. When an
application parses an XML document, it replaces this particular
markup with the actual character or
characters the reference refers to.
&lt;: the less-than sign (<)
&amp;: the ampersand (&)
< and
& characters in those samples must be encoded
as &lt; and &amp;. The
more sections of literal code a document includes and the longer they
are, the more tedious this encoding becomes. Instead you can enclose
each sample of literal code in a CDATA section. A CDATA section is
set off by <![CDATA[ and ]]>.
Everything between the <![CDATA[ and the
]]> is treated as raw character data. Less-than
signs don't begin tags. Ampersands
don't start entity references. Everything is simply
character data, not markup.
<p>You can use a default <code>xmlns</code> attribute to avoid
having to add the svg prefix to all your elements:</p>
<pre><![CDATA[
<svg xmlns="http://www.w3.org/2000/svg"
width="12cm" height="10cm">
<ellipse rx="110" ry="130" />
<rect x="4cm" y="1cm" width="3cm" height="6cm" />
</svg>
]]></pre>
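A CDATA section is only an alternative spelling of the same character data. The sketch below, which assumes Python's standard xml.etree.ElementTree module as the parser and uses a shortened version of the sample, compares a CDATA section with its hand-escaped equivalent:

```python
import xml.etree.ElementTree as ET

# The same literal code written once inside a CDATA section
# and once with entity references.
with_cdata = ET.fromstring(
    '<pre><![CDATA[<ellipse rx="110" ry="130" />]]></pre>'
)
with_escapes = ET.fromstring(
    '<pre>&lt;ellipse rx="110" ry="130" /&gt;</pre>'
)

# Both parse to plain text; neither produces a child ellipse element.
print(with_cdata.text)
print(with_cdata.text == with_escapes.text, len(with_cdata))
```

In both cases the pre element has no children: the ellipse tag is reported as text, not as markup.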
< with &lt;. The result will be a sample SVG
document, not an embedded SVG picture, as might happen if this
example were not placed inside a CDATA section.
The only thing that cannot appear in a CDATA
section is the CDATA section end delimiter, ]]>.
CDATA sections exist for the convenience of human
authors, not for programs. Parsers are not required to tell you
whether a particular block of text came from a
CDATA section, from normal character data, or from
character data that contained entity references such as
&lt; and &amp;. By the
time you get access to the data, these differences will have been
washed away. No code you write should depend on the difference
between them.
Comments begin with <!-- and end
with the first occurrence of -->. For example:
<!-- I need to verify and update these links when I get a chance. -->
The double hyphen -- must not appear anywhere
inside the comment until the closing -->. In
particular, a three-hyphen close like ---> is
specifically forbidden.
script element are sometimes enclosed in a
comment to protect it from display by a nonscript-aware browser. The
Apache web server parses comments in .shtml
files to recognize server-side includes. Unfortunately, these
documents may not survive being passed through various HTML editors
and processors with their comments and associated semantics intact.
Worse yet, it's possible for an innocent comment to
be misconstrued as input to the application.
A processing instruction begins with <? and ends
with ?>. Immediately following the
<? is an XML name called the target,
possibly the name of the application for which this processing
instruction is intended or possibly just an identifier for this
particular processing instruction. The rest of the processing
instruction contains text in a format appropriate for the
applications for which the instruction is intended.
In HTML, a robots META tag is used to
tell search-engine and other robots whether and how they should index
a page. The following processing instruction has been proposed as an
equivalent for XML documents:
<?robots index="yes" follow="no"?>
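A parser typically hands a processing instruction to the application as a target plus a raw data string, and leaves the interpretation of that string to the application. The following sketch assumes Python's standard xml.dom.minidom module. The page root element and the regular expression used to pick apart the data are assumptions made for the illustration, not part of any robots specification.

```python
import re
from xml.dom import minidom

doc = minidom.parseString('<?robots index="yes" follow="no"?><page/>')
pi = doc.firstChild  # the processing instruction precedes the root element

# The parser reports only the target and the unparsed data string.
target = pi.target
data = pi.data

# The name="value" pairs are a convention of this instruction, not of XML,
# so the application has to split them out itself.
pseudo = dict(re.findall(r'(\w+)\s*=\s*"([^"]*)"', data))

print(target, pseudo)
```

A different processing instruction could put any text it likes after the target, so code like the regular expression above belongs to the application that understands that particular target.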
The target of this processing instruction is robots. The syntax of this particular processing
instruction is two pseudo-attributes, one named
index and one named follow,
whose values are either yes or
no. The semantics of this particular processing
instruction are that if the index attribute has
the value yes, then search-engine robots should
index this page. If index has the value
no, then robots should not index the page.
Similarly, if follow has the value
yes, then links from this document will be
followed; if it has the value no, they
won't be.
The XML declaration looks like a processing instruction with the name xml and with version,
standalone, and encoding
pseudo-attributes. Technically, it's not a
processing instruction, though; it's just the XML
declaration, nothing more, nothing less. Example 2-7
demonstrates.
<?xml version="1.0" encoding="ASCII" standalone="yes"?>
<person> Alan Turing </person>
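As a rough illustration of how an application can read these values back, the sketch below parses the document from Example 2-7 with Python's standard xml.dom.minidom module, which is an assumption of this example rather than anything the XML specification requires. The document is passed as an already-decoded string, so the declared encoding is reported but not used for decoding.

```python
from xml.dom import minidom

source = (
    '<?xml version="1.0" encoding="ASCII" standalone="yes"?>\n'
    "<person> Alan Turing </person>"
)
doc = minidom.parseString(source)

# minidom records the pseudo-attributes of the XML declaration
# on the Document node.
print(doc.version, doc.encoding, doc.standalone)

# The declaration is not part of the content; the root element is unaffected.
name = doc.documentElement.firstChild.data.strip()
print(name)
```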
<?xml) to make some reasonable guesses about
the encoding, such as whether the document uses a single-byte or
multibyte character set. The only thing that may precede the XML
declaration is an invisible Unicode byte-order mark.
We'll discuss this further in Chapter 5.
The version
attribute should have the value
1.0. Under very unusual circumstances, it may also have the value
1.1. Since specifying version="1.1" limits the
document to the most recent versions of only a couple of parsers, and
since all XML 1.1 parsers must also support XML 1.0, you
don't want to casually set the version to 1.1.
No unescaped < or &
signs may occur in the character data of an element or attribute.
< and the element name in a tag).
li elements should only be
children of ul or ol elements.
Browsers may not know what to do with them, or may act
inconsistently, if li elements appear in the
middle of a blockquote or p
element.
"A ul element only contains li
elements" or "Every
employee element must have a
social_security_number
attribute." Different XML applications can use
different DTDs to specify what they do and do not allow.
<!ELEMENT name content_specification>
#PCDATA
inside parentheses. For example, this declaration says that a
phone_number element may contain text but may not
contain elements:
<!ELEMENT phone_number (#PCDATA)>
CDATA sections (which are always parsed into pure
text) and comments, and processing instructions (which
don't really count in validation). It may contain
entity references only if those entity references resolve to plain
text without any child elements.
fax element
must contain exactly one phone_number element:
<!ELEMENT fax (phone_number)>
fax element may not contain anything else except the
ATTLIST
declarations. A single ATTLIST can declare
multiple attributes for a single element type. However, if the same
attribute is repeated on multiple elements, then it must be declared
separately for each element where it appears. (Later in this chapter
you'll see how to use parameter entity references to
make this repetition less burdensome.)
This ATTLIST declares the
source attribute of the image
element:
<!ATTLIST image source CDATA #REQUIRED>
It says that the image element has an attribute
named source. The value of the
source attribute is character data, and instances
of the image element in the document are required
to provide a value for the source attribute.
A single ATTLIST declaration can declare multiple
attributes for the same element. For example, this
ATTLIST declaration not only declares the
source attribute of the image
element, but also the width,
height, and alt attributes:
<!ATTLIST image source CDATA #REQUIRED
                width  CDATA #REQUIRED
                height CDATA #REQUIRED
                alt    CDATA #IMPLIED
>
The source,
width, and height attributes
are required. However, the alt attribute is
optional and may be omitted from particular image
elements. All four attributes are declared to contain character data,
the most generic attribute type.
ATTLIST declarations, one for each attribute.
Whether to use one ATTLIST declaration per
attribute is a matter of personal preference, but most experienced
DTD designers prefer the multiple-attribute form. Given judicious
application of whitespace, it's no less legible than
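the alternative.
&lt;: the less-than sign (<)
&amp;: the ampersand (&)
&gt;: the greater-than sign (>)
&quot;: the straight double quotation mark (")
&apos;: the straight single quotation mark (')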
ENTITY
declaration in the DTD. This gives the name of the entity, which must
be an XML name, and the replacement text of the entity. For example,
this entity declaration defines &super; as an
abbreviation for supercalifragilisticexpialidocious:
<!ENTITY super "supercalifragilisticexpialidocious">
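To show the declaration in use, the following sketch places it in the internal DTD subset of a small document and parses the result. It assumes Python's standard xml.etree.ElementTree module, whose underlying parser expands internal entities. The song element and its text are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET

# The ENTITY declaration sits in the internal DTD subset of the DOCTYPE.
# The song element is a hypothetical wrapper for the illustration.
document = """<!DOCTYPE song [
  <!ENTITY super "supercalifragilisticexpialidocious">
]>
<song>It's &super;!</song>"""

# The parser substitutes the replacement text for the entity reference.
lyric = ET.fromstring(document).text
print(lyric)
```

Expanding entities from untrusted documents can be abused to consume memory, so a sketch like this is best limited to documents whose DTD you control.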
&super; anywhere you'd
normally have to type the entire word (and probably misspell it).
&footer; as an
abbreviation for a standard web page footer that will be repeated on
many pages:
ENTITY declaration. However, instead of
the actual replacement text, the SYSTEM keyword
and a URL to the replacement text are given. For example:
<!ENTITY footer SYSTEM "http://www.oreilly.com/boilerplate/footer.xml">
<!ENTITY footer SYSTEM "/boilerplate/footer.xml">
When &footer; is seen in the character data of an
element, the parser may replace it with the document found at
http://www.oreilly.com/boilerplate/footer.xml.
References to external parsed entities are not allowed in attribute
values. Most of the time this shouldn't be too big a
hassle because attribute values tend to be small enough to be easily
included in internal entities.