Chapter 4. Navigating XML by Using Paths
Path expressions are used to navigate XML input documents to select elements and attributes of interest. This chapter explains how to use path expressions to select elements and attributes from an input document and apply predicates to filter those results. It also covers the different methods of accessing input documents.
For example, a path expression can select all the product children of the catalog element in the catalog.xml document. Table 4-1 shows some other simple path expressions.
Path expressions return nodes in document order. This means that the examples in Table 4-1 return the
product elements in the same order that they appear in the catalog.xml document. More information on document order and on sorting results differently can be found in Chapter 7.
Path Expressions and Context
A path expression is always evaluated relative to a particular context item, which serves as the starting point for the relative path. Some path expressions start with an initial step that sets the context item, such as a call to the doc function. The function call doc("catalog.xml") returns the document node of the catalog.xml document, which becomes the context item. When the context item is a node (as opposed to an atomic value), it is called the context node. The rest of the path is evaluated relative to it. Another example is a path that starts with a variable reference, where the value of the variable $catalog sets the context. The variable must select zero, one, or more nodes, which become the context nodes for the rest of the expression.
A path expression can also be relative, with no initial step that sets the context. This means that the path expression will be evaluated relative to the current context node, which must have been previously determined outside the expression. It may have been set by the processor outside the scope of the query, or in an outer expression.
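As a minimal sketch of how a context gets established, the query below uses an inline $catalog element as a stand-in for catalog.xml. The data values are assumptions made for illustration, not the contents of the real file.

```xquery
(: Hypothetical inline data standing in for catalog.xml :)
let $catalog :=
  <catalog>
    <product dept="WMN"><number>557</number></product>
    <product dept="ACC"><number>563</number></product>
  </catalog>
return (
  (: the variable reference sets the context for the remaining steps :)
  $catalog/product/number,
  (: inside the predicate, the relative path "number" is evaluated
     with each product element in turn as the context node :)
  $catalog/product[number = 563]/@dept/string()
)
```

The first expression returns both number elements; the second returns the string "ACC", because only the second product satisfies the predicate.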
Steps and changing context
The context item changes with each step. In a path made up of the steps doc("catalog.xml"), catalog, product, and number, the doc("catalog.xml") step returns one document node that serves as the context item when evaluating the catalog step. The catalog step is evaluated using the document node as the current context node, returning a sequence of one catalog element child of the document node. This catalog element then serves as the context node for evaluation of the product step, which returns the sequence of product children of catalog.
The final step, number, is evaluated in turn for each product child in this sequence. During this process, the processor keeps track of three things:
- The context node itself, for example the product element that is currently being processed
- The context sequence, which is the current sequence of items being processed
- The position of the context node within the context sequence, which can be used to retrieve nodes based on their position
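The three pieces of context can be made visible with the position and last functions. This is only a sketch over hypothetical inline data; the final step returns strings rather than nodes, which is allowed because it is the last step.

```xquery
(: Hypothetical inline data standing in for catalog.xml :)
let $catalog :=
  <catalog>
    <product><name>Fleece Pullover</name></product>
    <product><name>Floppy Sun Hat</name></product>
    <product><name>Deluxe Travel Bag</name></product>
  </catalog>
(: for each product: its position, the size of the context sequence,
   and a value read relative to the context node :)
return $catalog/product/concat(position(), " of ", last(), ": ", name)
```

This returns the three strings "1 of 3: Fleece Pullover", "2 of 3: Floppy Sun Hat", and "3 of 3: Deluxe Travel Bag".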
As we have seen in previous examples, steps in a path can simply be primary expressions like function calls (
doc("catalog.xml")) or variable references (
$catalog). Any expression that returns nodes can be on the lefthand side of the slash operator.
A step can also be an axis step, which is one of two kinds:
- Forward step
- Reverse step
In the examples so far, steps such as catalog, product, and @dept are all axis steps (that happen to be forward steps). The syntax of an axis step is shown in Figure 4-1.
Each forward or reverse step has an axis, which defines the direction and relationship of the selected nodes. For example, the
child:: axis (a forward axis) can be used to indicate that only child nodes should be selected, while the
parent:: axis (a reverse axis) can be used to indicate that only the parent node should be selected. The 12 axes are listed in Table 4-2.
In addition to having an axis, each axis step has a node test. The node test indicates which of the nodes (by name or node kind) to select, along the specified axis. For example,
child::product only selects
product element children of the context node. It does not select other kinds of children (for example, text nodes), or other
product elements that are not children of the context node.
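The following sketch pairs a few axes with node tests, using the unabbreviated syntax throughout. The inline data is an assumption made for illustration.

```xquery
(: Hypothetical inline data :)
let $catalog :=
  <catalog>
    <product dept="ACC"><number>563</number><name>Floppy Sun Hat</name></product>
  </catalog>
let $number := $catalog/child::product/child::number
return (
  name($number/parent::*),                          (: reverse axis: "product" :)
  name($number/following-sibling::*),               (: forward axis: "name" :)
  string($number/parent::product/attribute::dept)   (: "ACC" :)
)
```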
Node name tests
In previous examples, most of the node tests were based on names, such as product and dept. These are known as name tests. The syntax of a node name test is shown in Figure 4-2.
Node name tests and namespaces
Names used in node tests are qualified names, meaning that they are affected by namespace declarations. A namespace declaration is in scope if it appears in an outer element, or in the query prolog. The names may be prefixed or unprefixed. If a name is prefixed, its prefix must be bound to a namespace by using a namespace declaration.
If an element name is unprefixed, and there is an in-scope default namespace declared, it is considered to be in that namespace; otherwise, it is in no namespace. Attribute names, on the other hand, are not affected by default namespace declarations.
Use of namespace prefixes in path expressions is depicted in Example 4-1, where the
prod prefix is first bound to the namespace, and then used in steps such as prod:number. Keep in mind that the prefix is just serving as a proxy for the namespace name. It is not important that the prefixes in the path expressions match the prefixes in the input document; it is only important that the prefixes are bound to the same namespace. In Example 4-1, you could use the prefix
pr instead of
prod in the query, as long as you used it consistently throughout the query.
Example 4-1. Prefixed name tests
Input document (prod_ns.xml)
Node name tests and wildcards
You can use wildcards to match names. The step
child::* (abbreviated simply
*) can be used to select all element children, regardless of name. Likewise, attribute::* (abbreviated @*) can be used to select all attributes, regardless of name.
In addition, wildcards can be used for just the namespace and/or local part of a name. The step
prod:* selects all child elements in the namespace bound to the prefix
prod, and the step
*:product selects all
product child elements that are in any namespace, or no namespace.
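To illustrate prefixed name tests and wildcards together, here is a small sketch. The namespace URI and the inline element are hypothetical and are not the contents of prod_ns.xml; the query deliberately uses a different prefix (prod) than the data (pr) to show that only the namespace name matters.

```xquery
declare namespace prod = "http://example.com/prod";  (: hypothetical URI :)
let $doc :=
  <pr:product xmlns:pr="http://example.com/prod">
    <pr:number>563</pr:number>
    <pr:name language="en">Floppy Sun Hat</pr:name>
  </pr:product>
return (
  $doc/prod:number/string(),   (: "563": prefixes differ, namespace matches :)
  count($doc/prod:*),          (: 2: every child in that namespace :)
  count($doc/*:name),          (: 1: local name "name" in any namespace :)
  count($doc/name),            (: 0: unprefixed means no namespace here :)
  $doc/prod:name/@*/string()   (: "en": all attributes of the name element :)
)
```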
Node kind tests
In addition to the tests based on node name, you can test based on node kind. The syntax of a node kind test is shown in Figure 4-3. (The detailed syntax of
the element and attribute tests is shown in Figure 14-4.)
The test node() will retrieve all different kinds of nodes. You can specify
node() as the entire step, and it will default to the
child:: axis. In this case, it will bring back child element, text, comment, and processing-instruction nodes (but not attributes, because they are not considered children). This is in contrast to
*, which selects child element nodes only.
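The difference between node() and * is easy to see on mixed content. The desc element below is a hypothetical example with two text nodes, one element, and one comment as children.

```xquery
(: Hypothetical mixed-content element :)
let $desc := <desc>Our <i>favorite</i> shirt!<!-- seasonal --></desc>
return (
  count($desc/node()),   (: 4: text, i element, text, comment :)
  count($desc/*)         (: 1: the i element only :)
)
```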
You can also use
node() in conjunction with the axes. For example,
ancestor::node() returns all ancestor element nodes and the document node (if it exists). This is different from
ancestor::*, which returns ancestor element nodes only. You can even use
attribute::node(), which will return attribute nodes, but this is not often used because it means the same as @*.
Four other node kind tests, text(), comment(), processing-instruction(), and document-node(), are discussed in Chapter 22.
If you are using schemas, you can also test elements and attributes based on their type by using node kind tests. For example, you can specify
element(*, ProductType) to return all elements whose type is ProductType, or element(product, ProductType) to return all elements named product whose type is ProductType. This is discussed further in “Sequence Types and Schemas”.
Some axes and steps can be abbreviated, as shown in Table 4-3. The abbreviations "." and ".." are used as the entire step (with no node test). The step "." represents the current context node itself, regardless of its node kind. Likewise, the step ".." represents the parent node, which could be either an element node or a document node.
The @ abbreviation, on the other hand, replaces the axis only, so it is used along with a node test or wildcard. For example, you can use
@dept to select
dept attributes, or
@* to select all attributes.
The // abbreviation is a shorthand to indicate a descendant anywhere in a tree. For example,
catalog//number will match all
number elements at any level among the descendants of
catalog. You can start a path with
.// if you want to limit the selection to descendants of the current context node.
Table 4-4 shows additional examples of abbreviated and unabbreviated syntax.
|Unabbreviated syntax||Abbreviated equivalent|
Other Expressions as Steps
In addition to axis steps, other expressions can also be used as steps. You have already seen this in paths that begin with doc("catalog.xml"), which is a function call that is used as a step. You can include more complex expressions, for example:
doc("catalog.xml")/catalog/product/(number | name)
If the expression in a step contains an operator with lower precedence than /, it needs to be in parentheses. Some other examples of more complex steps are provided in Table 4-5.
The last step (and only the last step) in a path may return atomic values rather than nodes. The last example in Table 4-5 will return a sequence of atomic values that are the substrings of the product names. Error
XPTY0019 is raised if a step that is not the last returns atomic values. For example:
doc("catalog.xml")//product/substring(name, 1, 30)/replace(., ' ', '-')
will raise an error because the
substring step returns atomic values, and it is not the last step.
Predicates are used in a path expression to filter the results to contain only items that meet specific criteria. Using a predicate, you can, for example, select only the elements that have a certain value for an attribute or child element, using a predicate like [@dept = "ACC"]. You can also select only elements that have a particular attribute or child element, using a predicate such as [color], or elements that occur in a particular position within their parent, using a positional predicate.
The syntax of a predicate is simply an expression in square brackets ([ and ]). Table 4-6 shows some examples of predicates.
If the expression evaluates to anything other than a number, its effective boolean value is determined. This means that if it evaluates to the boolean value false, the number 0 or
NaN, a zero-length string, or the empty sequence, it is considered
false. In most other cases, it is considered
true. If the effective boolean value is
true for a particular node, that node is returned. If it is
false, the node is not returned.
If the expression evaluates to a number, it is interpreted as the position as described in “Using Positions in Predicates”.
As you can see from the last example, the predicate is not required to appear at the end of the entire path expression. Predicates can appear at the end of any step.
product[number] is different from
product/number. While both expressions filter out products that have no
number child, in the former expression, the
product element is returned. In the latter case, the
number element is returned.
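A short sketch of that distinction, over hypothetical inline data in which only the first product has a number child:

```xquery
(: Hypothetical inline data: the second product has no number child :)
let $catalog :=
  <catalog>
    <product dept="WMN"><number>557</number></product>
    <product dept="ACC"/>
  </catalog>
return (
  $catalog/product[number]/@dept/string(),  (: "WMN": the product itself is selected :)
  $catalog/product/number/string()          (: "557": the number child is selected :)
)
```

Both paths skip the second product, but the first keeps navigating from the product element while the second has already moved down to number.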
Comparisons in Predicates
The examples in the previous section use general comparison operators like = and <. You can also use the corresponding value comparison operators, such as eq and lt, but you should be aware of the difference. Value comparison operators only allow a single value, while general comparison operators allow sequences of zero, one, or more values. Therefore, a predicate that applies a value comparison operator to the effDate attribute of a priceList element is acceptable, because each priceList element can have only one effDate attribute. However, if you wanted to find all the priceList elements that contain the product 557, you might try a predicate such as [prod/@num eq 557]. This raises an error if a priceList has more than one prod child, because eq allows only a single value on each side. The predicate [prod/@num = 557], on the other hand, returns a priceList if it has at least one prod child whose num attribute is equal to 557. It might have other prod children whose numbers are not equal to 557.
In both cases, if a particular
priceList does not have any
prod children with
num attributes, it does not return that
priceList, but it does not raise an error.
Another difference is that value comparison operators treat all untyped data like strings. If we fixed the previous problem with eq by returning prod nodes instead, with a predicate such as [@num eq 557] on the prod step, it would still raise an error (XPTY0004) if no schema were present, because it treats the num attribute like a string, which can’t be compared to a number. The = operator, on the other hand, will cast the value of the num attribute to xs:double and then compare it to 557, as you would expect.
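The sketch below contrasts the two kinds of operator on untyped data. The prices element and its effDate values are hypothetical; only the element and attribute names come from the discussion above.

```xquery
(: Hypothetical, schema-less price data :)
let $prices :=
  <prices>
    <priceList effDate="2015-11-15"><prod num="557"/><prod num="563"/></priceList>
    <priceList effDate="2015-11-17"><prod num="443"/></priceList>
  </prices>
return (
  (: general comparison: true if any prod/@num equals 557 :)
  $prices/priceList[prod/@num = 557]/@effDate/string(),
  (: value comparison: one value per side, so test each prod on its own
     and cast the untyped attribute explicitly :)
  $prices/priceList/prod[xs:integer(@num) eq 557]/@num/string()
)
```

The first expression returns "2015-11-15" and the second returns "557". Without the xs:integer cast, the eq comparison would raise XPTY0004 on this untyped data.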
For these reasons, general comparison operators are easier to use than value comparison operators in predicates when children are untyped or repeating. The down side of general comparison operators is that they also make it less likely that the processor will catch any mistakes you make. In addition, they may be more expensive to evaluate because it’s harder for the processor to make use of indexes.
Using Positions in Predicates
Another use of predicates is to specify the numeric position of an item within the sequence of items currently being processed. These are sometimes called, predictably, positional predicates. For example, if you want the fourth product in the catalog, you can add the predicate [4] to the product step.
The positions start with 1 for the first item, as opposed to 0 as they do in some programming languages. Any predicate expression that evaluates to an integer will be considered a positional predicate. If you specify a number that is greater than the number of items in the context sequence, it does not raise an error; it simply does not return any nodes. For example, the predicate [99] on the product step returns the empty sequence.
Understanding positional predicates
With positional predicates, it is important to understand that the position is the position within the current sequence of items being processed, not the position of an element relative to its parent’s children. Consider an expression such as:
catalog/product/name[1]
This expression refers to the first name child of each product; the step name is evaluated once for every product element. It does not necessarily mean that the name element is the first child of product.
It also does not return the first name element that appears in the document as a whole. If you wanted just the first name element in the document, you could use an expression such as:
(catalog/product/name)[1]
because the parentheses change the order of evaluation. First, all the name elements are returned; then, the first one of those is selected. Alternatively, you could use:
catalog/descendant::name[1]
because the sequence of descendants is evaluated first, then the predicate is applied. However, this is different from the abbreviated expression:
catalog//name[1]
which, like the first example, returns the first name child of each of the products. That’s because it’s an abbreviation for:
catalog/descendant-or-self::node()/name[1]
The position and last functions are also useful when writing predicates based on position. The
position function returns the position of the context item within the context sequence (the current sequence of items being processed). The function takes no arguments and returns an integer representing the position (starting with 1, not 0) of the context item. For example:
doc("catalog.xml")/catalog/product[position() < 3]
returns the first two
product children of
catalog. You could also select the first two children of each
product, with any name, using:
doc("catalog.xml")/catalog/product/*[position() < 3]
The wildcard * matches child elements with any name. Note that the predicate [position() = 3] is equivalent to the predicate [3], so the position function is not very useful in this case.
When using positional predicates, you should be aware that the
to keyword does not work as you might expect when used in predicates. If you want the first three products, it may be tempting to use the syntax:
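doc("catalog.xml")/catalog/product[1 to 3]
However, this raises an error, because the predicate evaluates to a sequence of three numbers rather than a single number or a boolean value. You can instead use a general comparison against the range:
doc("catalog.xml")/catalog/product[position() = (1 to 3)]
or call the subsequence function:
doc("catalog.xml")/catalog/subsequence(product, 1, 3)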
The last function returns the number of items in the current sequence. It takes no arguments and returns an integer representing the number of items. The last function is useful for testing whether an item is the last one in the sequence. For example, catalog/product[last()] returns the last product child of catalog.
Table 4-7 shows some examples of predicates that use the position of the item. The descriptions assume that there is only one
catalog element, which is the case in the catalog.xml example.
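The positional forms discussed in this section can be compared side by side. This is a sketch over hypothetical inline data, not a reproduction of Table 4-7.

```xquery
(: Hypothetical inline data standing in for catalog.xml :)
let $catalog :=
  <catalog>
    <product><number>557</number><name>Fleece Pullover</name></product>
    <product><number>563</number><name>Floppy Sun Hat</name></product>
    <product><number>443</number><name>Deluxe Travel Bag</name></product>
  </catalog>
return (
  $catalog/product/name[1]/string(),      (: first name of each product: 3 names :)
  ($catalog/product/name)[1]/string(),    (: first name overall: "Fleece Pullover" :)
  $catalog/product[position() < 3]/number/string(),        (: "557", "563" :)
  $catalog/product[position() = (1 to 2)]/number/string(), (: "557", "563" :)
  $catalog/product[last()]/number/string()                 (: "443" :)
)
```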
In XQuery, it’s very unusual to use the position or last functions anywhere except within a predicate. It’s not an error, however, as long as the context item is defined. For example, a/last() returns, for each a element, the same number as count(a).
Positional predicates and reverse axes
Oddly, positional predicates have the opposite meaning when using reverse axes such as ancestor and preceding-sibling. These axes, like all axes, return nodes in document order. For example, an expression that applies the step ancestor::* to the i element returns the ancestors of the i element in document order, namely the catalog element, followed by the fourth product element, followed by the desc element. However, if you add a positional predicate, as in ancestor::*[1], you might expect to get the catalog element, but you will actually get the nearest ancestor, the desc element. To get the catalog element, count from the other end instead, as in ancestor::*[last()].
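A minimal sketch of this behavior, using a hypothetical one-product catalog in which the i element is nested inside desc:

```xquery
(: Hypothetical inline data :)
let $catalog :=
  <catalog>
    <product><desc>Our <i>favorite</i> shirt!</desc></product>
  </catalog>
let $i := $catalog//i
return (
  $i/ancestor::*/name(),          (: document order: "catalog", "product", "desc" :)
  $i/ancestor::*[1]/name(),       (: nearest ancestor: "desc" :)
  $i/ancestor::*[last()]/name()   (: farthest ancestor: "catalog" :)
)
```

Within the predicate, positions are counted in reverse document order along the reverse axis, even though the final result is delivered in document order.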
Using Multiple Predicates
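Multiple predicates can be applied to the same step to filter on more than one condition. For example:
doc("catalog.xml")/catalog/product[number < 500][@dept = "ACC"]
selects only product elements with a number child whose value is less than 500 and whose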
dept attribute is equal to
ACC. This can also be equivalently expressed as:
doc("catalog.xml")/catalog/product[number < 500 and @dept = "ACC"]
It is sometimes useful to combine the positional predicates with other predicates, as in:
doc("catalog.xml")/catalog/product[@dept = "ACC"][2]
which represents “the second product child that has a dept attribute whose value is ACC,” namely the third product element. The order of the predicates is significant. If the previous example is changed to:
doc("catalog.xml")/catalog/product[2][@dept = "ACC"]
it means something different, namely “the second product child, if it has a dept attribute whose value is ACC.” This is because the predicate changes the context, and the context node for the second predicate in this case is the second product element.
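The effect of predicate order can be checked with a small sketch. The inline data is hypothetical, with the second and third products in the ACC department.

```xquery
(: Hypothetical inline data standing in for catalog.xml :)
let $catalog :=
  <catalog>
    <product dept="WMN"><number>557</number></product>
    <product dept="ACC"><number>563</number></product>
    <product dept="ACC"><number>443</number></product>
    <product dept="MEN"><number>784</number></product>
  </catalog>
return (
  $catalog/product[@dept = "ACC"][2]/number/string(),  (: "443": second of the ACC products :)
  $catalog/product[2][@dept = "ACC"]/number/string(),  (: "563": second product, which is ACC :)
  $catalog/product[1][@dept = "ACC"]/number/string()   (: empty: first product is WMN :)
)
```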
More Complex Predicates
So far, the examples of predicates have been simple path expressions, comparison expressions, and numbers. In fact, any expression is allowed in a predicate, making it a very flexible construct. For example, predicates can contain function calls, as in a predicate such as [contains(@dept, "A")] on the product step, which returns all product children whose dept attribute contains the letter A. They can contain conditional expressions, as in:
doc("catalog.xml")/catalog/product[if ($descFilter) then desc else true()]
which filters product elements based on their desc child only if the variable $descFilter is true. They can also contain expressions that combine sequences, as in:
doc("catalog.xml")/catalog/product[* except number]
which returns all
product children that have at least one child other than
number. General comparisons with multiple values can be used, as in:
doc("catalog.xml")/catalog/product[@dept = ("ACC", "WMN", "MEN")]
which returns products whose
dept attribute value is any of those three values. This is similar to an SQL IN condition.
To retrieve every third
product child of
catalog, you could use the expression:
doc("catalog.xml")/catalog/product[position() mod 3 = 0]
because it selects all the products whose position is divisible by 3.
Predicates can even contain path expressions that themselves have predicates. For example, the predicate:
[*[3][self::colorChoices]]
applied to the product step can be used to find all product elements whose third child element is colorChoices. The *[3][self::colorChoices] is part of a separate path expression that is itself within a predicate. The step *[3] selects the third child element of product, and [self::colorChoices] is a way of testing the name of the current context element.
Predicates are not limited to use with path expressions. They can be used with any sequence. For example:
(1 to 100)[. mod 5 = 0]
can be used to return the integers from 1 to 100 that are divisible by 5.
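A few more predicates on non-node sequences, as a sketch; the sequences are arbitrary illustrative values.

```xquery
(
  (1 to 20)[. mod 5 = 0],                    (: 5, 10, 15, 20 :)
  ("a", "bb", "ccc")[string-length(.) > 1],  (: "bb", "ccc" :)
  (10, 20, 30)[2]                            (: 20: positional predicate :)
)
```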
A Closer Look at Context
As mentioned earlier, a path expression is always evaluated relative to a particular context. The processor can set the context node outside the query, or, starting in version 3.0, the context can be explicitly specified in a context item declaration. Alternatively, the context node can be set by an outer expression. In XQuery, the only operators that change the context node are the slash, the square brackets used in predicates, and the simple map operator. For example, consider a path whose last step, following product, is a conditional expression such as (if (desc) then desc else name).
In this case, the conditional expression in the last step uses the paths desc and name. Because it is entirely contained in one step of another (outer) path expression, it is evaluated with the context node being the product element. Therefore, desc and name are tested as children of product.
In some cases, the context node is absent. This might occur if the processor does not set the context node outside the scope of the query or in a context item declaration, and there is no outer expression that sets the context. In these cases, using a relative path such as desc raises error XPDY0002.
Working with the Context Node
It is sometimes useful to be able to reference the context node, either in a step or in a predicate. A prior example retrieved
product elements whose
number child is less than 500 by using the expression:
doc("catalog.xml")/catalog/product[number < 500]
Suppose, instead, you want to retrieve the
number child itself. You can do this using the expression:
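doc("catalog.xml")/catalog/product/number[. < 500]
In the predicate, the period (.) represents the context node itself, which is each number element in turn. Functions can also operate on the context node, as in:
doc("catalog.xml")/catalog/product/desc[string-length() > 20]
which uses the string-length function to test the length of the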
desc value. It was not necessary to pass the “.” to the
string-length function. This is because the defined behavior of this particular function is such that if no argument is passed to the function, it defaults to the context node.
Accessing the Root
When the context node is part of a complete XML document, the root is a document node (not the outermost element). However, XQuery also allows nodes to participate in tree fragments, which can be rooted at any kind of node.
When a path expression starts with one forward slash, as in /catalog/product, the path is evaluated relative to the root of the tree containing the current context node. For example, if the current context node is a
number element in the catalog.xml document, the path
/catalog/product retrieves all
product children of
catalog in catalog.xml.
When a path expression starts with two forward slashes, as in //product, it is referring to any
product element in the tree containing the current context node. Starting an expression with / or // is allowed only if the current context node is part of a complete XML document (with a document node at its root). The / can also be used as an expression in its own right, to refer to the root of the tree containing the context node (provided this is a document node).
The root function also returns the root of the tree containing a node. It can be used in conjunction with path expressions to find siblings and other elements that are in the same document. For example,
root($myNode)//product retrieves all
product elements that are in the same document (or document fragment) as
$myNode. When using the
root function, it’s not necessary for the tree to be rooted at a document node.
It is a common requirement that the paths in your query will not be static but will instead be calculated based on some input to the query. For example, if you want to provide users with a search capability where they choose the elements in the input document to search, you can’t use a static path in your query. XQuery does not provide any built-in support for evaluating dynamic paths, but you do have a couple of alternatives.
For simple paths, it is easy enough to test for an element’s name by using the
name function instead of including it directly as a step in the path. For example, if the name of the element to search and its value are bound to the variables $elementName and $searchValue, you can use a path like:
doc("catalog.xml")//*[name() = $elementName][. = $searchValue]
If the dynamic path is more complex than a simple element or attribute name, you can use an implementation-specific function. Most XQuery implementations provide a function for dynamic evaluation of paths or entire queries. For example, the Saxon implementation has the
saxon:evaluate function, while in MarkLogic it is called
xdmp:eval. In Saxon, you could use the following expression to get the same results as the previous example:
saxon:evaluate(concat('doc("catalog.xml")//',
$elementName,'[. = "', $searchValue, '"]'))
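The portable name-test approach can be tried without any extension function. In this sketch the two variables are declared in the prolog with hypothetical search inputs, and the catalog is hypothetical inline data; in practice the values would be supplied from outside the query.

```xquery
(: Hypothetical search inputs :)
declare variable $elementName := "number";
declare variable $searchValue := "563";

let $catalog :=
  <catalog>
    <product><number>557</number><name>Fleece Pullover</name></product>
    <product><number>563</number><name>Floppy Sun Hat</name></product>
  </catalog>
(: test the element name at run time instead of writing it as a step :)
return $catalog//*[name() = $elementName][. = $searchValue]
```

This returns the single number element whose value is 563. Changing $elementName to "name" and $searchValue to "Floppy Sun Hat" would return the matching name element instead.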
The Simple Map Operator
Starting in version 3.0, path expressions are supplemented by a new operator that allows you to traverse through a sequence of items and evaluate expressions based on them. It is called the simple map operator and it uses an exclamation point (!). It is in some ways similar to the path operator (/) that separates steps in path expressions, but it is more general purpose.
One limitation of the path operator (/) is that it is limited to nodes. The expression on the lefthand side of the slash must evaluate to zero or more nodes, not atomic values or other items. The expression on the righthand side of the slash can only evaluate to atomic values if it’s the last step in a path. With the simple map operator (!), either side of the operator can evaluate to zero or more items of any kind, including nodes, atomic values, or function items, or any mixture of them.
For example, the following expression takes the first three characters of each product name and converts them to lowercase:
doc("catalog.xml")//name/substring(., 1, 3) ! lower-case(.)
It returns the four string values
("fle", "flo", "del", "cot"). Like the path operator, the simple map operator sets up an iteration over the items to the left of the operator (!). For each of the four substrings returned by the expression on the lefthand side, the expression on the righthand side is evaluated. Also like the path operator, the simple map operator changes the context item, so that when the
lower-case function is called with
. as the argument, the
. refers to the current context item, which is the substring that is currently being processed. However, in this example, the path operator could not be used, because the expression on the lefthand side returns atomic values rather than nodes.
Another difference between the path operator and the simple map operator is that the path operator removes duplicates and always returns items in document order. The simple map operator does neither. For example, the following expression returns the name and number elements for each product, in that order:
doc("catalog.xml")//product ! (name, number)
If the path operator had been used instead, the elements would have been re-sorted in document order, and the
number element would appear before the
name for each product.
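The ordering difference can be observed directly by mapping each result to its element name. This sketch requires an XQuery 3.0 processor and uses hypothetical inline data in which number precedes name.

```xquery
xquery version "3.0";
(: Hypothetical inline data: number comes before name in each product :)
let $catalog :=
  <catalog>
    <product><number>557</number><name>Fleece Pullover</name></product>
    <product><number>563</number><name>Floppy Sun Hat</name></product>
  </catalog>
return (
  (: simple map keeps the order written in the expression :)
  $catalog/product ! (name, number) ! name(),    (: name, number, name, number :)
  (: the path operator re-sorts into document order :)
  ($catalog/product/(name, number)) ! name(),    (: number, name, number, name :)
  (: atomic values on the left-hand side are fine for the simple map :)
  $catalog//name ! substring(., 1, 3) ! lower-case(.)   (: "fle", "flo" :)
)
```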
All of these examples could be rewritten in other ways, for example with FLWOR expressions or with differently structured path expressions that make creative use of parentheses. However, the simple map operator is convenient as a shorthand for iterating over a sequence of items.