Chapter 4. Navigating Input Documents Using Paths

Path expressions are used to navigate input documents to select elements and attributes of interest. This chapter explains how to use path expressions to select elements and attributes from an input document and apply predicates to filter those results. It also covers the different methods of accessing input documents.

Path Expressions

A path expression is made up of one or more steps that are separated by a slash (/) or double slashes (//). For example, the path:

doc("catalog.xml")/catalog/product

selects all the product children of the catalog element in the catalog.xml document. Table 4-1 shows some other simple path expressions.

Table 4-1. Simple path expressions

Example                               Explanation
doc("catalog.xml")/catalog            The catalog element that is the outermost element of the document
doc("catalog.xml")//product           All product elements anywhere in the document
doc("catalog.xml")//product/@dept     All dept attributes of product elements in the document
doc("catalog.xml")/catalog/*          All child elements of the catalog element
doc("catalog.xml")/catalog/*/number   All number elements that are grandchildren of the catalog element

Path expressions return nodes in document order. For example, doc("catalog.xml")//product returns the product elements in the same order that they appear in the catalog.xml document. More information on document order and on sorting results differently can be found in Chapter 7.

Path Expressions and Context

A path expression is always evaluated relative to a particular context item, which serves as the starting point for the relative path. Some path expressions start with a step that sets the context item, as in:

doc("catalog.xml")/catalog/product/number

The function call doc("catalog.xml") returns the document node of the catalog.xml document, which becomes the context item. When the context item is a node (as opposed to an atomic value), it is called the context node. The rest of the path is evaluated relative to it. Another example is:

$catalog/product/number

where the value of the variable $catalog sets the context. The variable must select zero, one, or more nodes, which become the context nodes for the rest of the expression.

A path expression can also be relative. For example, it can simply start with a name, as in:

product/number

This means that the path expression will be evaluated relative to the current context node, which must have been previously determined outside the expression. It may have been set by the processor outside the scope of the query, or in an outer expression.
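The different ways of establishing context can be compared side by side. The following is a minimal sketch, not taken from the original text: the inline document bound to $doc is a made-up stand-in for catalog.xml (two products with invented values), and it plays the role that doc("catalog.xml") plays in the examples above.

```xquery
(: Hypothetical stand-in for catalog.xml; the product data is invented. :)
let $doc := document {
  <catalog>
    <product dept="WMN"><number>557</number><name>Fleece Pullover</name></product>
    <product dept="ACC"><number>563</number><name>Floppy Sun Hat</name></product>
  </catalog>
}
let $catalog := $doc/catalog
return (
  (: The first step is a variable reference that sets the context. :)
  $catalog/product/number,

  (: Inside the parentheses, product/number is a relative path. It is
     evaluated with the catalog element selected by the outer step as
     its context node. :)
  $doc/catalog/(product/number)
)
```

Both expressions select the same two number elements. The second one shows a case where the context node for a relative path is determined by an outer expression.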

Steps and changing context

The context item changes with each step. A step returns a sequence of zero, one, or more nodes that serve as the context items for evaluating the next step. For example, in:

doc("catalog.xml")/catalog/product/number

the doc("catalog.xml") step returns one document node that serves as the context item when evaluating the catalog step. The catalog step is evaluated using the document node as the current context node, returning a sequence of one catalog element child of the document node. This catalog element then serves as the context node for evaluation of the product step, which returns the sequence of product children of catalog.

The final step, number, is evaluated in turn for each product child in this sequence. During this process, the processor keeps track of three things:

  • The context node itself—for example, the product element that is currently being processed

  • The context sequence, which is the sequence of items currently being processed—for example, all the product elements

  • The position of the context node within the context sequence, which can be used to retrieve nodes based on their position
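The three pieces of context information can be made visible with the standard position and last functions. This is a sketch under assumptions: the inline document is an invented stand-in for catalog.xml, and the final step returns strings rather than nodes, which is allowed only for the last step of a path.

```xquery
(: Hypothetical stand-in for catalog.xml; the product data is invented. :)
let $doc := document {
  <catalog>
    <product dept="WMN"><number>557</number><name>Fleece Pullover</name></product>
    <product dept="ACC"><number>563</number><name>Floppy Sun Hat</name></product>
  </catalog>
}
(: The last step is evaluated once per product element. In each evaluation:
     name        is resolved against the current context node
     position()  is that node's position in the context sequence
     last()      is the size of the context sequence :)
return $doc/catalog/product/concat(position(), " of ", last(), ": ", name)
(: With the sample data: "1 of 2: Fleece Pullover", "2 of 2: Floppy Sun Hat" :)
```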

Steps

As we have seen in previous examples, steps in a path can simply be primary expressions like function calls (doc("catalog.xml")) or variable references ($catalog). Any expression that returns nodes can be on the lefthand side of the slash operator.

Another kind of step is the axis step, which allows you to navigate around the XML node hierarchy. There are two kinds of axis steps:

Forward step

This step selects descendants or nodes appearing after the context node (or the context node itself).

Reverse step

This step selects ancestors or nodes appearing before the context node (or the context node itself).

In the examples so far, catalog, product, and @dept are all axis steps (that happen to be forward steps). The syntax of an axis step is shown in Figure 4-1.

Figure 4-1. Syntax of a step in a path expression

Axes

Each forward or reverse step has an axis, which defines the direction and relationship of the selected nodes. For example, the child:: axis (a forward axis) can be used to indicate that only child nodes should be selected, while the parent:: axis (a reverse axis) can be used to indicate that only the parent node should be selected. The 12 axes are listed in Table 4-2.

Table 4-2. Axes

Axis                   Meaning
self::                 The context node itself.
child::                Children of the context node. Attributes are not considered children of an element. This is the default axis if none is specified.
descendant::           All descendants of the context node (children, children of children, etc.). Attributes are not considered descendants.
descendant-or-self::   The context node and its descendants.
attribute::            Attributes of the context node (if any).
following::            All nodes that follow the context node in the document, minus the context node's descendants.
following-sibling::    All siblings of the context node that follow it. Attributes of the same element are not considered siblings.
parent::               The parent of the context node (if any). This is either the element or the document node that contains it. The parent of an attribute is its element, even though it is not considered a child of that element.
ancestor::             All ancestors of the context node (parent, parent of the parent, etc.).
ancestor-or-self::     The context node and all its ancestors.
preceding::            All nodes that precede the context node in the document, minus the context node's ancestors.
preceding-sibling::    All the siblings of the context node that precede it. Attributes of the same element are not considered siblings.

Important

An additional forward axis, namespace, is supported (but deprecated) by XPath 2.0 but not supported at all by XQuery 1.0. It allows you to access the in-scope namespaces of a node.

Implementations are not required to support the following axes: following, following-sibling, ancestor, ancestor-or-self, preceding, and preceding-sibling.
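A few of the axes can be tried out with unabbreviated steps. The sketch below is illustrative only: the inline document is an invented stand-in for catalog.xml, and it deliberately uses only axes that every implementation must support (child, descendant, descendant-or-self, attribute, parent, and self).

```xquery
(: Hypothetical stand-in for catalog.xml; the product data is invented. :)
let $doc := document {
  <catalog>
    <product dept="WMN"><number>557</number><name>Fleece Pullover</name></product>
    <product dept="ACC"><number>563</number><name>Floppy Sun Hat</name></product>
  </catalog>
}
(: Forward axis; selects every name element below the document node. :)
let $names := $doc/descendant::name
return (
  (: Two forward steps along the child axis. Result: 2 :)
  count($doc/child::catalog/child::product),

  (: A reverse step to the parent, then a forward step to its attribute.
     Result: "WMN", "ACC" :)
  $names/parent::product/attribute::dept/string(),

  (: Each name element plus its text node. Result: 4 :)
  count($names/descendant-or-self::node())
)
```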

Node Tests

In addition to having an axis, each axis step has a node test. The node test indicates which of the nodes (by name or node kind) to select, along the specified axis. For example, child::product only selects product element children of the context node. It does not select other kinds of children (for example, text nodes), or other product elements that are not children of the context node.

Node name tests

In previous examples, most of the node tests were based on names, such as product and dept. These are known as name tests. The syntax of a node name test is shown in Figure 4-2.

Figure 4-2. Syntax of a node name test

Node name tests and namespaces

Names used in node tests are qualified names, meaning that they are affected by namespace declarations. A namespace declaration is in scope if it appears in an outer element, or in the query prolog. The names may be prefixed or unprefixed. If a name is prefixed, its prefix must be mapped to a namespace using a namespace declaration.

If an element name is unprefixed, and there is an in-scope default namespace declared, it is considered to be in that namespace; otherwise, it is in no namespace. Attribute names, on the other hand, are not affected by default namespace declarations.

Use of namespace prefixes in path expressions is depicted in Example 4-1, where the prod prefix is first mapped to the namespace, and then used in the steps prod:product and prod:number. Keep in mind that the prefix is just serving as a proxy for the namespace name. It is not important that the prefixes in the path expressions match the prefixes in the input document; it is only important that the prefixes map to the same namespace. In Example 4-1, you could use the prefix pr instead of prod in the query, as long as you used it consistently throughout the query.

Example 4-1. Prefixed name tests

Input document (prod_ns.xml)
<prod:product xmlns:prod="http://datypic.com/prod">
  <prod:number>563</prod:number>
  <prod:name language="en">Floppy Sun Hat</prod:name>
</prod:product>
Query
declare namespace prod = "http://datypic.com/prod";
<prod:prodList>{
  doc("prod_ns.xml")/prod:product/prod:number
}</prod:prodList>
Results
<prod:prodList xmlns:prod="http://datypic.com/prod">
  <prod:number>563</prod:number>
</prod:prodList>

Node name tests and wildcards

You can use wildcards to match names. The step child::* (abbreviated simply *) can be used to select all element children, regardless of name. Likewise, @* (or attribute::*) can be used to select all attributes, regardless of name.

In addition, wildcards can be used for just the namespace and/or local part of a name. The step prod:* selects all child elements in the namespace mapped to the prefix prod, and the step *:product selects all product child elements that are in any namespace, or no namespace.
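The effect of prefixes and wildcards can be checked by counting what each step selects. This is a sketch under assumptions: the element bound to $product repeats the input document from Example 4-1 inline, and the query maps the prefix pr (rather than prod) to the same namespace to show that only the namespace name matters.

```xquery
declare namespace pr = "http://datypic.com/prod";

(: Inline copy of the prod_ns.xml content from Example 4-1. :)
let $product :=
  <prod:product xmlns:prod="http://datypic.com/prod">
    <prod:number>563</prod:number>
    <prod:name language="en">Floppy Sun Hat</prod:name>
  </prod:product>
return (
  (: All child elements in the namespace mapped to pr. Result: 2 :)
  count($product/pr:*),

  (: Child elements with local name number, in any namespace. Result: 1 :)
  count($product/*:number),

  (: Unprefixed name test. No default element namespace is declared in
     this query, so it matches only no-namespace elements. Result: 0 :)
  count($product/number),

  (: All attributes of the name element. Result: 1 :)
  count($product/pr:name/@*)
)
```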

Node kind tests

In addition to the tests based on node name, you can test based on node kind. The syntax of a node kind test is shown in Figure 4-3.

Figure 4-3. Syntax of a node kind test[a]

The test node( ) will retrieve all different kinds of nodes. You can specify node( ) as the entire step, and it will default to the child:: axis. In this case, it will bring back child element, text, comment, and processing-instruction nodes (but not attributes, because they are not considered children). This is in contrast to *, which selects child element nodes only.

You can also use node( ) in conjunction with the axes. For example, ancestor::node( ) returns all ancestor element nodes and the document node (if it exists). This is different from ancestor::*, which returns ancestor element nodes only. You can even use attribute::node( ), which will return attribute nodes, but this is not often used because it means the same as @*.
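The difference between *, node( ), and @* shows up as soon as an element has children that are not elements. The following sketch uses an invented product element containing a comment; it is written on one line so that no whitespace text nodes can affect the counts.

```xquery
(: Hypothetical product element; the comment and values are invented. :)
let $product :=
  <product dept="ACC"><!-- seasonal item --><number>563</number><name language="en">Floppy Sun Hat</name></product>
return (
  (: Element children only. Result: 2 :)
  count($product/*),

  (: Children of any kind; here the comment and two elements. Result: 3 :)
  count($product/node()),

  (: Attributes, selected in two equivalent ways. Result: 1 and 1 :)
  count($product/@*),
  count($product/attribute::node()),

  (: The only child of name is a text node. Result: 1 :)
  count($product/name/node())
)
```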

Four other kind tests, text( ), comment( ), processing-instruction( ), and document-node( ), are discussed in Chapter 21.

If you are using schemas, you can also test elements and attributes based on their type using node kind tests. For example, you can specify element(*, ProductType) to return all elements whose type is ProductType, or element(product, ProductType) to return all elements named product whose type is ProductType. This is discussed further in the section "Sequence Types and Schemas" in Chapter 13.

Abbreviated Syntax

Some axes and steps can be abbreviated, as shown in Table 4-3. The abbreviations "." and ".." are used as the entire step (with no node test). "." represents the current context node itself, regardless of its node kind. Likewise, the step ".." represents the parent node, which could be either an element node or a document node.

Table 4-3. Abbreviations

Abbreviation   Meaning
.              self::node( )
..             parent::node( )
@              attribute::
//             /descendant-or-self::node( )/

The @ abbreviation, on the other hand, replaces the axis only, so it is used along with a node test or wildcard. For example, you can use @dept to select dept attributes, or @* to select all attributes.

The // abbreviation is a shorthand to indicate a descendant anywhere in a tree. For example, catalog//number will match all number elements at any level among the descendants of catalog. You can start a path with .// if you want to limit the selection to descendants of the current context node.

Table 4-4 shows additional examples of abbreviated and unabbreviated syntax.

Table 4-4. Abbreviated and unabbreviated syntax examples

Unabbreviated syntax              Abbreviated equivalent
child::product                    product
child::*                          *
self::node( )                     .
attribute::dept                   @dept
attribute::*                      @*
descendant::product               .//product
child::product/descendant::name   product//name
parent::node( )/number            ../number
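One way to convince yourself that an abbreviation and its expansion are interchangeable is to check that both select exactly the same nodes. The sketch below is an assumption-laden illustration: local:same-nodes is a hypothetical helper written for this example (it is not a built-in function), and the inline document is an invented stand-in for catalog.xml.

```xquery
(: Hypothetical helper: true if both sequences contain the same nodes. :)
declare function local:same-nodes($a as node()*, $b as node()*) as xs:boolean {
  empty($a except $b) and empty($b except $a)
};

(: Hypothetical stand-in for catalog.xml; the product data is invented. :)
let $doc := document {
  <catalog>
    <product dept="WMN"><number>557</number><name>Fleece Pullover</name></product>
    <product dept="ACC"><number>563</number><name>Floppy Sun Hat</name></product>
  </catalog>
}
let $number := ($doc//number)[1]
return (
  (: //  versus  /descendant-or-self::node()/ :)
  local:same-nodes($doc//name,
                   $doc/descendant-or-self::node()/child::name),
  (: @  versus  attribute:: :)
  local:same-nodes($doc/catalog/product/@dept,
                   $doc/child::catalog/child::product/attribute::dept),
  (: ..  versus  parent::node() :)
  local:same-nodes($number/.., $number/parent::node()),
  (: .  versus  self::node() :)
  local:same-nodes($doc/catalog/., $doc/child::catalog/self::node())
)
(: Each of the four comparisons returns true. :)
```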

Other Expressions As Steps

In addition to axis steps, other expressions can also be used as steps. You have already seen this in use in:

doc("catalog.xml")/catalog/product/number

where doc("catalog.xml") is a function call that is used as a step. You can include more complex expressions, for example:

doc("catalog.xml")/catalog/product/(number | name)

which uses the parenthesized expression (number | name) to select all number and name elements. The | operator is a union operator; it selects the union of two sets of nodes.
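Because path expressions return nodes in document order, the order of the operands of | does not affect the order of the result. The sketch below illustrates this with an invented stand-in for catalog.xml, writing the operands as (name | number) even though number comes first in each product.

```xquery
(: Hypothetical stand-in for catalog.xml; the product data is invented. :)
let $doc := document {
  <catalog>
    <product dept="WMN"><number>557</number><name>Fleece Pullover</name></product>
    <product dept="ACC"><number>563</number><name>Floppy Sun Hat</name></product>
  </catalog>
}
return $doc/catalog/product/(name | number)
(: With the sample data, the nodes come back in document order:
   number 557, name Fleece Pullover, number 563, name Floppy Sun Hat :)
```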

If the expression in a step contains an operator with lower precedence than /, it needs to be in parentheses. Some other examples of more complex steps are provided in Table 4-5.

Table 4-5. More complex steps (examples start with doc("catalog.xml")/catalog/)

Example                                   Meaning
product/(number | name)                   All number AND name children of product.
product/(* except number)                 All children of product except number. See "Combining Results" in Chapter 9 for more information on the | and except operators.
product/(if (desc) then desc else name)   For each product element, the desc child if it exists; otherwise, the name child.
product/substring(name,1,30)              A sequence of xs:string values that are substrings of product names.

The last step (and only the last step) in a path may return atomic values rather than nodes. The last example in Table 4-5 will return a sequence of atomic values that are the substrings of the product names. An error is raised if a step that is not the last returns atomic values. For example:

product/substring(name,1,30)/replace(.,' ','-')

will raise an error because the substring step returns atomic values, and it is not the last step.
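One way to get the effect that the failing path was aiming for is to move the extra processing out of the path and into a for expression. This is a sketch of one possible workaround, not a prescribed fix; the inline document is an invented stand-in for catalog.xml, and the substring lengths are chosen only to keep the results short.

```xquery
(: Hypothetical stand-in for catalog.xml; the product data is invented. :)
let $doc := document {
  <catalog>
    <product dept="WMN"><number>557</number><name>Fleece Pullover</name></product>
    <product dept="ACC"><number>563</number><name>Floppy Sun Hat</name></product>
  </catalog>
}
return (
  (: Allowed: only the last step returns atomic values.
     Result: "Fleece", "Floppy" :)
  $doc/catalog/product/substring(name, 1, 6),

  (: The path ends at the name elements; both function calls are applied
     to each name outside the path.
     Result: "Fleece-Pul", "Floppy-Sun" :)
  for $n in $doc/catalog/product/name
  return replace(substring($n, 1, 10), ' ', '-')
)
```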



[a] The detailed syntax of <element-attribute-test> is shown in Figure 13-4.
