## Chapter 4. XPath Functions and Numeric Operators

The XPath 1.0 Recommendation specifies a number of functions and numeric operations that can be used to refine the results returned by an XPath expression.

Before getting into the details of these features’ uses, let’s take a look at a fundamental question: what are functions in the first place? (If you’re already familiar with the use of functions in programming languages, such as Java, C++, and Visual Basic, feel free to skip this section.)

## Introduction to Functions

When I was a kid, I loved watching my father work on cars. He’d been a mechanic all his life, and the automotive toolkit he’d acquired over the course of the years was exotic (to my eyes, anyhow).

One of the smaller items in Dad’s toolkit was something he called a “spark-plug gapper.” It was something like a Swiss-Army knife, with a half-dozen or so stiff steel prongs that you could swivel out from the tool’s main body. Each L-shaped prong was of a slightly different thickness; depending on the model of car you were working on and the specific spark plug’s specifications, you’d tap the end of the spark plug on the pavement and, using the gapper, ensure that the distance across which the spark was to jump was just right. There was also a small, stiff plane of sheet metal attached to the gapper, which you could use to spread the gap if you’d already closed it up too much. The objective was to get the gap just right, to ensure that the spark plug fired in exactly the right way.

A function in computer-language terms is like a spark-plug gapper. It’s a tool provided by a software developer. You use the tool in the same general way for a given task, whenever you need to obtain some result you can’t obtain (or obtain easily) without the tool.

Almost without exception, regardless of the computer language in question, functions are represented syntactically the same way:

`function_name(arg1, ...)`

Each function (like each tool in a mechanic’s toolbox) has a distinct name. Depending on the function, you may pass one or more arguments to it, which change its behavior in various ways. The arguments are enclosed in parentheses. Thus, the spark-plug gapper might be represented like this:

`gapper(prong)`

where `gapper` is the function name and `prong`, a single argument provided (or “passed”) to the function. Under many circumstances, you wouldn’t pass a function like this the literal token `p`, `r`, `o`, `n`, `g`; rather, this is just a placeholder, a reminder to you of what you do pass to it. In this form, the function syntax is called a prototype. When you actually use (or “call” or “invoke”) a function, you typically substitute a literal value for each argument. So an actual call to our hypothetical `gapper( )` function might look like this:

`gapper(1)`

Now, many functions are fussy about both the number of arguments and the type. Depending on how the `gapper( )` function is written, for example, the following might be illegal calls to it:

```
gapper(3, 6)
gapper("1")
```

In the first example, there’s more than one argument passed; in the second, the argument being passed is a literal string rather than a literal number. If in fact a legal call to `gapper( )` passes only a single numeric argument, either of these two calls will fail — probably resulting in an error message of some kind.

### What Functions Do

The interesting thing about functions is that a call to one of them takes the place of some literal value in a “sentence” in the programming language in question. That is, a function returns a value.

So Dad, in my case, might say to me something like, “Here. Set the gap on this plug to fifteen-thousandths of an inch.” This would require me to know how to determine which prong on the gapper produced exactly that effect. More likely, especially if this was my first time handling the tool, he’d say, “Set the gap on this plug with prong number one” (or whatever). In a hypothetical computer language to achieve this purpose, this English-language instruction might be rendered:

`setgap gapper(1)`

This would achieve the same effect as:

`setgap .015`

That is, in formal terms, the `gapper( )` function, when passed a numeric value of 1, returns the numeric value .015.

### Functions Within Functions

Given, then, that a function returns a value and that the arguments passed to functions are themselves values, it’s entirely legal — even desirable, in many circumstances — to pass one function as an argument to another.

Returning to the spark plug-gapping tool, as you can see from the preceding example, the `gapper( )` function returns a value in the form of a fraction of an inch, based on the “prong number” selected. This was eminently reasonable back in Dad’s day. Now, though, most spark-plug gaps are expressed in terms of millimeters. So now we’d need a separate method for gapping a plug in metric units instead. The hypothetical computer-language expression of this might look something like:

`setgap_mm inches_to_mm(gapper(1))`

which is equivalent to:

`setgap_mm inches_to_mm(.015)`

which in turn is equivalent to:

`setgap_mm .38`

The `gapper( )` function here still returns .015, the thickness in inches of prong #1. This value in turn is passed to a hypothetical conversion function, `inches_to_mm( )`, which takes a single argument (a number of inches) and converts it to millimeters: .015 inch is .381 mm, or .38 when rounded to two places.
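If it helps to see the analogy in a real language, here is a minimal Python sketch of the two hypothetical functions. Everything in it is an assumption made for illustration: the prong table holds only the one value mentioned in the text (prong 1 is .015 inch), and the type check is just one way a function might be "fussy" about its argument.

```python
# Hypothetical prong table: prong number -> gap in inches.
# Only prong 1 (.015 in) comes from the text; a real gapper has more prongs.
PRONG_GAPS_IN = {1: 0.015}


def gapper(prong):
    """Return the gap, in inches, produced by the given prong number."""
    # Be fussy about the argument type: a string such as "1" is rejected.
    if isinstance(prong, bool) or not isinstance(prong, int):
        raise TypeError("gapper() expects a single numeric prong number")
    return PRONG_GAPS_IN[prong]


def inches_to_mm(inches):
    """Convert a length in inches to millimeters (1 inch = 25.4 mm)."""
    return inches * 25.4


# A function call nested inside another: the inner call's return value
# becomes the outer call's argument.
gap_mm = inches_to_mm(gapper(1))
print(round(gap_mm, 2))  # 0.38
```

Calling `gapper(3, 6)` or `gapper("1")` in this sketch raises a `TypeError`, which mirrors the "illegal call" cases described earlier.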

## XPath Function Types

The functions available for use in XPath expressions are categorized by the type of value they return and/or by the type of values they operate on as arguments. These categories are node-set, string, Boolean, and numeric functions.

In each of the function prototypes in this section, I’ll use the following scheme to denote the kind of arguments passed:

`string`

Argument is a string value, to be enclosed in quotation marks in the function call. If a function call takes more than one string argument, I’ll append a number to each, as in `string1`, `string2`, and so on.

`nodeset`

Argument is a node-set, represented by an XPath location path. Note that if you’re using XPath in an XSLT stylesheet, this location path will (if it’s a relative path) be sensitive to the context established by the stylesheet at that point. Whether you’re using XPath in XSLT or an XPointer, earlier portions of a complete location path can of course establish a context for node-set references in later portions.

`boolean`

Argument has a Boolean value of true or false.

`number`

Argument has a numeric value. If a function call takes more than one numeric argument, I’ll append a number to each, as in `number1`, `number2`, and so on.

`anytype`

Argument can be any of several types. For instance, you can pass certain functions a string or a numeric argument, and the function will handle any necessary data-type conversion.

`?`

A question mark appended to one of the above data types means the argument is optional. For instance, a call to a hypothetical `my_func( )` function might come with a prototype such as `my_func(string?)`. This would mean that when you call `my_func( )`, you may supply a string argument or no argument at all. In such a case, the function will usually assume some default value for the argument, perhaps derived from the context node at the point of the function call.

Note that the type of data returned by each function is documented in a table at the start of the section dealing with the appropriate function type. Unlike functions in some traditional programming languages, XPath functions always return a value.

### Tip

XSLT and EXSLT both provide functions that go beyond XPath itself. If you need something beyond what this chapter describes, see Appendix A.

### Node-Set Functions

A node-set function, as the name implies, operates on a node-set; one of these functions also returns a node-set. Table 4-1 summarizes the functions in this category. Each function is discussed in detail in a separate subsection following the table. These discussions assume that the source document being referenced by XPath expressions is the following fragment of an XML document:

```<!DOCTYPE book:book [
<!ATTLIST book:section id ID #IMPLIED>
]>
<book:book
xmlns:book="http://mynamespace/uri"
xmlns="http://www.w3.org/2000/svg">

<book:chapter>
<book:section id="sect_01">
<book:para id="para_01">Some text...
</book:para>
<svg>
<circle style="fill:yellow"
cx="100" cy="100" r="50"/>
</svg>
</book:section>
</book:chapter>

[...remainder of book...]

</book:book>```

### Tip

Note the `ATTLIST` declaration in this document’s prolog. As you’ll see, your documents may need such a declaration (in either the internal DTD subset, as here, or an external one) to determine that an attribute is of the `ID` type. However, if you’re using the MSXML XSLT processor, be aware that the `ATTLIST` declaration alone will not suffice to make the processor recognize the (in this case) `id` attribute; the processor also requires that the attribute’s parent element (`book:section`, here) be declared.

Table 4-1. Node-set functions

| Function prototype | Returns | Description |
| --- | --- | --- |
| `last( )` | Number | Returns the number of nodes in the context node-set |
| `position( )` | Number | Returns the ordinal position of the context node within the context node-set |
| `count(nodeset)` | Number | Returns the number of nodes in `nodeset` |
| `id(anytype)` | Node-set | Returns the element node with an ID-type attribute equal to the value of the passed argument |
| `local-name(nodeset?)` | String | Returns the local name (that is, the QName without a namespace prefix) of the first node in `nodeset` |
| `namespace-uri(nodeset?)` | String | Returns the URI associated with the namespace prefix of the first node in `nodeset` |
| `name(nodeset?)` | String | Returns the QName of the first node in `nodeset` |

#### last( )

If the context node-set, at the point of the call to `last( )`, contains 12 nodes, `last( )` returns 12.

Assume the context for the call to `last( )` is established by a location path such as the following, referencing the sample XML document above:

`//book:section/*`

This locates a node-set consisting of two element nodes, `book:para` and `svg`. Therefore, a call to `last( )` at this point returns the value 2.

Probably the most common use of `last( )` is in a location step’s predicate, as in:

`/book:book/book:chapter[last( )]`

which selects the last chapter in the book.

#### position( )

This commonly used function returns the integer representing the context node’s ordinal position within the context node-set. These positions begin at 1 (for the first node in the node-set) and increment up to the value of the `last( )` function.

The sample XML document’s root element, `book:book`, has six element nodes visible along its `descendant::` axis (`book:chapter`, `book:section`, `book:para`, `book:ref`, `svg`, and `circle`). Therefore, this location path:

`/book:book/descendant::*`

locates a node-set consisting of those six element nodes. You could locate just the `svg` node by adding a predicate, as here:

`/book:book/descendant::*[position(  )=5]`

That is, “locate the fifth node in the node-set.”

Note that the value returned by the `position( )` function is sensitive to the forward or reverse direction of the axis in effect. For forward-type axes, such as `descendant::` in the preceding example, nodes are accessed in their natural document order; for reverse-type axes, nodes are accessed in reverse document order. So we could build a location path beginning, say, at the `svg` element node and locating the ancestor `book:chapter` element with a location path such as:

`//svg/ancestor::*[position(  )=2]`

Here, the `book:section` element is ancestor #1 in reverse document order, and `book:chapter` is ancestor #2.
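The reverse-axis numbering can be mimicked in Python. This is only a sketch: the standard library's `ElementTree` does not implement the `ancestor::` axis, so the `ancestors( )` helper below is my own stand-in that walks up the tree and numbers nodes the way a reverse axis would. The document is a trimmed copy of the sample, without the DTD or the elided remainder.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of this section's sample document (DTD and elided parts omitted).
DOC = """<book:book xmlns:book="http://mynamespace/uri"
           xmlns="http://www.w3.org/2000/svg">
  <book:chapter>
    <book:section id="sect_01">
      <book:para id="para_01">Some text...</book:para>
      <svg><circle style="fill:yellow" cx="100" cy="100" r="50"/></svg>
    </book:section>
  </book:chapter>
</book:book>"""

root = ET.fromstring(DOC)

# ElementTree keeps no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def local_name(element):
    """Strip ElementTree's {uri} prefix from an element's tag."""
    return element.tag.rpartition("}")[2]


def ancestors(node):
    """List ancestors nearest-first, i.e. in reverse document order."""
    found = []
    while node in parent_of:
        node = parent_of[node]
        found.append(node)
    return found


svg = next(el for el in root.iter() if local_name(el) == "svg")
names = [local_name(a) for a in ancestors(svg)]
print(names)         # ['section', 'chapter', 'book']
print(names[2 - 1])  # position()=2 on the reverse axis -> 'chapter'
```

Note the `2 - 1`: XPath positions start at 1, while Python list indexes start at 0.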

The `position( )` function is important, as I’ve said, for two reasons:

• It can be represented in a location step’s predicate simply by the value of the position for which you want to test. That is:

`//svg/ancestor::*[position( )=2]`

and:

`//svg/ancestor::*[2]`

are functionally identical.

• Many XSLT operations must be performed for every nth occurrence of some kind of node in the source tree being transformed. For a simple example, perhaps you want to shade every odd row of a table, leaving the even rows unshaded: in other words, to shade every second row. This kind of processing can be achieved easily using `position( )` together with the `mod` numeric operator. I’ll describe `mod` later in this chapter and give an example of its use with `position( )`.

#### count(nodeset)

In every respect but one, the `count( )` function operates identically to the `last( )` function covered earlier. What makes it different is that `count( )` takes one argument; `last( )`, none. Thus, `count( )` can be used to return the number of nodes in some arbitrary node-set other than the current one.

The following XSLT template displays the number of the current section in the chapter within which it appears, then displays (using `count( )`) the total number of `book:section` nodes in the document as a whole. Note the nested `xsl:for-each` elements, which cause processing in the template rule to “loop” through some set of operations for every node in a selected node-set. Here, the outermost `xsl:for-each` element loops through every `book:chapter` element; the innermost `xsl:for-each` element loops through each `book:section` child of the selected `book:chapter`.

```
<xsl:template match="/">
<xsl:for-each select="book:book/book:chapter">
Chapter <xsl:value-of select="position( )"/>:
<xsl:for-each select="book:section">
This is section #<xsl:value-of select="position( )"/> of
<xsl:value-of select="last( )"/> within its chapter.
</xsl:for-each>
</xsl:for-each>
The total number of sections in the book is: <xsl:value-of
select="count(//book:section)"/>
</xsl:template>
```

When you apply this template to a sample document consisting of three `book:chapter` elements (the first with one `book:section` child element, the second with three, and the third with two), the result is:

```   Chapter 1:

This is section #1 of
1 within its chapter.

Chapter 2:

This is section #1 of
3 within its chapter.

This is section #2 of
3 within its chapter.

This is section #3 of
3 within its chapter.

Chapter 3:

This is section #1 of
2 within its chapter.

This is section #2 of
2 within its chapter.

The total number of sections in the book is: 6```
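The difference between `last( )` and `count( )` may be easier to see outside XSLT. The following Python sketch is only an analogy: `enumerate( )` stands in for `position( )`, the length of the list being looped over stands in for `last( )`, and the length of an independently selected list stands in for `count( )`. The three-chapter document is a hypothetical stand-in for the one described above, with namespaces omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the three-chapter document described above
# (chapters with one, three, and two sections; namespaces omitted).
book = ET.fromstring(
    "<book>"
    "<chapter><section/></chapter>"
    "<chapter><section/><section/><section/></chapter>"
    "<chapter><section/><section/></chapter>"
    "</book>"
)

rows = []
for chapter_pos, chapter in enumerate(book.findall("chapter"), start=1):
    sections = chapter.findall("section")  # plays the context node-set
    for section_pos, _section in enumerate(sections, start=1):
        # section_pos plays position(); len(sections) plays last()
        rows.append((chapter_pos, section_pos, len(sections)))

# count(//book:section): the size of an arbitrary node-set, whatever the
# context happens to be at the time.
total_sections = len(book.findall(".//section"))

for chapter_pos, section_pos, last in rows:
    print(f"Chapter {chapter_pos}: section #{section_pos} of {last}")
print("Total sections:", total_sections)
```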

#### id(anytype)

Unlike the other functions in the node-set category, the `id( )` function actually returns a node-set, given its argument. (The argument is usually, but not always, a string; see Section 4.2.1.5 for more information.) The value of the argument locates the set of all element nodes with the indicated ID-type attribute; because the value of an ID-type attribute, by definition, must be unique within a given document, the resulting node-set thus will contain a single element node (or be empty, if no elements have an ID-type attribute with this value). You can also use a whitespace-delimited list of values to return all element nodes with any matching ID-type attributes. For instance:

`id("sect_01 sect_05 sect_88")`

returns a node-set consisting of up to three element nodes. If no element nodes match a particular value, no error condition exists. In this example, if no element node has an ID-type attribute whose value is `sect_05` but there are matches for the other two values, the resulting node-set would contain two elements.

It’s important that you heed the phrase “ID-type attribute” here. The `id( )` function ignores any attributes whose names are “id,” unless they are declared in the document’s DTD as being of the ID type. Thus:

`id("sect_01")`

successfully returns the `book:section` element with that `id` attribute value, while:

`id("para_01")`

returns an empty node-set: the former `id` attribute is expressly declared to be an ID-type attribute in the document’s DTD, while the latter is not. Perhaps more importantly, if there is no DTD at all — if the document is simply well formed — it doesn’t make any difference what value you pass to the `id( )` function; it will always in this case return an empty node-set. If you’re uncertain whether an attribute named `id` is of the ID type — or know for sure that it isn’t — test the attribute value in a location step’s predicate, as in:

`[@id="para_01"]`

or, if the context node is already the `id` attribute:

`[.="para_01"]`

Such an approach, while perhaps more prosaic, is also closer to failure-proof. (XSLT users can also take advantage of `keys` to ensure unique identifiers.)

#### id( ) and node-set arguments

The `id( )` function is unique among functions in the XPath spec in one regard. As with the other functions, if you pass it a value that is not a string, the value is treated as if it had been converted to a string by the `string( )` function (covered later in this chapter). Typically, when you pass `string( )` a node-set, it returns the string-value of only the first node in the node-set. However, when you pass `id( )` a node-set, the function returns not just a single node (whose ID-type attribute’s value would presumably match the string-value of the first node), but rather a node-set containing all element nodes whose ID-type attributes match any of the string-values of nodes in the passed node-set.
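The lookup rules for `id( )` can be modeled in a few lines of Python. Treat this strictly as a sketch: `xpath_id( )` and `id_index` are names I have made up, the index is a plain dictionary standing in for what a DTD-aware parser would know about ID-type attributes, and `sect_88` is a hypothetical second section. Note that `para_01` is deliberately absent from the index, because its `id` attribute is not declared as an ID type.

```python
def xpath_id(id_index, arg):
    """Model id(): arg is a string, or a list of node string-values.

    id_index maps each ID-type attribute value to its element.
    """
    # A node-set argument contributes the string-value of every node,
    # not just the first one.
    values = [arg] if isinstance(arg, str) else list(arg)
    matches = []
    for value in values:
        # Each value may itself be a whitespace-delimited list of IDs.
        for token in value.split():
            element = id_index.get(token)
            # Unmatched tokens are skipped silently; no error is raised.
            if element is not None and element not in matches:
                matches.append(element)
    # A real node-set would be in document order; this list is in lookup order.
    return matches


# Hypothetical index: strings stand in for the element nodes themselves.
id_index = {
    "sect_01": "book:section#sect_01",
    "sect_88": "book:section#sect_88",
}

print(xpath_id(id_index, "sect_01 sect_05 sect_88"))   # two matches
print(xpath_id(id_index, "para_01"))                    # []
print(xpath_id(id_index, ["sect_88", "sect_01 sect_05"]))
```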

#### local-name(nodeset?)

The `local-name( )` function returns the name of a node, shorn of any namespace prefix. (You might call this the un-QName.) If the optional argument is not supplied, the function operates as if you had passed it a node-set consisting only of the context node. If the node-set contains more than one node, the function returns the local-name of only the first node (in document order) in the node-set:

`local-name(//book:chapter)`

returns the string `chapter` and:

`local-name(//svg)`

returns the string `svg`. On the other hand:

`local-name(//book:section | //svg)`

(note the compound location path) returns the string `section` — that is, the local-name of only the first node in the node-set.

#### namespace-uri(nodeset?)

When you need to know the URI associated with a given element or attribute’s namespace in an instance document, call the `namespace-uri( )` function. If you omit the optional argument, its default value is a node-set consisting of just the context node. If the node-set passed as the argument consists of more than one node, the function returns the URI associated with only the first node in the node-set. If the specified element or attribute has no associated URI, the function returns an empty string.

In the sample XML document, each element node is associated with a namespace URI. The elements with explicit `book:` prefixes are associated with the URI tied to that prefix via its namespace declaration. For instance:

`namespace-uri(/book:book)`

returns the string “http://mynamespace/uri”.

Note that when an attribute node’s name is unprefixed, the `namespace-uri( )` function returns an empty string, even when there’s an explicit default namespace declaration (`xmlns` attribute) in effect for the attribute’s element. This expression:

`namespace-uri(//circle/@style)`

returns an empty string.

An attribute also does not acquire the namespace URI associated with the corresponding element automatically. For attributes in the sample document such as `id`, `style`, and `cx`, an empty string is returned as the namespace URI. However, this expression:

`namespace-uri(//book:ref/@xlink:href)`

returns the string “http://www.w3.org/1999/xlink/”, the URI associated (via the `xmlns:xlink` declaration) with the attribute’s `xlink:` namespace prefix.

By the way, as always when dealing with namespaces, remember that the exact namespace prefix is seldom relevant: what counts is the URI to which the namespace prefix is bound. For instance, at the time the XPath expression is evaluated (say, in an XSLT stylesheet or an XPointer), the namespace URI “http://mynamespace/uri” might be associated with the prefix `mybook:`. In such a context, the following two function calls return exactly the same results as long as the document containing the XPath expression binds both prefixes to the same URI:

```
namespace-uri(/book:book)
namespace-uri(/mybook:book)
```

The document being evaluated by the expression needn’t use either of those two prefixes in the element name, as long as whatever prefix it does use is bound to the “http://mynamespace/uri” URI.
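You can watch the "prefix doesn't matter" principle at work in Python's standard `ElementTree` parser, which discards prefixes and stores each element name as `{uri}local`. The two helper functions below are my own rough analogues of `namespace-uri( )` and `local-name( )` for element nodes, not part of any library, and the two one-element documents are made up for the demonstration.

```python
import xml.etree.ElementTree as ET


def namespace_uri(element):
    """Rough analogue of namespace-uri(): the URI part of the tag, or ''."""
    tag = element.tag
    return tag[1:].partition("}")[0] if tag.startswith("{") else ""


def local_name(element):
    """Rough analogue of local-name(): the tag without its namespace part."""
    return element.tag.rpartition("}")[2]


# Same URI, two different prefixes.
doc_a = ET.fromstring('<book:book xmlns:book="http://mynamespace/uri"/>')
doc_b = ET.fromstring('<mybook:book xmlns:mybook="http://mynamespace/uri"/>')

print(doc_a.tag)             # {http://mynamespace/uri}book
print(doc_a.tag == doc_b.tag)  # True: the prefix has disappeared
print(namespace_uri(doc_a))  # http://mynamespace/uri
print(local_name(doc_a))     # book
```

An element in no namespace at all, such as `<plain/>`, yields an empty string from `namespace_uri( )` here, matching the behavior described for `namespace-uri( )`.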

#### name(nodeset?)

If your applications must refer to nodes in a namespace-aware fashion (as most applications do), the `name( )` function will probably be the node-name function you’ll use least often. That’s because it returns the QName of the first node in the node-set passed (or defaulted) as its sole argument: the “intuitive” name, including both the namespace prefix and the local-name portion. Therefore, `name( )` is truly reliable only when processing elements and attributes in no namespace at all. As in the other name-related functions, passing no argument at all causes `name( )` to operate on a node-set consisting of just the context node; if the node-set argument includes more than one node, the function operates on only the first.

On the face of it, using a function such as `name( )` might seem superfluous. After all, the most common form of a location step includes an explicit node name, and if you already know the node’s name, there’s no need for a function to return it.

Where it comes in handy is when you don’t (for one reason or another) know the name of the node in question or simply need to test the name (particularly of an attribute) against a string. For instance, you might need to isolate the nth child of a particular element, displaying its name and the names and values of all its attributes. Here’s another example from XSLT using nested `xsl:for-each` elements:

```
<!-- Process the first child of each book:section element -->
<xsl:template match="/">
<xsl:for-each select="//book:section/*[1]">
<!-- Display this child's name... -->
The first child's name is <xsl:value-of select="name( )"/>,
and it has the following attributes:<br/>
<xsl:for-each select="@*">
<!-- ...and the name and value of each attribute -->
<xsl:value-of select="name( )"/> = <xsl:value-of select="."/><br/>
</xsl:for-each>
</xsl:for-each>
</xsl:template>
```

Applied to the sample XML document for this section, this template rule generates the text:

```The first child's name is book:para, and it has the following attributes:
id = para_01```

A special case of “not knowing the name of the node in question” occurs in generic XSLT stylesheets whose purpose is to describe the documents they process, displaying the names of the various nodes and their values. A portion of such a stylesheet might look something like the following:

```
<!-- Process all element children of the context node -->
<xsl:template match="*">
<!-- Display the child element's name -->
Child's name: <xsl:value-of select="name( )"/>
<!-- Display the child element's (string) value  -->
Child's value: <xsl:value-of select="."/>
</xsl:template>
```

What makes this a special case is not necessarily that you really don’t know the node’s name — you may know very well what element names occur in your source document — rather, the general-purpose code doesn’t care what the element’s name is at this point; it treats all (child) elements the same way.

### Tip

In Chapter 2, you saw something called an “expanded-name” for the various node types. I described an algorithm there for computing an expanded-name, consisting of the namespace URI, a plus sign, and the local-name of the node. You can use the same algorithm for computing the expanded-name with an XPath expression, by concatenating for a given name the value returned by the `namespace-uri( )` function, the `+`, and the value returned by the `local-name( )` function. I’ll show you how to do this concatenation in the next section, in the discussion of the `concat( )` string function.

### String Functions

The set of XPath functions that operate on string arguments and/or return strings is extensive. Used in XSLT stylesheets, these functions give you enormous flexibility in terms of generating new content based on content in the source tree. In XPointers, you’ll find yourself using them most often in the predicate of XPath location steps.

Examples in this section all assume that the following XML document is being navigated via XPath:

```<dated_relics xmlns="http://mynamespace">
<relic>
<name>Smurf</name>
<price currency="USD">9.00</price>
</relic>
<relic>
<name>lava lamp</name>
<price currency="GBP">39.95</price>
</relic>
<relic>
<name>beanbag chair</name>
<price currency="EU">70.75</price>
</relic>
<relic>
<price currency="GBP">.37</price>
</relic>
<relic>
<name>blacklight</name>
<price currency="USD">323.65</price>
</relic>
<relic>
<name>VW mini-bus</name>
<price currency="USD">8500.00</price>
</relic>
<relic>
<name>open-hand chair</name>
<price currency="JPY">16865.78</price>
</relic>
</dated_relics>```

Table 4-2 summarizes the XPath string functions. A detailed discussion of each follows the table.

Table 4-2. String functions

| Function prototype | Returns | Description |
| --- | --- | --- |
| `string(anytype?)` | String | Returns the value of the `anytype` argument converted to a string |
| `concat(string1, string2, ...)` | String | Concatenates the values of the passed arguments into a single string and returns that string’s value |
| `starts-with(string1, string2)` | Boolean | Returns true if `string1` begins with `string2`, false otherwise |
| `contains(string1, string2)` | Boolean | Returns true if `string1` contains `string2`, false otherwise |
| `substring(string, number1, number2?)` | String | Returns the portion of `string` starting at character `number1`, for a length of `number2` characters |
| `substring-before(string1, string2)` | String | Returns the portion of `string1` occurring before `string2` |
| `substring-after(string1, string2)` | String | Returns the portion of `string1` following `string2` |
| `string-length(string?)` | Number | Returns the number of characters in `string` |
| `normalize-space(string?)` | String | Returns the whitespace-normalized value of `string` (that is, stripped of leading and trailing spaces, with multiple consecutive occurrences of whitespace replaced by a single space) |
| `translate(string1, string2, string3)` | String | Replaces individual characters appearing in both `string1` and `string2` with corresponding characters in `string3` |

#### string(anytype?)

As you might guess from its name, the `string( )` function converts the optional argument to a string of characters. There’s a set of rules for the way in which this conversion takes place, dependent on the data type of the argument:

When anytype is a node-set. When the argument is a node-set (for example, one returned by a location path), `string( )` returns the string-value of the first node in the node-set. If the indicated node-set is empty, the returned value is an empty string. If the argument is missing, it defaults to a node-set whose only member is the context node at the point of the call to `string( )`.

When anytype is a number. If you pass `string( )` an integer numeric argument, results are pretty much what you’d expect: you get back the number in the form of a string (e.g., “365” instead of 365). A fixed- or floating-point number is converted to a string including a decimal point, at least one digit to the left and one to the right of the decimal point, and a leading minus sign (for negative numbers only, obviously). The spec says that, beyond the one required digit, the number of digits after the decimal point will be just sufficient to distinguish the number from all other legal (IEEE 754) numeric values.

Refer back to the sample XML document. Passing `string( )` the value of a `price` element should produce different results, depending on the `price` element. Consider:

`string(number(price))`

Here, the price of a Smurf should be returned as the string “9” (no leading or trailing zeros, because none are needed to distinguish this or any other integer from all other legal numeric values) and, for the same reason, the price of a VW mini-bus as the string “8500”. The price of a love bead bracelet is returned as the string “0.37” (including a leading zero), and the price of any of the other relics in the document as simply the string-value of the element’s text node (e.g., the string “39.95” for the lava lamp).

Any form of the number 0, including positive and negative 0, is converted to the string “0”. Positive infinity is represented as the string “Infinity”; negative infinity is represented the same way, prepended with a minus sign: “-Infinity”.

You may encounter one other oddball condition when passing `string( )` a numeric argument, which arises when the argument is only supposedly a number, but for one reason or another is not. As mentioned in Chapter 2 in the discussion of data types, XPath represents a number-that-isn’t-a-number with the special value, `NaN`; if `NaN` (either literally or as the result of some calculation or function call) is passed to `string( )`, the returned value is the string “NaN.”
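The number-to-string rules above are compact enough to model in Python. The function name `xpath_number_to_string( )` is my own, and the sketch leans on an assumption worth stating: Python's `repr( )` of a float produces the shortest digit string that round-trips, which is how I read the spec's "only as many digits as are needed" rule. It is an illustration of the rules, not a substitute for a real XPath processor.

```python
import math
from decimal import Decimal


def xpath_number_to_string(value):
    """Model the number-to-string rules of string() described above."""
    value = float(value)
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "Infinity" if value > 0 else "-Infinity"
    if value == 0:
        return "0"  # covers both positive and negative zero
    if value.is_integer():
        return str(int(value))  # integers: no decimal point at all
    # repr() gives the shortest digits that round-trip; Decimal with the
    # "f" format spells them out without exponent notation.
    return format(Decimal(repr(value)), "f")


# Prices from the sample document, as string(number(price)) would see them.
for text in ("9.00", ".37", "39.95", "8500.00"):
    print(text, "->", xpath_number_to_string(float(text)))
# 9.00 -> 9
# .37 -> 0.37
# 39.95 -> 39.95
# 8500.00 -> 8500
```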

### Tip

The XPath Recommendation points out that passing numeric values to the `string( )` function is not intended to solve the general problem of formatting numbers as strings (for example, grouping every three positions with commas, forcing specific numbers of leading zeros, and so on). If you’re using XSLT, you can do all this with the `format-number( )` function and the `xsl:number` element.

When anytype is a Boolean. A Boolean argument to `string( )` returns either the value “true” or “false,” depending on the value of the argument.

When anytype is a string. A string argument passed to `string( )` returns the same string.

When anytype is any other data type. In a burst of involuted prose, the XPath spec says, “An object of a type other than the four basic types is converted to a string in a way that is dependent on that type.” Let’s see, the data types allowed under XPath are string, numeric, node-set, Boolean, and, uh....

This clause has been added to future-proof XPath against the introduction of new data types. In theory, the specifier (W3C or otherwise) of such a new data type would be obliged to provide some statement of how its values are to be represented as strings: how to derive their string-values, in short.

For instance, some future version of XPath (or an XPath-aware spec) might include a currency data type. This hypothetical spec might then say something like, “When represented as a string, values of the currency data type will include at least one integer (possibly 0) to the left of the decimal point, the decimal point itself, and at least a two-digit integer to the right of the decimal point, preceded by an optional minus sign and preceded or followed by an optional currency symbol.” And there would be your definition of how to expect `string( )` to behave when passed a currency value.

(How, exactly, an XPath function such as `string( )` is to know all these outside-of-XPath data conversion rules is a tricky question wisely sidestepped by the XPath spec.)

#### concat(string1, string2, ...)

The `concat( )` function takes at least two arguments and forges them into a single string. The function provides no padding with whitespace, so if you’re constructing (say) a list of tokens, or a set of words into a phrase or sentence, you’ve got to include the " " characters and perhaps punctuation separating one from the other. For instance, assume that the context node at a given point is any of the `relic` elements in the sample XML document. Then:

`concat(price, " (", price/@currency, ")")`

builds a string consisting of that relic’s price, a space, an opening parenthesis, the currency in which the price is represented, and a closing parenthesis. Given our sample document, for the seven relics in question, this would yield the strings (respectively):

`9.00 (USD)`, `39.95 (GBP)`, `70.75 (EU)`, `.37 (GBP)`, `323.65 (USD)`, `8500.00 (USD)`, `16865.78 (JPY)`

Note that the figures 9.00, .37, and 8500.00 do not follow the rules outlined above for representing numeric values as strings. If for some reason you want to force this representation, you need to explicitly convert the `price` element nodes’ string-values to numbers (using the `number( )` function discussed later), pass this result to `string( )`, and finally, pass that function’s result to `concat( )` as its first argument. Like this:

```concat(string(number(price)), " (",
price/@currency, ")")```

Also note in this case that the call to `string( )` is optional. Because `concat( )` expects a string-type argument, it does any necessary conversion automatically.
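To make the number-to-string step concrete, here is a rough Python sketch of what the nested call is doing. The helper names `xpath_number_to_string` and `price_label` are invented for illustration, and Python's `float( )` is used as a stand-in for XPath's stricter `number( )` conversion, so treat this as an approximation rather than a reference implementation.

```python
import math
from decimal import Decimal


def xpath_number_to_string(x: float) -> str:
    """Approximate how XPath 1.0 renders a number as a string."""
    if math.isnan(x):
        return "NaN"
    if math.isinf(x):
        return "Infinity" if x > 0 else "-Infinity"
    if x == int(x):
        # Integers (including negative zero) carry no decimal point.
        return str(int(x))
    # Non-integers use plain decimal notation, never an exponent.
    return format(Decimal(repr(x)), "f")


def price_label(price_text: str, currency: str) -> str:
    """Mimic concat(string(number(price)), " (", price/@currency, ")")."""
    # Assumption: float() stands in for number(); it accepts more
    # formats than XPath does, which is fine for these sample prices.
    return xpath_number_to_string(float(price_text)) + " (" + currency + ")"


for text, currency in [("9.00", "USD"), (".37", "GBP"), ("8500.00", "USD")]:
    print(price_label(text, currency))
```

Under these assumptions, the trailing zeros vanish and the bare decimal point gains a leading zero, which is the representation the explicit conversion is meant to force.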

Earlier in this chapter, in the discussion of name-related node-set functions, I mentioned that the `concat( )` function could be used to build an expanded-name for a given element. Following the logic of James Clark’s algorithm for this process, you can build an expanded-name for a given name using:

`concat(namespace-uri(node), "+", local-name(node))`

Thus, if `node` is a `relic` element from our sample XML document, the function call returns the string “http://mynamespace+relic.”

#### starts-with(string1, string2)

The `starts-with( )` function takes two arguments, returning either the Boolean value true if the first argument starts with the value of the second, or false otherwise.

Returning Boolean values makes `starts-with()` useful primarily in a location step’s predicate. Thus:

`//price[starts-with(., ".")]`

selects all the `price` elements whose string-values start with decimal points (i.e., no leading zeros or other digits). For our sample document, the resulting node-set consists of a single `price` element: the one for the love bead bracelet, with a string-value of “.37.”

#### contains(string1, string2)

Like `starts-with( )`, the ```contains( )``` function returns a Boolean true or false and, hence, is most commonly used in a predicate. And it too takes two arguments. The value returned by the function is true if the first argument contains the second, or false otherwise.

In our sample document, we could extract a node-set consisting of all relics that are chairs using a location path such as:

`//relic[contains(name, "chair")]`

This would locate a two-node node-set: the `relic` node representing a beanbag chair and the `relic` node representing an open-hand chair.

#### substring(string, number1, number2?)

Like most programming languages, XPath provides a `substring( )` function for extracting a portion of a larger string. (Some languages, notably those derived from BASIC, call this the `mid( )` function instead.) It takes at least two arguments: the larger string from which you want to select a portion and the starting point for the selection. A third optional argument can specify the number of characters to be extracted; if this argument isn’t supplied, the extraction starts at character `number1` and goes to the end of the string.

### Note

If you’re coming to XPath from a programming language that uses 0-based indexing (such as Java), in which the first item is #0, the second #1, and so on, be aware that XPath is 1-based: the first item is #1, and so on. That is, to select the first character of a string, use:

`substring(string, 1, 1)`

not:

`substring(string, 0, 1)`

What happens when the number passed as the third argument exceeds the length of the string in question? This is not an error; the function behaves as though you hadn’t passed a third argument at all. If the value of the second argument is greater than the length of the string, you’ll get back an empty string as a result.
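The 1-based indexing and the out-of-range behavior are easier to see side by side with a 0-based language. The following Python sketch emulates `substring( )` under the XPath 1.0 rule that a character is kept when its position is at least the rounded start and less than the rounded start plus the rounded length. The names `xpath_round` and `xpath_substring` are hypothetical, and the sketch assumes finite numeric arguments (no `NaN` or infinities).

```python
import math
from typing import Optional


def xpath_round(x: float) -> int:
    """Round to the nearest integer, with .5 going toward positive infinity."""
    return math.floor(x + 0.5)


def xpath_substring(s: str, start: float, length: Optional[float] = None) -> str:
    """Emulate XPath substring() using 1-based character positions."""
    first = xpath_round(start)              # first position wanted
    if length is None:
        end = len(s) + 1                    # run to the end of the string
    else:
        end = first + xpath_round(length)   # exclusive upper bound
    lo = max(first, 1)                      # positions below 1 do not exist
    hi = min(end, len(s) + 1)
    return s[lo - 1:hi - 1] if hi > lo else ""


print(xpath_substring("12345", 1, 1))    # first character
print(xpath_substring("12345", 0, 1))    # position 0 holds no character
print(xpath_substring("12345", 2, 100))  # over-long length is not an error
```

Note how a start of 0 with a length of 1 comes back empty: the one position requested is position 0, and the string's characters begin at position 1.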

One way in which I occasionally use the ```substring( )``` function is during testing of an XSLT stylesheet. At this point, I really don’t care to see the complete contents of text nodes, particularly lengthy ones. All I care to see is that the correct text nodes are showing up in the right places. So I use `substring( )` to return, say, just the first five characters of a text node, with an ellipsis ( `. . .` ) appended. Something like this:

`concat(substring(node, 1, 5), "...")`

Given the sample XML document used in this section, when iterating through a node-set consisting of all the `name` elements, this returns a series of substrings such as the following:

 Smurf... lava ... beanb... love ... black... VW mi... open-...

#### substring-before(string1, string2) and substring-after(string1, string2)

These two functions work very similarly. They each take two arguments: a string from which you want a portion of text extracted and a string where you want the extraction terminated or begun, respectively. In either case, the `string2` argument doesn’t appear in the result returned by the function; it’s simply used as a breakpoint. If `string2` doesn’t appear in `string1` at all, the function returns an empty string.

Consider a check-writing application using the values of the `price` elements from this section’s sample XML document. Such an application takes a number such as “12.34” (assuming that what is represented is in U.S. dollars) and converts it to a phrase like “12 dollars and 34 cents.” You could use the decimal point in the `price` element as the `string2` breakpoint for calling both `substring-before( )` and `substring-after( )`, like this:

```concat(substring-before(price, "."), " dollars and ",
substring-after(price, "."), " cents")```

This would naturally have to be tweaked in various ways to be perfectly workable. You’d have to come up with alternative phrasing for non-U.S. currency and also provide for the likelihood that a given price (such as that of the love beads) lacks anything at all to the left of its decimal point. But as a demonstration of the two functions in action, it works just fine.
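For readers who think more easily in a general-purpose language, here is a small Python sketch of the same idea. The function names `substring_before`, `substring_after`, and `check_phrase` are made up for this example; only the behavior (split at the first occurrence, empty string when the breakpoint is missing) comes from the XPath rules described above.

```python
def substring_before(s: str, sep: str) -> str:
    """Text before the first occurrence of sep, or "" if sep is absent."""
    i = s.find(sep)
    return s[:i] if i >= 0 else ""


def substring_after(s: str, sep: str) -> str:
    """Text after the first occurrence of sep, or "" if sep is absent."""
    i = s.find(sep)
    return s[i + len(sep):] if i >= 0 else ""


def check_phrase(price: str) -> str:
    """Mimic the concat() call that builds the check-writing phrase."""
    return (substring_before(price, ".") + " dollars and "
            + substring_after(price, ".") + " cents")


print(check_phrase("12.34"))
print(check_phrase(".37"))   # nothing to the left of the decimal point
```

The second call shows the tweak mentioned above: a price with nothing before the decimal point produces a phrase that starts with a bare space.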

#### string-length(string?)

Given a passed string argument, the `string-length( )` function returns the number of characters it contains. If no argument is passed, the function operates on the context node’s string-value.

This function is sometimes directly useful in its own right. For example, you can use it to tell you whether one string is longer than another (apply it to the two strings and compare the two values returned). More often, though, you’ll see `string-length( )` used as an argument passed to another function — or the values returned by other functions as arguments passed to it. For instance:

`string-length(substring-before(price, "."))`

This returns the number of digits to the left of the decimal point in a `price` element’s string-value, which might be useful information in formatting the number a certain way.

#### normalize-space(string?)

If you’ve spent any time at all poking around XML-related specifications, you’ve probably come across the verb “normalize” and its variants. The issue here is that an XML parser is required to preserve whitespace found in a source document — to pass it unchanged to a downstream application. To normalize content is to remove extraneous whitespace — to trim leading and trailing whitespace from strings (such as text nodes) and replace multiple successive occurrences of whitespace within a string with a single blank space.

That’s what the ```normalize-space( )``` function does, cleaning up the extraneous whitespace in a string so that what’s left is “pure content.” Why would you want to do this? Because if the text content of a document has been hand-entered, you want to be sure that no extra space has crept in as a result of keyboard errors. This extra space can make comparing one string-value to another fail, even when the two nodes are apparently identical. For instance:

`item1`

and:

`item1`

are not “equal,” although they may appear so to a casual human observer; the second `item1` is preceded and followed by newlines. That is, their normalized values — as returned by `normalize-space( )`, say — are equal, but their “raw” values are not.

Note that fixing up extraneous whitespace within a string isn’t the same as removing whitespace-only text nodes. Needing to do that isn’t necessarily a problem in its own right, but can be a very big problem in XSLT applications, where the extra text nodes can play havoc with operations, such as processing even-numbered nodes one way and odd-numbered nodes a different way. This is such a big issue in XSLT that the specification includes some of its own facilities for handling such text nodes. For example, there are both `xsl:strip-space` and `xsl:preserve-space` elements for identifying the whitespace-only text nodes you want collapsed or preserved, respectively.

The `normalize-space( )` function addresses this potential problem by ensuring that you’re dealing only with the true `#PCDATA` content in a complete document or any of its nodes: pass it a string, get back the normalized result; pass it any other data type, get back the normalized corresponding string-value; and pass it nothing at all to get back the normalized string-value of the context node.
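A minimal Python sketch of the same cleanup follows. It assumes, as XML does, that whitespace means only the space, tab, carriage return, and newline characters; the function name `normalize_space` is simply borrowed from the XPath function it imitates.

```python
import re


def normalize_space(s: str) -> str:
    """Collapse runs of XML whitespace to one space and trim both ends."""
    collapsed = re.sub(r"[ \t\r\n]+", " ", s)
    return collapsed.strip(" ")


raw_a = "item1"
raw_b = "\nitem1\n"   # same content, wrapped in newlines

print(raw_a == raw_b)                                    # raw values differ
print(normalize_space(raw_a) == normalize_space(raw_b))  # normalized values match
```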

#### translate(string1, string2, string3)

The `translate( )` function replaces individual characters in one string with different individual characters. The `string1` argument is the string whose characters you want to replace; `string2` includes the specific characters in `string1` that you want to replace; and `string3` includes the characters with which you want to replace those `string2` characters. So:

`translate("1234567890", "126", "ABX")`

replaces each occurrence of any of the single characters “1,” “2,” or “6,” with the single character “A,” “B,” or “X,” respectively. The value returned from this function call would thus be the string “AB345X7890.”

Like `normalize-space( )`, the ```translate( )``` function can be valuable in ensuring that two strings are equal, especially when their case — upper vs. lower — is possibly different, even though they’re otherwise apparently identical. Instead of comparing the two strings directly, compare their case-folded values using ```translate( )```. Thus:

```translate(somestring,
"abcdefghijklmnopqrstuvwxyz",
"ABCDEFGHIJKLMNOPQRSTUVWXYZ")```

Every lowercase “a” in `somestring` is replaced with a capital “A,” every “b” with a “B,” and so on. Characters in `somestring` that don’t match any characters in `string2` appear unchanged in the result.

Note that the lengths of `string2` and `string3` are usually identical but don’t need to be. If `string2` is longer than `string3`, ```translate( )``` serves to remove characters from `string1`. So:

```translate(somestring,
"abcdefghijklmnopqrstuvwxyz",
"")```

removes from `somestring` all lowercase letters, while:

```translate(somestring,
"abcdefghijklmnopqrstuvwxyz",
"ABCDEFGHIJKLM")```

uppercases all lowercase letters in `somestring` in the first half of the alphabet and removes all those appearing in the second half. If `somestring` is “VW mini-bus,” this returns the string “VW MII-B”: the uppercase letters “VW” (uppercase letters don’t appear in `string2`, so they’re passed unchanged), a space, the uppercased “mi” and “i” from “mini,” the hyphen, and the uppercased “b” from “bus.” The “n” in “mini” and the “us” in “bus” are suppressed.

If for some reason it’s desirable, `string3` may be longer than `string2`. This is not necessary, because the function considers only those characters in `string3` up to the length of `string2`; it’s just like you omitted those characters from `string3` in the first place.
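The three length cases (equal, `string3` shorter, `string3` longer) can be checked with a short Python emulation. The helper name `xpath_translate` is hypothetical; the mapping rule it follows is the one described above, with the added detail from the XPath spec that when a character occurs more than once in `string2`, its first occurrence decides the replacement.

```python
def xpath_translate(s: str, src: str, dst: str) -> str:
    """Emulate XPath translate(): per-character replace or delete."""
    table = {}
    for i, ch in enumerate(src):
        if ch in table:
            continue  # the first occurrence in src wins
        # No counterpart in dst means "delete this character".
        table[ch] = dst[i] if i < len(dst) else None

    out = []
    for ch in s:
        if ch not in table:
            out.append(ch)          # untouched characters pass through
        elif table[ch] is not None:
            out.append(table[ch])   # replaced characters
        # characters mapped to None are dropped
    return "".join(out)


lower = "abcdefghijklmnopqrstuvwxyz"
print(xpath_translate("1234567890", "126", "ABX"))
print(xpath_translate("VW mini-bus", lower, "ABCDEFGHIJKLM"))
print(xpath_translate("VW mini-bus", lower, ""))
```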

One interesting use of `translate( )`, in conjunction with `normalize-space( )`, is to “depunctuate” a string. Thus, you can turn the string “Eek!!! Is that a mouse, or what?” into “Eek Is that a mouse or what” using:

```normalize-space(translate("Eek!!! Is that a mouse, or what?", "!,?",
"   "))```

Here, the `translate( )` function itself replaces each occurrence of the exclamation mark, comma, and question mark characters with a blank space; the outer call to `normalize-space( )` then squashes all the resulting multiple blank spaces between words into one. (If you need to do this, be sure that the length of `string3` — the blank spaces — matches the number of characters in `string2` exactly.)

While `translate( )` can be useful for limited cases, it’s not really a good general-purpose “search-and-replace” tool — particularly because you can use it only to do single-character matches and replacements. If you need to replace a single character with two or more characters, two or more characters with a single one, or two or more characters with a different set of characters — or a word or phrase with an entirely different one — `translate( )` won’t help much, if at all.

In this case, you’ll have to do more exotic string manipulation, perhaps with XSLT or a programming language.

### Boolean Functions

As the term implies, the XPath Boolean functions all return Boolean true or false values. (And when you hear the word Boolean in an XPath context, a little flag should go up in your head as you think, “Predicate.” Hold that thought.)

These functions are all quite simple, with few little gotchas or complications. Thus, I don’t think it’s necessary to provide a sample XML document for them. But following Table 4-3, which summarizes the Boolean functions, I will provide discussion and examples of each.

Table 4-3. Boolean functions

| Function prototype | Returns | Description |
| --- | --- | --- |
| `boolean(anytype)` | boolean | Converts `anytype` to a Boolean true or false value |
| `not(boolean)` | boolean | Returns true if `boolean` is false, and false if `boolean` is true |
| `true( )` | boolean | Returns the value true |
| `false( )` | boolean | Returns the value false |
| `lang(string)` | boolean | Returns true or false, depending on whether the language in which the context node is presented matches the value of the `string` argument |

#### boolean(anytype)

The `boolean( )` function is similar to the ```string( )``` function introduced in the last section: it examines the argument passed to it and returns a value (true or false) depending on the argument’s value and data type. Also like the `string( )` function, you will almost never need to use `boolean( )` explicitly: in contexts (particularly predicates) where a logical true or false is expected, the `anytype` argument will be converted implicitly, according to the type (string, numeric, node-set, or Boolean) of the argument. Thus, each of the following subsections describes these implicit conversions as well as the result of explicit calls to `boolean( )`.

When anytype is a string. If `anytype` is at least one character long, the call to `boolean( )` returns true; otherwise, it returns false. Thus, the following two XPath expressions are functionally identical (the value true):

```boolean("some string")
string-length("some string") > 0```

Remember that a text node may consist entirely of whitespace, as discussed earlier. This whitespace may fool the human eye but won’t fool the `boolean( )` function; newlines, spaces, tabs, and so on each count as a string with a length greater than 0.

When anytype is numeric. A call to `boolean( )` with a numeric argument returns true if the argument is a legitimate number (i.e., not the special `NaN` value) and does not equal either positive or negative zero.

In discussing the behavior of `boolean( )` with a string argument, I showed you two expressions that produced the same result. To these two we can now add a third:

`boolean(string-length("some string"))`

The nested call to `string-length( )` returns a number, which is then passed to `boolean( )`. If the number passed is 0 — that is, if the string is empty — `boolean( )` returns false, otherwise true.

When anytype is a node-set. You already know, from the previous chapter, that you can use a location path in a predicate to test for a particular node’s existence. For example:

`//employee[emp_address]`

selects only those `employee` elements that have at least one `emp_address` child.

This form of the predicate is essentially a shortcut for using the `boolean( )` function with a node-set argument. It returns true if the node-set has at least one member, or false otherwise. That is, the following is equivalent to the short form just presented:

`//employee[boolean(emp_address)]`

When anytype is a Boolean value. If `anytype` is itself a Boolean value, the value returned by `boolean( )` is identical to the value of `anytype` itself. If `anytype` is true, `boolean( )` returns true; if false, it returns false.
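The four conversion rules fit in a few lines of Python. This is only a sketch under stated assumptions: a Python `list` stands in for a node-set, `str` for a string, `int` or `float` for a number, and the function name `xpath_boolean` is invented here.

```python
import math


def xpath_boolean(value) -> bool:
    """Emulate XPath boolean() for the four basic data types."""
    if isinstance(value, bool):          # check bool first: it is also an int
        return value
    if isinstance(value, (int, float)):
        # False for zero (positive or negative) and for NaN.
        return not (value == 0 or math.isnan(value))
    if isinstance(value, str):
        return len(value) > 0            # whitespace-only still counts
    if isinstance(value, list):          # stand-in for a node-set
        return len(value) > 0
    raise TypeError("unsupported type for this sketch")


print(xpath_boolean("some string"), xpath_boolean(""))
print(xpath_boolean(" \n"))              # whitespace is not empty
print(xpath_boolean(-0.0), xpath_boolean(float("nan")))
print(xpath_boolean([]), xpath_boolean(["emp_address"]))
```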

#### not(boolean)

The `not( )` function simply flips the value of its passed argument. If the value of `boolean` is true, `not( )` returns false and vice versa.

This function is rarely useful in its own right; rather, you pass as an argument some other expression returning a true or false value, enabling `not( )` to test for the negation of the other expression’s value. So you can select all `employee` elements that do not have at least one `emp_address` child using an expression such as:

`//employee[not(emp_address)]`

Many comparison operations in XPath look and behave peculiarly, and a particular trap to watch out for when using `not( )` is how it behaves differently from the `!=` (not-equal) operator in comparisons. Consider the following two location paths:

```//employee[@id != "emp1002"]
//employee[not(@id = "emp1002")]```

The first example selects all `employee` element nodes that have an `id` attribute whose value does not equal `emp1002`; the second selects all `employee` element nodes that do not have an `id` attribute whose value is `emp1002` (including those that have no `id` attribute at all). If you read those two clauses carefully, you’ll realize that the two location paths produce different results when encountering an element such as:

`<employee>...</employee>`

This `employee` element will not be located by the first example, because it has no `id` attribute at all; it will be located by the second example, though, because it has no `id` attribute with a value of `emp1002`.
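The difference comes from how XPath compares a node-set to a string: the comparison is true if it holds for at least one node in the set. A small Python model makes this visible. Here each employee is represented only by the list of its `id` attribute values (empty when the attribute is missing); the dictionary keys and the helper names `ne` and `eq` are illustrative assumptions.

```python
def ne(node_set, literal):
    """node-set != literal: true if SOME node's value differs."""
    return any(value != literal for value in node_set)


def eq(node_set, literal):
    """node-set = literal: true if SOME node's value matches."""
    return any(value == literal for value in node_set)


# Each entry maps a label to that employee's list of id attribute values.
employees = {
    "id is emp1002": ["emp1002"],
    "id is emp1003": ["emp1003"],
    "no id attribute": [],
}

# //employee[@id != "emp1002"]
first = [name for name, ids in employees.items() if ne(ids, "emp1002")]
# //employee[not(@id = "emp1002")]
second = [name for name, ids in employees.items() if not eq(ids, "emp1002")]

print(first)
print(second)
```

With an empty node-set there is no node for either comparison to succeed on, so `!=` is false while `not( )` of a false `=` is true.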

#### true( ) and false( )

These two Boolean functions are of rather limited utility. You pass them no arguments, and they always return the Boolean value corresponding to their names: `true( )` always returns the value true, and `false( )` always returns false. I’ve found them useful in making explicit — documenting, as it were — the purpose of some other Boolean test. Something like this:

`//book/title[contains(., "XML") = true( )]`

selects the `title` child of a `book` element only if the title’s string-value contains the string “XML.” Including the `= true( )` doesn’t change the test at all; it simply clarifies what you’re testing for.

Maybe the most common use of `true( )` and `false( )`, though, is in XSLT. While I don’t want to plunge further here into the details of that language, it’s possible to build XSLT-based “subroutines” called named templates. You can pass parameters to a named template in a manner similar to passing arguments to a function; if the named template is driven by parameters whose values it expects to be true or false, the simplest way to pass it either of those values is with the `true( )` or `false( )` function.

#### lang(string)

Use of this function depends on the use of an `xml:lang` attribute (either directly, in an instance document, or indirectly, via its DTD). If there is no such attribute in scope at the point of the call to `lang( )`, the function returns false.

However, if there is such an attribute in scope, ```lang( )``` returns true if the context node is “in” the language specified by the string argument passed to it. Consider this code fragment:

`<word xml:lang="EN">tarradiddle</word>`

Assuming that this element or its text-node child is the context node, the following function call returns true:

`lang("EN")`

More subtly, `lang( )` also returns true in a case-insensitive way; you could also use:

```lang("en")
lang("En")```

and so on, all of which would return true.

Now, the language codes the `xml:lang` attribute uses needn’t specify major languages only, such as “EN” for English or “DE” for German. They can also specify sublanguages, or language groups, using a hyphen to separate the major language code from the one for the sublanguage. English, for example, can be represented as American English or British English using `xml:lang` values such as “en-us” and “en-uk.” Suppose the code fragment above specified an `xml:lang` attribute as follows:

`<word xml:lang="EN-UK">tarradiddle</word>`

In this case, both of the following would return true:

```lang("EN")
lang("en-uk")```

The inverse is not true. Whether ```lang( )``` returns true or false, according to the spec, depends on whether the `xml:lang` value in force for the context node “is the same as or is a sublanguage of the language specified by the argument string.” Thus, if you pass `lang( )` a string that itself identifies a sublanguage, `lang( )` will not return true when the `xml:lang` value in force is a major language. That is:

`lang("en-uk")`

returns false when applied to the following code fragment:

`<word xml:lang="EN">tarradiddle</word>`
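The matching rule can be summarized in a few lines of Python. This sketch assumes the `xml:lang` value in force has already been looked up (with `None` meaning no such attribute is in scope); the function name `lang_matches` is hypothetical.

```python
from typing import Optional


def lang_matches(xml_lang: Optional[str], wanted: str) -> bool:
    """Emulate lang(): same language or a sublanguage, ignoring case."""
    if xml_lang is None:
        return False                      # no xml:lang in scope
    have = xml_lang.lower()
    want = wanted.lower()
    # Equal, or the value in force extends the argument with "-suffix".
    return have == want or have.startswith(want + "-")


print(lang_matches("EN", "en"))         # case-insensitive match
print(lang_matches("EN-UK", "EN"))      # sublanguage of the argument
print(lang_matches("EN", "en-uk"))      # the inverse does not hold
```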

### Numeric Functions

Numeric functions operate on their arguments to produce numeric results. Table 4-4 summarizes these functions; each is discussed separately following the table.

Examples in this section refer to the following simple XML document:

```<weights>
<weight label="1kg">1</weight>
<weight label="2.5kg">2.5</weight>
<weight label="1ton">1016.0469</weight>
</weights>```
Table 4-4. Numeric functions

| Function prototype | Returns | Description |
| --- | --- | --- |
| `number(anytype?)` | Number | Converts `anytype` to numeric value |
| `sum(nodeset)` | Number | Returns the sum of all nodes in `nodeset`, after converting each to a number |
| `floor(number)` | Number | Returns the largest integer that is less than or equal to `number` |
| `ceiling(number)` | Number | Returns the smallest integer that is greater than or equal to `number` |
| `round(number)` | Number | Returns the integer nearest in value to `number` (rounds up if number has a decimal portion of .5) |

#### number(anytype?)

Like the `string( )` and ```boolean( )``` functions discussed earlier, ```number( )``` converts an optional argument to some basic XPath data type — numeric, in this case — based on the data type of the passed argument. If no argument is supplied, the function by default converts the context node’s string-value to a number.

When anytype is a string. To be converted to a number, a string argument must consist of optional whitespace, followed by an optional minus sign (`-`), followed by the numeric value itself, followed by optional whitespace. Any other kind of string is converted to the special value `NaN`. Note in particular that the string may not include a leading plus sign (`+`) or formatting characters, such as grouping commas or currency symbols. Among other effects, this also causes “strings” expressing numbers as scientific notation (such as “3.296E3”) to be converted to `NaN`.

When anytype is a Boolean value. If the Boolean value is true, the value returned by `number( )` is 1; if false, ```number( )``` returns 0. Thus, in this location step:

`weight[number(contains(@label, "kg"))]`

`number( )` returns 1 for both the first and second `weight` elements, and 0 for the third.

When anytype is a node-set. In this case, the argument is first converted to a string as if it had been passed to the `string( )` function discussed earlier in this chapter, and then converted to a number according to the rules for converting strings to numbers. This follows common sense; using the sample XML document in this section, for example, this expression:

`number((//weight)[3])`

first locates the third `weight` element in the document, then returns the numeric value 1016.0469.

When anytype is numeric. Passing the `number( )` function a numeric argument simply returns the value of that argument.
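The string rule is stricter than what many languages accept, which a Python sketch can show. Python's own `float( )` happily parses a leading plus sign and scientific notation, so the emulation below uses a regular expression instead. The name `xpath_number` is an invention for this example, and node-set arguments are left out (they would first be reduced to a string).

```python
import math
import re

# Optional whitespace, optional minus sign, digits with an optional
# fraction (or a fraction alone), optional whitespace. Nothing else.
_NUMBER = re.compile(
    r"[ \t\r\n]*(-?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+))[ \t\r\n]*"
)


def xpath_number(value) -> float:
    """Emulate XPath number() for Boolean, numeric, and string arguments."""
    if isinstance(value, bool):
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)):
        return float(value)
    match = _NUMBER.fullmatch(value)
    return float(match.group(1)) if match else math.nan


for text in [" -12.5 ", ".37", "+3", "3.296E3", "1,000", "1kg"]:
    print(repr(text), xpath_number(text))
```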

#### sum(nodeset)

You can do simple summations across a node-set using the ```sum( )``` function; just pass it the node-set in question. Each node is first converted to a number using the rules of conversion laid out for the `number( )` function, then the summation is performed. We could sum up the values of all the `weight` elements in the sample document with an expression like:

`sum(//weight)`

which would return the value 1 + 2.5 + 1016.0469, or 1019.5469.

Be careful when using `sum( )` to ensure that you don’t run into a not-a-number wall; it takes only a single node with a non-numeric value to make the sum non-numeric as well. Applied to our sample document, this expression:

`sum(//weight/@label)`

returns `NaN`, because not all of the `label` attributes in the selected node-set are numeric. (Any node failing the numeric test is sufficient to produce a `NaN` result.)
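The way one bad value poisons the whole total follows from ordinary floating-point arithmetic, so it is easy to reproduce in Python. In this sketch the lists stand in for the string-values of the selected nodes, `to_number` is a simplified placeholder for XPath's `number( )` conversion (Python's `float( )` is more lenient than XPath about formats), and `xpath_sum` is a hypothetical name.

```python
import math


def to_number(text: str) -> float:
    """Simplified stand-in for number(): unparsable text becomes NaN."""
    try:
        return float(text)
    except ValueError:
        return math.nan


def xpath_sum(string_values) -> float:
    """Convert each node's string-value to a number, then add them up."""
    return sum((to_number(v) for v in string_values), 0.0)


weights = ["1", "2.5", "1016.0469"]   # string-values of //weight
labels = ["1kg", "2.5kg", "1ton"]     # string-values of //weight/@label

print(xpath_sum(weights))
print(xpath_sum(labels))              # NaN: adding NaN to anything gives NaN
```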

#### floor(number) and ceiling(number)

The `floor( )` and `ceiling( )` functions perform similar operations on their arguments. Both return integers nearest in value to that of the argument. For `floor( )`, the result is the largest integer less than or equal to the argument; for ```ceiling( )```, the smallest integer greater than or equal to the argument. So:

`floor((//weight)[3])`

returns 1016, while:

`ceiling((//weight)[3])`

returns 1017.

Note that these are not exactly rounding-down and up functions. Although they consider the fractional part of the passed argument, they simply check that it’s greater than 0. If so, `floor( )` returns the integer portion of the argument and `ceiling( )`, the integer portion plus 1. If not, `floor( )` and ```ceiling( )``` both return the same result: the integer portion of the argument.

Be careful when using `floor( )` and `ceiling( )` with negative arguments. A function call like:

`floor(3.2)`

returns 3, but:

`floor(-3.2)`

returns -4.
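Python's `math.floor( )` and `math.ceil( )` follow the same definitions (largest integer not above the argument, smallest integer not below it), so they make a convenient way to check your intuition about negative values. The sample values below are the ones used in this section plus one whole number.

```python
import math

samples = [1016.0469, 3.2, -3.2, 2.0]

# Map each sample to its (floor, ceiling) pair.
results = {x: (math.floor(x), math.ceil(x)) for x in samples}

for x, (lo, hi) in results.items():
    print(x, "floor:", lo, "ceiling:", hi)
```

For -3.2, the floor is -4 and the ceiling is -3: "down" always means toward negative infinity, not toward zero.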

#### round(number)

Unlike `floor( )` and `ceiling( )`, `round( )` rounds the argument up or down, depending on which direction the nearest integer lies. Thus, the result will always be identical to that of either `floor( )` or `ceiling( )`:

`round((//weight)[3])`

returns 1016, for example (the same result obtained using `floor( )`).

If the fractional part of the passed argument is exactly .5, the `round( )` function rounds up, consistent with common use (and therefore always behaving just like ```ceiling( )```). So:

`round((//weight)[2])`

returns 3.

As with `floor( )` and ```ceiling( )```, `round( )` can produce unexpected effects when passed a negative argument. (At least, they’re unexpected until you think a little about them.)

The calls:

```round(-3.4)
round(-3.5)
round(-3.8)```

return the values -3, -3, and -4, respectively.
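One way to reason about these results is to treat `round( )` as "add one half, then take the floor," which always sends an exact .5 toward positive infinity. The Python sketch below uses that formulation; the name `xpath_round` is hypothetical, and the shortcut ignores special cases such as `NaN`, infinities, and negative zero. It also prints Python's built-in `round( )` for contrast, since that function rounds halves to the nearest even integer instead.

```python
import math


def xpath_round(x: float) -> int:
    """Nearest integer; an exact .5 goes toward positive infinity."""
    return math.floor(x + 0.5)


for value in [1016.0469, 2.5, -3.4, -3.5, -3.8]:
    print(value, "xpath:", xpath_round(value), "python:", round(value))
```

For -3.5, the two candidates are -4 and -3, and the one closer to positive infinity is -3, which is why the result looks like rounding "up" even for a negative number.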

## XPath Numeric Operators

XPath includes a set of numeric operators for performing basic arithmetic operations. Don’t go looking for net-present-value or square-root operators; they don’t exist. But if you simply need to add, subtract, multiply, divide, or find a remainder of two numeric values, here’s your answer. Table 4-5 summarizes these numeric operators.

Table 4-5. XPath numeric operators

| Operator | Description | Example |
| --- | --- | --- |
| `+` | Adds two values | `(//weight) + (//weight)` |
| `-` | Subtracts one value from another | `(//weight) - (//weight)` |
| `*` | Multiplies one value times another | `(//weight) * 5` |
| `div` | Divides one value by another | `(//weight) div 1016.0469` |
| `mod` | Returns the remainder after dividing one value by another | `(//weight) mod 1016.0469` |

Most of these are straightforward, not requiring any further explanation; however, both the `div` and `mod` operators could use a bit more explanation.

### div

Why use a special `div` operator at all? Why not just use the more familiar forward slash character, `/`, to divide one value by another?

The answer is that a slash in an XPath expression is already freighted with meaning: it operates as a delimiter between location steps. (A good analogy, in XML terms, might be the required use of entity references, such as `&lt;` instead of the literal `<` character.)

### mod

Unlike `div`, the `mod` operator is common in other application languages as well as in XPath. The term “mod” comes from modulus or modulo — the formal arithmetic term for the remainder following a division. (Some languages use a single character, like the percent sign, `%`, to perform the same operation.)

I promised, earlier in this chapter, to show you how to use `mod` with the `position( )` function to process every nth node in a given node-set.

The basic idea is first to decide what n is, then to look at the remainder left after dividing a given node’s position in the node-set by n. If the remainder is 0, the node in question gets the special “every nth node” processing; otherwise, it doesn’t.

Suppose we have a list of employees in an XML document, coded something like this (irrelevant details omitted):

```<employees>
<employee>...</employee>
<employee>...</employee>
<employee>...</employee>
<employee>...</employee>
<employee>...</employee>
<employee>...</employee>
</employees>```

As you can see, this document includes six `employee` elements within the `employees` container element. If we want to perform some particular operation just for the `employee` elements in even-numbered positions within the node-set, we could use an XPath expression such as:

`//employee[position(  ) mod 2 = 0]`

If we want this operation to occur on every odd-numbered employee in the list, we change the predicate as in this example:

`//employee[position(  ) mod 2 = 1]`

If we want to select every third employee, change the 2 in the above examples to 3; for every fourth, change it to 4; and so on.
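The same selection logic can be tried out in Python. In this sketch a plain list stands in for the node-set, positions are counted from 1 to match `position( )`, and the function name `every_nth` and the placeholder employee labels are assumptions made for the example.

```python
def every_nth(nodes, n, remainder=0):
    """Keep nodes whose 1-based position leaves `remainder` when divided by n."""
    return [
        node
        for position, node in enumerate(nodes, start=1)
        if position % n == remainder
    ]


employees = ["emp1", "emp2", "emp3", "emp4", "emp5", "emp6"]

print(every_nth(employees, 2))               # position( ) mod 2 = 0
print(every_nth(employees, 2, remainder=1))  # position( ) mod 2 = 1
print(every_nth(employees, 3))               # position( ) mod 3 = 0
```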

The `mod` operator is also useful for certain conversion-type operations, such as converting raw quantities of something to dozens-of-something-plus-leftover-units and four-digit years to their two-digit values. For instance:

`1960 mod 100`

returns the value 60.
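Both conversions are easy to sketch in Python. One caution if you port such expressions: XPath's `mod` gives a result with the same sign as the dividend (per the XPath 1.0 spec, `-5 mod 2` is -1), which matches Python's `math.fmod( )` rather than its `%` operator. The helper names `xpath_mod` and `dozens_and_leftover` are hypothetical.

```python
import math


def xpath_mod(a: float, b: float) -> float:
    """Remainder with the sign of the dividend, like XPath's mod."""
    return math.fmod(a, b)


def dozens_and_leftover(quantity: int):
    """Split a raw count into whole dozens plus leftover units."""
    return math.floor(quantity / 12), xpath_mod(quantity, 12)


print(xpath_mod(1960, 100))        # two-digit year
print(dozens_and_leftover(30))     # two dozen, six left over
print(xpath_mod(-5, 2), -5 % 2)    # the two conventions disagree on sign
```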

 Which I think of as one more than a gazillion.
