Chapter 4. Selecting and Traversing

I choose a block of marble and chop off whatever I don’t need.

Francois-Auguste Rodin

I have [traversed] the length and breadth of this country and talked with the best people, and I can assure you that data processing is a fad that won’t last out the year.

The editor in charge of business books for a major publishing company, 1957

Doing anything remotely interesting in XSLT involves two related operations: determining which elements to visit (selecting) and determining in what order you want to visit them (traversing). Selecting is largely in the domain of XPath, a separate specification but one intimately related to XSLT. Traversing is a function of built-in control structures of XSLT and how you organize your templates to harness them.

XSLT veterans are unlikely to find much revelation in this particular chapter. Nevertheless, it is important for two reasons. First, the ideas presented in these recipes distinguish XSLT from other programming languages and therefore tend to be things that trip up novices on their first attempts to master XSLT. Second, the examples covered in this section are the primitive building blocks of many more complex recipes covered in later chapters. Virtually everything one does in XSLT involves application of selection and traversal. By analogy to cooking, knowing how to make a good brown stock is a prerequisite to making a sauce espagnole![7] This chapter is comprised of examples of the “brown stock” variety and should be mastered before proceeding to more advanced applications of XSLT.

Although this chapter presents some primitive examples, it is not an XPath or XSLT tutorial. The reader should know the basics. I assume you are familiar with the syntax of XPath and that you know the meaning of common XPath expressions. I also assume you know what the default XSLT processing rules are and how XSLT determines what templates should be processed next. If you feel you need some help in these areas, I would recommend either Michael Kay’s XSLT Programmer’s Reference (Wrox Press, 2001) or Doug Tidwell’s XSLT (O’Reilly, 2002).

Several examples in this section use computer-science terminology commonly associated with algorithms involving tree-data structures. The uninitiated may wonder what trees have to do with XML. The answer is that XML can be viewed as a language for specifying trees and XSLT as a language that processes these trees. In fact, it is possible to make an XSLT processor process non-XML input by making the processor think it is a tree by using a SAX driver. Michael Kay demonstrates this technique in Professional XSLT (Wrox, 2001), as does Eric M. Burke in Java and XSLT (O’Reilly, 2001). In any case, the important concept that motivates the examples in this chapter is that thinking in terms of trees allows one to discover several useful XML processing techniques.

The examples provided in the chapter draw heavily on the following two test documents. To avoid repeating them in each recipe, I list them in Example 4-1 and Example 4-2.

Example 4-1. SalesBySalesPerson.xml

<?xml version="1.0" encoding="UTF-8"?>
<salesBySalesperson>
  <salesperson name="John Adams" seniority="1">
     <product sku="10000" totalSales="10000.00"/>
     <product sku="20000" totalSales="50000.00"/>
     <product sku="25000" totalSales="920000.00"/>
  </salesperson>
  <salesperson name="Wendy Long" seniority="5">
     <product sku="10000" totalSales="990000.00"/>
     <product sku="20000" totalSales="150000.00"/>
     <product sku="30000" totalSales="5500.00"/>
  </salesperson>
  <salesperson name="Willie B. Aggressive" seniority="10">
     <product sku="10000" totalSales="1110000.00"/>
     <product sku="20000" totalSales="150000.00"/>
     <product sku="25000" totalSales="2920000.00"/>
     <product sku="30000" totalSales="115500.00"/>
     <product sku="70000" totalSales="10000.00"/>
  </salesperson>
  <salesperson name="Arty Outtolunch" seniority="10"/>
</salesBySalesperson>

Example 4-2. orgchart.xml

<?xml version="1.0" encoding="UTF-8"?>
<employee name="Jil Michel" sex="female">
     <employee name="Nancy Pratt" sex="female">
          <employee name="Phill McKraken" sex="male"/>
          <employee name="Ima Little" sex="female">
               <employee name="Betsy Ross" sex="female"/>
          </employee>
     </employee>
     <employee name="Jane Doe" sex="female">
          <employee name="Walter H. Potter" sex="male"/>
          <employee name="Wendy B.K. McDonald" sex="female">
               <employee name="Craig F. Frye" sex="male"/>
               <employee name="Hardy Hamburg" sex="male"/>
               <employee name="Rich Shaker" sex="male"/>
          </employee>
     </employee>
     <employee name="Mike Rosenbaum" sex="male">
          <employee name="Cindy Post-Kellog" sex="female">
               <employee name="Allen Bran" sex="male"/>
               <employee name="Frank N. Berry" sex="male"/>
               <employee name="Jack Apple" sex="male"/>
          </employee>
          <employee name="Oscar A. Winner" sex="male">
               <employee name="Jack Nickolas" sex="male">
                    <employee name="R.P. McMurphy" sex="male"/>
               </employee>
               <employee name="Tom Hanks" sex="male">
                    <employee name="Forest Gump" sex="male"/>
                    <employee name="Andrew Beckett" sex="male"/>
               </employee>
               <employee name="Susan Sarandon" sex="female">
                    <employee name="Helen Prejean" sex="female"/>
               </employee>
          </employee>
     </employee>
</employee>

Optimizing Node Selections

Problem

You want to improve the efficiency of your stylesheet when selecting and traversing nodes from a large XML document.

Solution

You can optimize XSLT in many ways and, when applicable, each example in this book addresses performance issues. This particular example explains several strategies for minimizing the number of elements your stylesheet will process when selecting and traversing a document.

Avoid unnecessary reliance on default processing rules

If you know exactly what you want to process and where it is located in a documents structure, you are better off going directly to the relevant nodes. Consider the following complex Microsoft Visio XML document, shown in Example 4-3, and two stylesheets for extracting shapes shown in Example 4-4 to Example 4-5.

Example 4-3. A partial Visio XML document

<VisioDocument xmlns="urn:schemas-microsoft-com:office:visio">
     <DocumentProperties>
          <!--elements elided -->
     <DocumentSettings TopPage="0" DefaultTextStyle="26" DefaultLineStyle="26"
          DefaultFillStyle="26" DefaultGuideStyle="4">
          <!--elements elided -->
     </DocumentSettings>
     <Colors>
          <!--elements elided -->
     </Colors>
     <PrintSetup>
          <!--elements elided -->
     </PrintSetup>
     <Fonts>
          <!--elements elided -->
     </Fonts>
     <StyleSheets>
          <!--elements elided -->
     </StyleSheets>
     <DocumentSheet NameU="TheDoc" LineStyle="0" FillStyle="0" TextStyle="0">
          <!--elements elided -->
     </DocumentSheet>
     <Masters>
          <!--elements elided -->
     </Masters>
     <Pages>
          <Page ID="0" NameU="Page-1" ViewScale="1" ViewCenterX="1.484375" 
               ViewCenterY="2.2552083333333">
            <!--elements elided -->
            <Shapes>
               <Shape ID="1" NameU="Square" Type="Shape" Master="0">
                 <!--elements elided -->
               </Shape>
               <Shape ID="2" Type="Shape" Master="0">
                 <!--elements elided -->
               </Shape>
            </Shapes>
          </Page>
     </Pages>
     <!--elements elided -->
</VisioDocument>

Example 4-4. A stylesheet that relies on default rules

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:vxd="urn:schemas-microsoft-com:office:visio">
   
<xsl:output method="text" />
   
<xsl:strip-space elements="*"/>
   
<xsl:template match="vxd:Shape">
     <xsl:value-of select="@ID"/>-<xsl:value-of 
                         select="@NameU"/><xsl:text>&#xa;</xsl:text>
</xsl:template>
   
<xsl:template match="text(  )"/>
     
</xsl:stylesheet>

Example 4-5. A stylesheet that bypassesdefault rules

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:vxd="urn:schemas-microsoft-com:office:visio">
   
<xsl:output method="text" />
   
<xsl:strip-space elements="*"/>
   
<xsl:template match="/">
     <xsl:apply-templates
         select="vxd:VisioDocument/vxd:Pages/vxd:Page/vxd:Shapes/vxd:Shape"/>
</xsl:template>
   
<xsl:template match="vxd:Shape">
     <xsl:value-of select="@ID"/>-<xsl:value-of
     select="@NameU"/><xsl:text>&#xa;</xsl:text>
</xsl:template>
   
<xsl:template match="text(  )"/>
     
</xsl:stylesheet>

By selecting only the elements of interest (e.g., Shapes), the second stylesheet avoids processing the entire document.

Avoid using the descendant, descendant-or-self, preceding, or following axes when they aren’t necessary

On very large documents, you will see a noticeable performance improvement if you prefer explicit XPath expressions to expressions that select nodes at an arbitrary location in the tree. Descendant, descendant-or-self, preceding, or following axes fall into the latter category.

Preferable:

<xsl:template match="/">
     <xsl:apply-templates 
         select="vxd:VisioDocument/vxd:Pages/vxd:Page/vxd:Shapes/vxd:Shape"/>
</xsl:template>

Potential performance problem:

<xsl:template match="/">
     <xsl:apply-templates select="//vxd:Shape"/>
</xsl:template>

However, many XSLT gurus overstate the performance penalty of these axes. If you actually measure performance on small- to medium-sized documents (50,000 elements or less) on a wide variety of processor implementations, you will see little performance penalty. So if you need the flexibility or convenience that these relative paths give, do not hesitate to use them until performance actually becomes a factor. On the other hand, if you have no foreknowledge of the size of document your stylesheet may be asked to process, you would be wise to err on the conservative side.

Prefer “selecting” and “matching” over “filtering”

If you need to process nodes that meet a particular criterion, then try to select the nodes rather than filtering them with <xsl:if> or <xsl:choose>. For example, consider the following three XSLT fragments in Example 4-6 through Example 4-8 that search for a person element named “John Smith” and display his age.

Example 4-6. Processing a specific person by filtering

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
     <xsl:output method="text" />
     
     <xsl:template match="/">
          <xsl:apply-templates select="people"/>
     </xsl:template>
     
     <xsl:template match="people">
          <xsl:for-each select="person">
               <xsl:if test="@name='John Smith' ">
                    <xsl:value-of select="@age"/>
               </xsl:if>
          </xsl:for-each>
     </xsl:template>
</xsl:stylesheet>

Example 4-7. Processing a specific person by matching

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   
     <xsl:output method="text" />
     
     <xsl:template match="/">
          <xsl:apply-templates select="people/person"/>
     </xsl:template>
     
     <xsl:template match="person[@name='John Smith']">
          <xsl:value-of select="@age"/>
     </xsl:template>
     
     <xsl:template match="person"/>
</xsl:stylesheet>

Example 4-8. Processing a specific person by selecting

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   
     <xsl:output method="text" />
     
     <xsl:template match="/">
          <xsl:apply-templates select="people"/>
     </xsl:template>
     
     <xsl:template match="people">
          <xsl:for-each select="person[@name='John Smith']">
               <xsl:value-of select="@age"/>
          </xsl:for-each>
     </xsl:template>
</xsl:stylesheet>

Testing these stylesheets on a 100,000-element document using Saxon showed the selecting method to be 30% faster than the matching method and 44% faster than the filtering method. However, with Xalan and XT, the results were much closer, but the selecting method still had a slight advantage. Oddly, filtering was slightly faster (12%) than matching when using Xalan.

Cache frequently used node sets in variables

Stylesheets often refer to the same set of nodes. Rather than selecting those nodes repeatedly, it might make sense to store them in a variable. I say “might” because your mileage will vary depending on how much work needs to be done to access those nodes and how frequently you refer to them. One must also consider the memory overhead of caching nodes in this way. In fact, some processors may not cache a variable’s data unless the variable is referenced multiple times.

One case where a variable is clearly necessary is when the result was obtained from an expensive process such as a sort, as in the following:

<xsl:variable name="products">
  <xsl:for-each select="//product">
    <xsl:sort select="@sku"/>
    <xsl:copy-of select="."/>
  </xsl:for-each>
</xsl:variable>

Example 4-9 through Example 4-11 show a variables with no effect:

Example 4-9. Loading an external lookup table inline

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
       xmlns:us="http://www.ora.com/XSLTCookbook/nampespaces/us">
   
     <xsl:strip-space elements="*"/>
     
     <xsl:output method="text" />
   
     <xsl:template match="person">
          <xsl:value-of select="@name"/>
          <xsl:text>&#xa;</xsl:text>
          <xsl:apply-templates/>
     </xsl:template>
     
     <xsl:template match="address">
          <xsl:variable name="state" select="@state"/>
          <xsl:value-of select="@line1"/>
          <xsl:text>&#xa;</xsl:text>
          <xsl:if test="@line2">
          <xsl:value-of select="@line2"/>
          <xsl:text>&#xa;</xsl:text>
          </xsl:if>
          <xsl:value-of select="@city"/>
          <xsl:text>, </xsl:text>
          <xsl:value-of select="document('states.xml')
          /*/us:state[@abbrev=$state]/@name"/>
          <xsl:text> </xsl:text>
          <xsl:value-of select="@zip"/><xsl:text>&#xa;&#xa;</xsl:text>
     </xsl:template>
     
     <xsl:template match="text(  )"/>
     
</xsl:stylesheet>

Example 4-10. Loading an external lookup table into a variable

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
       xmlns:us="http://www.ora.com/XSLTCookbook/nampespaces/us">
   
     <xsl:strip-space elements="*"/>
       
     <xsl:output method="text" />
   
     <xsl:variable name="states" select="document('states.xml')/*/us:state"/>
     
     <xsl:template match="person">
          <xsl:value-of select="@name"/>
          <xsl:text>&#xa;</xsl:text>
          <xsl:apply-templates/>
     </xsl:template>
     
     <xsl:template match="address">
          <xsl:variable name="state" select="@state"/>
          <xsl:value-of select="@line1"/>
          <xsl:text>&#xa;</xsl:text>
          <xsl:if test="@line2">
          <xsl:value-of select="@line2"/>
          <xsl:text>&#xa;</xsl:text>
          </xsl:if>
          <xsl:value-of select="@city"/>
          <xsl:text>, </xsl:text>
          <xsl:value-of select="$states[@abbrev=$state]/@name"/>
          <xsl:text> </xsl:text>
          <xsl:value-of select="@zip"/><xsl:text>&#xa;&#xa;</xsl:text>
     </xsl:template>
   
     <xsl:template match="text(  )"/>

Example 4-11. A partial listing of states.xml

<?xml version="1.0" encoding="UTF-8"?>
<us:states xmlns:us="http://www.ora.com/XSLTCookbook/nampespaces/us">
     <us:state name="ALABAMA" abbrev="AL"/>
     <us:state name="ALASKA" abbrev="AK"/>
     <us:state name="AMERICAN SAMOA" abbrev="AS"/>
     <us:state name="ARIZONA" abbrev="AZ"/>
     <us:state name="ARKANSAS" abbrev="AR"/>
     <us:state name="CALIFORNIA" abbrev="CA"/>
     <us:state name="COLORADO" abbrev="CO"/>
     <us:state name="CONNECTICUT" abbrev="CT"/>
     <!--- elided -->
     <us:state name="WEST VIRGINIA" abbrev="WV"/>
     <us:state name="WISCONSIN" abbrev="WI"/>
     <us:state name="WYOMING" abbrev="WY"/>
</us:states>

You might expect that the second template is faster because it loads states.xml only once. However, for an XSLT processor to conform to the no-side-effect rule, it must load the document just once in either case; therefore, the variable provides at most an aesthetic improvement.

Use xsl:key if nodes are frequently selected by static criteria

Using a variable to cache nodes is fine when the set of nodes is constant. However, if you make repeated access to nodes selected by some criteria, you should establish an index. XSLT uses the xsl:key element in conjunction with the key( ) function to establish an index. This solution is not guaranteed by the XSLT standard to result in a performance boost, but on most implementations, it has the desired effect if used wisely. Here “wisely” means that the cost of building the index is outweighed by the frequency in which it is used. If your stylesheet uses the key only once or twice, then do not expect to see any noticeable improvement, and you may actually experience degradation.

Using a key significantly improves the following modification to the address stylesheet when it is used to process a document with many people elements:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
       xmlns:us="http://www.ora.com/XSLTCoobook/nampespaces/us">
   
     <xsl:strip-space elements="*"/>
     
     <xsl:output method="text" />
   
     <xsl:key name="state" match="us:state" use="@abbrev"/>
     
     <xsl:template match="person">
          <xsl:value-of select="@name"/>
          <xsl:text>&#xa;</xsl:text>
          <xsl:apply-templates/>
     </xsl:template>
     
     <xsl:template match="address">
          <xsl:variable name="state" select="@state"/>
          <xsl:value-of select="@line1"/>
          <xsl:text>&#xa;</xsl:text>
          <xsl:if test="@line2">
          <xsl:value-of select="@line2"/>
          <xsl:text>&#xa;</xsl:text>
          </xsl:if>
          <xsl:value-of select="@city"/>
          <xsl:text>, </xsl:text>
          <!-- We use for-each to change context to the states.xml doc 
          because key(  ) only locates elements in the same document as the 
          context node -->
          <xsl:for-each select="document('states.xml')">
               <xsl:value-of select="key('state',$state)/@name"/>
          </xsl:for-each>
          <xsl:text> </xsl:text>
          <xsl:value-of select="@zip"/><xsl:text>&#xa;&#xa;</xsl:text>
     </xsl:template>
     
     <xsl:template match="text(  )"/>

Discussion

Avoid unnecessarily reliance on default processing rules

As many of you know, XSLT has built-in processing rules that are invoked when XSLT is called on to process a node but no template in the stylesheet matches the node. Two rules are relevant here:

  • If no template rule matches the root (/) in the current calling mode, <xsl:apply-templates/> is automatically invoked to process the children of the root in the current calling mode.

  • If no template rule matches an element in the current calling mode, <xsl:apply-templates/> is automatically invoked to process the children of the element in the current calling mode.

These rules are useful since they effectively cause XSLT to descend the document tree until it finds something useful to do. Relying on these rules is a good practice if you want to create robust stylesheets that are insensitive to certain changes in the document structure. However, when you process documents whose structure is unlikely to change, it is best to eliminate unnecessary processing by selecting relevant nodes as early as possible.

Prefer “selecting” and “matching” over “filtering”

This recommendation is arguably the weakest of the four from a pure optimization standpoint. As XSLT processors become smarter, the performance difference between these styles is likely to grow smaller. Aesthetics, maintainability, and clarity are better drivers than performance for preferring one form over another.

Cache frequently used node sets in variables

Although judicious use of variables can result in performance improvements, you should not obsess over their use for this reason. Rather, use variables when they help clarify the code or make it easier to maintain. In two situations, variables are mandatory simply because of the nature of XSLT. Michael Kay calls them context-sensitive and temporary-tree variables. Context-sensitive variables arise in nested for-each loops when the inner loop must refer to the current context of the outer loop. In Example 4-12, the code fragment is borrowed from Example 4-14 in Recipe 4.2, $product1 is a necessary context-sensitive variable, while $product2 simply helps clarify the code.

Example 4-12. A context-sensitive variable

<xsl:template name="process-products">
     <xsl:param name="products"/>
     <xsl:for-each select="$products">
       <xsl:variable name="product1" select="."/>
       <xsl:for-each select="$products">
         <xsl:variable name="product2" select="."/>
          <-- Don't analyze the product against itself -->
          <xsl:if test="generate-id($product1) != generate-id($product2)">
            <xsl:call-template name="show-products-sold-in-common">
              <xsl:with-param name="product1" select="$product1"/>
              <xsl:with-param name="product2" select="$product2"/>
            </xsl:call-template>
          </xsl:if>
       </xsl:for-each>
     </xsl:for-each>
</xsl:template>

Temporary-tree variables arise when you want to store the result of XSLT code in a variable. This scenario arises most often when capturing the value returned by the call to a template (see Example 4-13).

Example 4-13. Storing a temporary tree

<xsl:varaible name="result-tree">
     <xsl:call-template name="compute-something"/>
</xsl:variable>

Warning

XSLT 2.0 (and 1.1 for the processors that support it) allow such variables to be processed as node sets, which is very convenient if you want to do anything significant with them. XSLT 1.0 calls them result-tree fragments, and you cannot do much more than output them without first applying a nonstandard extension node-set( ) function. With 20/20 hindsight, result-tree fragments are the classic example of a really bad idea. However, when XSLT 1.0 was developed, it was unclear how it would be used and whether technical problems would prevent the creation of temporary node trees.

Use xsl:key if nodes will be selected by static criteria frequently

Of all the optimization methods keys, those of xsl:key are the most likely to give you the biggest bang for the buck if used properly. Of course, nothing in the XSLT standard says that keys must result in the creation of an index, but any decent implementation will create one. Saxon and Xalan are two processors that show significant gain from the key facility.



[7] Sauce espagnole is highly concentrated brown stock with tomatoes bound with roux and reduced. The brown stock is made from veal and beef shanks. Sauce espagnole is necessary for making demi-glace, the “mother sauce” for many other brown sauces. When it comes to “reuse,” the chefs have the software developers beat!

Get XSLT Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.