Tokenizing a String

Problem

You want to break a string into a list of tokens based on the occurrence of one or more delimiter characters.

Solution

Jeni Tennison implemented this solution (but the comments are my doing). The tokenizer returns each token as a node consisting of a <token> element text. It also defaults to character-level tokenization if the delimiter string is empty.

<xsl:template name="tokenize">
     <xsl:param name="string" select="''" />
  <xsl:param name="delimiters" select="' &#x9;&#xA;'" />
  <xsl:choose>
     <!-- Nothing to do if empty string -->
    <xsl:when test="not($string)" />
   
     <!-- No delimiters signals character level tokenization. -->
    <xsl:when test="not($delimiters)">
      <xsl:call-template name="_tokenize-characters">
        <xsl:with-param name="string" select="$string" />
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:call-template name="_tokenize-delimiters">
        <xsl:with-param name="string" select="$string" />
        <xsl:with-param name="delimiters" select="$delimiters" />
      </xsl:call-template>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
   
<xsl:template name="_tokenize-characters">
  <xsl:param name="string" />
  <xsl:if test="$string">
    <token><xsl:value-of select="substring($string, 1, 1)" /></token>
    <xsl:call-template name="_tokenize-characters">
      <xsl:with-param name="string" select="substring($string, 2)" />
    </xsl:call-template>
  </xsl:if>
</xsl:template>
   
<xsl:template name="_tokenize-delimiters">
  <xsl:param name="string" />
  <xsl:param name="delimiters" />
  <xsl:param name="last-delimit"/> 
  <!-- Extract a delimiter -->
  <xsl:variable name="delimiter" select="substring($delimiters, 1, 1)" />
  <xsl:choose>
     <!-- If the delimiter is empty we have a token -->
    <xsl:when test="not($delimiter)">
      <token><xsl:value-of select="$string"/></token>
    </xsl:when>
     <!-- If the string contains at least one delimiter we must split it -->
    <xsl:when test="contains($string, $delimiter)">
      <!-- If it starts with the delimiter we don't need to handle the -->
       <!-- before part -->
      <xsl:if test="not(starts-with($string, $delimiter))">
         <!-- Handle the part that comes before the current delimiter -->
         <!-- with the next delimiter. If ther is no next the first test -->
         <!-- in this template will detect the token -->
        <xsl:call-template name="_tokenize-delimiters">
          <xsl:with-param name="string" 
                          select="substring-before($string, $delimiter)" />
          <xsl:with-param name="delimiters" 
                          select="substring($delimiters, 2)" />
        </xsl:call-template>
      </xsl:if>
       <!-- Handle the part that comes after the delimiter using the -->
       <!-- current delimiter -->
      <xsl:call-template name="_tokenize-delimiters">
        <xsl:with-param name="string" 
                        select="substring-after($string, $delimiter)" />
        <xsl:with-param name="delimiters" select="$delimiters" />
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
       <!-- No occurances of current delimiter so move on to next -->
      <xsl:call-template name="_tokenize-delimiters">
        <xsl:with-param name="string" 
                        select="$string" />
        <xsl:with-param name="delimiters" 
                        select="substring($delimiters, 2)" />
      </xsl:call-template>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
   
</xsl:stylesheet>

Discussion

Tokenization is a common string-processing task. In languages with powerful regular-expression engines, tokenization is trivial. In this area, languages such as Perl, Python, JavaScript, and Tcl currently outshine XSLT. However, this recipe shows that XSLT can deal with tokenization if you must stay within the bounds of pure XSLT. If you are willing to use extensions, then you can defer to another language for low-level string manipulations such as tokenization.

If you use the XSLT approach and your processor does not optimize for tail-recursion, then you may want to use a divide-and-conquer algorithm for character tokenization:

<xsl:template name="_tokenize-characters">
  <xsl:param name="string" />
  <xsl:param name="len" select="string-length($string)"/>
  <xsl:choose>
       <xsl:when test="$len = 1">
       <token><xsl:value-of select="$string"/></token>
       </xsl:when>
       <xsl:otherwise>
      <xsl:call-template name="_tokenize-characters">
        <xsl:with-param name="string" 
                       select="substring($string, 1, floor($len div 2))" />
        <xsl:with-param name="len" select="floor($len div 2)"/>
      </xsl:call-template>
      <xsl:call-template name="_tokenize-characters">
        <xsl:with-param name="string" 
                      select="substring($string, floor($len div 2) + 1)" />
        <xsl:with-param name="len" select="ceiling($len div 2)"/>
      </xsl:call-template>
       </xsl:otherwise>
     </xsl:choose>
</xsl:template>

See Also

Chapter 12 shows how to access the RegEx facility in JavaScript if your XSLT processor allows JavaScript-based extensions. Java also has a built-in tokenizer (java.util.StringTokenizer).

Get XSLT Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.