C# Cookbook

by Stephen Teilhet, Jay Hilyard

January 2004

Beginner to intermediate

864 pages

22h 18m

English

O'Reilly Media, Inc.

8.7. A Better Tokenizer

Problem

A simple method of tokenizing—or breaking up a string into its discrete elements—was presented in Recipe 2.6. However, this is not powerful enough to handle all your string-tokenizing needs. You need a tokenizer—also referred to as a lexer—that can split up a string based on a well-defined set of characters.

Solution

The Split method of the Regex class lets us use a regular expression to describe the separators at which a string should be broken apart. This technique works especially well with equations, since the tokens of an equation are well-defined. For example, the code:

using System;
using System.Text.RegularExpressions;

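public static string[] Tokenize(string equation)
{
    // The capturing group around the character class makes Split keep the
    // delimiters (+ - * ( ) ^ \) in the returned array.
    Regex RE = new Regex(@"([\+\-\*\(\)\^\\])");
    return RE.Split(equation);
}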

will divide up a string according to the regular expression specified in the Regex constructor. In other words, the string passed to the Tokenize method is split at each of the delimiters +, -, *, (, ), ^, or \. Because the character class is wrapped in capturing parentheses, Split also returns the delimiters themselves as elements of the array. The following method calls Tokenize on the equation (y - 3)(3111*x^21 + x + 320):

public void TestTokenize( )
{
    foreach(string token in Tokenize("(y - 3)(3111*x^21 + x + 320)"))
        Console.WriteLine("String token = " + token.Trim( ));
}

which displays the following output:

String token =
String token = (
String token = y
String token = -
String token = 3
String token = )
String token =
String token = (
String token = 3111
String token = ...
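
The blank tokens in this output appear wherever a delimiter sits at the very start of the string or directly after another delimiter, because Split returns the empty string found between them. If those entries are not wanted, one option is to trim and filter the pieces before returning them. The following is a minimal sketch of that idea, not code from the recipe: the class name TokenizerSketch and the method TokenizeNonEmpty are hypothetical names introduced only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenizerSketch
{
    // Same pattern as the recipe: the capturing parentheses make
    // Regex.Split return the delimiters along with the text between them.
    private static readonly Regex Delimiters = new Regex(@"([\+\-\*\(\)\^\\])");

    // Hypothetical helper (an assumption, not part of the recipe): trims each
    // piece and skips the ones that are empty after trimming.
    public static string[] TokenizeNonEmpty(string equation)
    {
        List<string> tokens = new List<string>();
        foreach (string piece in Delimiters.Split(equation))
        {
            string token = piece.Trim();
            if (token.Length > 0)
                tokens.Add(token);
        }
        return tokens.ToArray();
    }

    public static void Main()
    {
        foreach (string token in TokenizeNonEmpty("(y - 3)(3111*x^21 + x + 320)"))
            Console.WriteLine("String token = " + token);
    }
}
```

For the same equation, this variant returns 16 tokens, beginning with (, y, -, 3, and ), with no blank entries. Whether dropping the blanks is appropriate depends on the caller: a parser that relies on token positions may prefer the unfiltered array.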