8.7. A Better Tokenizer
Problem
A simple method of tokenizing—or breaking up a string into its discrete elements—was presented in Recipe 2.6. However, this is not powerful enough to handle all your string-tokenizing needs. You need a tokenizer—also referred to as a lexer—that can split up a string based on a well-defined set of characters.
Solution
Using the Split method of the Regex class, we can supply a regular expression that indicates the types of tokens and separators we are interested in gathering. This technique works especially well with equations, since the tokens of an equation are well-defined. For example, the code:
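    using System;
    using System.Text.RegularExpressions;

    public static string[] Tokenize(string equation)
    {
        // Delimiters: + - * ( ) ^ and \. The capturing group makes Split
        // return each delimiter as a token of its own.
        Regex RE = new Regex(@"([\+\-\*\(\)\^\\])");
        return (RE.Split(equation));
    }

will divide up a string according to the regular expression specified in the Regex constructor. In other words, the string passed in to the Tokenize method will be divided up based on the delimiters +, -, *, (, ), ^, or \. Because the pattern is enclosed in capturing parentheses, Split also returns each delimiter as a token rather than discarding it.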
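The parentheses around the character class affect what Split returns. The following is a minimal sketch of that difference, not part of the recipe itself; the class SplitDemo and the helpers WithCapture and WithoutCapture are hypothetical names introduced only for illustration, and the pattern is shortened to three delimiters:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Hypothetical helper: the pattern has a capturing group, so each
    // delimiter matched by Split is also returned in the result array.
    public static string[] WithCapture(string input)
    {
        return Regex.Split(input, @"([\+\-\*])");
    }

    // Hypothetical helper: same delimiters without a capturing group,
    // so Split discards the delimiters and returns only the text between them.
    public static string[] WithoutCapture(string input)
    {
        return Regex.Split(input, @"[\+\-\*]");
    }
}
```

For the input a+b*c, WithCapture returns the five elements a, +, b, *, c, while WithoutCapture returns only a, b, c. A tokenizer for equations needs the first behavior, since the operators are tokens in their own right.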
The following method calls the Tokenize method to tokenize the equation (y - 3)(3111*x^21 + x + 320):

    public void TestTokenize()
    {
        foreach (string token in Tokenize("(y - 3)(3111*x^21 + x + 320)"))
            Console.WriteLine("String token = " + token.Trim());
    }

which displays the following output:
    String token =
    String token = (
    String token = y
    String token = -
    String token = 3
    String token = )
    String token =
    String token = (
    String token = 3111
    String token = ...
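The empty tokens in this output appear wherever a delimiter sits at the start or end of the string or directly next to another delimiter, because Split reports the empty text between them. If those entries are unwanted, one option is to trim and filter the tokens after splitting. This is only a sketch under that assumption; the class TokenFilter and the method TokenizeNonEmpty are hypothetical names, not part of the recipe:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenFilter
{
    // Hypothetical variant of Tokenize: uses the same pattern as the recipe,
    // but trims each token and skips any that are empty after trimming.
    public static string[] TokenizeNonEmpty(string equation)
    {
        Regex re = new Regex(@"([\+\-\*\(\)\^\\])");
        List<string> tokens = new List<string>();

        foreach (string raw in re.Split(equation))
        {
            string token = raw.Trim();
            if (token.Length > 0)
            {
                tokens.Add(token);
            }
        }

        return tokens.ToArray();
    }
}
```

Called with "(y - 3)(3111*x^21 + x + 320)", this variant returns 16 tokens, beginning with (, y, -, 3, ) and containing no empty strings. Filtering after the split keeps the regular expression unchanged, at the cost of one extra pass over the tokens.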