Chapter 2. Using Regular Expressions

2.0. Introduction

Regular expressions are patterns that describe the text you want to find. For instance, in the last chapter, we looked for the substring Cookbook within a longer string:

var testValue = "This is the Cookbook's test string";
var subsValue = "Cookbook";

var iValue = testValue.indexOf(subsValue); // returns value of 12, index of substring

This code snippet worked because we were looking for an exact match. But what if we wanted a more general search? For instance, what if we want to find the words Cook and Book in strings such as “Joe’s Cooking Book” or “JavaScript Cookbook”?

When we’re looking for strings that match a pattern rather than an exact string, we need to use regular expressions. We can try to make do with String functions, but in the end, it’s actually simpler to use regular expressions, though the syntax and format are a little odd and not necessarily “user friendly.”

Recently, I was looking at code that pulled the RGB values from a string, in order to convert the color to its hexadecimal format. It’s tempting to just use the String split method and split on the commas, but then we have to strip out the parentheses and extraneous whitespace. Another consideration is how we can be sure that the values are given as numbers from 0 to 255, and not as percentages. Rather than:

rgb (255, 0, 0)

we might find:

rgb (100%, 0, 0)

There’s an additional problem: some browsers return a color, such as a background color, as an RGB value, others as a hexadecimal. You need to be able to handle both when building a consistent conversion routine.
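To make the problem concrete, here is a minimal sketch of how a single regular expression might handle both the numeric and the percentage forms shown above. The function name rgbToHex is hypothetical, and the sketch assumes whole-number values and ignores hexadecimal and named colors; it is not the routine I was reading.

```javascript
// Hypothetical helper: converts "rgb(255, 0, 0)" or "rgb(100%, 0, 0)" to "#ff0000".
// Returns null when the string is not an rgb() value.
function rgbToHex(color) {
  // "rgb", optional whitespace, then three numbers, each optionally followed by %
  var match = /^rgb\s*\(\s*(\d+)(%?)\s*,\s*(\d+)(%?)\s*,\s*(\d+)(%?)\s*\)$/.exec(color);
  if (!match) return null;

  var hex = "#";
  // captured numbers are at positions 1, 3, 5; the optional % signs at 2, 4, 6
  for (var i = 1; i <= 5; i += 2) {
    var value = parseInt(match[i], 10);
    if (match[i + 1] === "%") {
      value = Math.round(value * 255 / 100); // scale percentage to 0-255
    }
    value = Math.min(255, value);            // clamp out-of-range input
    var digits = value.toString(16);
    hex += digits.length === 1 ? "0" + digits : digits;
  }
  return hex;
}

console.log(rgbToHex("rgb (255, 0, 0)"));
console.log(rgbToHex("rgb(100%, 0, 0)"));
```

The capturing parentheses do the work that split and a series of string cleanups would otherwise need: the whitespace and parentheses are matched but never captured.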

In the end, it’s a set of regular expressions that enable us to solve what, at first, seems to be a trivial problem, but ends up being much more complicated. In an example from the popular jQuery UI library, regular expressions are used to match color values—a complicated task because the color values can take on many different formats, as this portion of the routine demonstrates:

// Look for #a0b1c2
if (result = /#([a-fA-F0-9]{2})([a-fA-F0-9]{2})([a-fA-F0-9]{2})/.exec(color))
    return [parseInt(result[1],16), parseInt(result[2],16), parseInt(result[3],16)];

// Look for #fff
if (result = /#([a-fA-F0-9])([a-fA-F0-9])([a-fA-F0-9])/.exec(color))
    return [parseInt(result[1]+result[1],16), parseInt(result[2]+result[2],16),
            parseInt(result[3]+result[3],16)];

// Look for rgba(0, 0, 0, 0) == transparent in Safari 3
if (result = /rgba\(0, 0, 0, 0\)/.exec(color))
    return colors['transparent'];

// Otherwise, we're most likely dealing with a named color
return colors[$.trim(color).toLowerCase()];

Though the regular expressions seem complex, they’re really nothing more than a way to describe a pattern. In JavaScript, regular expressions are managed through the RegExp object.

A RegExp Literal

As with String in Chapter 1, RegExp can be both a literal and an object. To create a RegExp literal, you use the following syntax:

var re = /regular expression/;

The regular expression pattern is contained between opening and closing forward slashes. Note that this pattern is not a string: you do not want to use single or double quotes around the pattern, unless the quotes themselves are part of the pattern to match.

Regular expressions are made up of characters, either alone or in combination with special characters, that provide for more complex matching. For instance, the following is a regular expression for a pattern that matches against a string that contains the word Shelley and the word Powers, in that order, and separated by one or more whitespace characters:

var re = /Shelley\s+Powers/;

The special characters in this example are the backslash (\) and the plus sign (+). The backslash has two purposes: used with a regular character, it designates that the pair is a special character; used with a special character, such as the plus sign (+), it designates that the character should be treated literally. In this case, the backslash is used with s, which transforms the letter s into a special character that matches a single whitespace character, such as a space, tab, line feed, or form feed. The \s special character is followed by the plus sign, \s+, which is a signal to match the preceding character (in this example, a whitespace character) one or more times. This regular expression would work with the following:

Shelley Powers

It would also work with the following:

Shelley     Powers

It would not work with:

ShelleyPowers

It doesn’t matter how much whitespace is between Shelley and Powers, because of the use of \s+. However, the use of the plus sign does require at least one whitespace character.
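As a quick check of the cases just described, the following sketch runs the pattern against each string using the RegExp test method (covered in Recipe 2.1). The tab-separated sample is my own addition, and console.log is used instead of alert so the snippet runs outside a browser.

```javascript
var re = /Shelley\s+Powers/;

var samples = [
  "Shelley Powers",      // one space
  "Shelley     Powers",  // several spaces
  "Shelley\tPowers",     // a tab also counts as whitespace
  "ShelleyPowers"        // no whitespace at all
];

var results = [];
for (var i = 0; i < samples.length; i++) {
  results.push(re.test(samples[i]));
}

console.log(results); // [true, true, true, false]
```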

Table 2-1 shows the most commonly used special characters in JavaScript applications.

Table 2-1. Regular expression special characters

Character  Matches                                                              Example
^          Matches beginning of input                                           /^This/ matches “This is...”
$          Matches end of input                                                 /end$/ matches “This is the end”
*          Matches zero or more times                                           /se*/ matches “seeee” as well as “se”
?          Matches zero or one time                                             /ap?/ matches “apple” and “and”
+          Matches one or more times                                            /ap+/ matches “apple” but not “and”
{n}        Matches exactly n times                                              /ap{2}/ matches “apple” but not “apie”
{n,}       Matches n or more times                                              /ap{2,}/ matches all p’s in “apple” and “appple” but not “apie”
{n,m}      Matches at least n, at most m times                                  /ap{2,4}/ matches four p’s in “apppppple”
.          Any character except newline                                         /a.e/ matches “ape” and “axe”
[...]      Any character within brackets                                        /a[px]e/ matches “ape” and “axe” but not “ale”
[^...]     Any character but those within brackets                              /a[^px]/ matches “ale” but not “axe” or “ape”
\b         Matches on word boundary                                             /\bno/ matches the first “no” in “nono”
\B         Matches on nonword boundary                                          /\Bno/ matches the second “no” in “nono”
\d         Digits from 0 to 9                                                   /\d{3}/ matches 123 in “Now in 123”
\D         Any nondigit character                                               /\D{2,4}/ matches “Now ” in “Now in 123”
\w         Matches word character (letters, digits, underscores)                /\w/ matches “j” in “javascript”
\W         Matches any nonword character (not letters, digits, or underscores)  /\W/ matches “%” in “100%”
\n         Matches a line feed
\s         A single whitespace character
\S         A single character that is not whitespace
\t         A tab
(x)        Capturing parentheses                                                Remembers the matched characters
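A few of these special characters can be tried out directly. The following is a small sketch, using the String match and search methods and the RegExp test method, that checks some of the table's examples. The variable names and the second sample sentence are only for illustration.

```javascript
// $ anchors the match to the end of the input
var endsWithEnd = /end$/.test("This is the end");   // true
var endInMiddle = /end$/.test("The end is near");   // false

// {2,4} is greedy: it takes as many p's as it can, up to four
var greedyPs = "apppppple".match(/ap{2,4}/)[0];     // "apppp"

// \B only matches where there is no word boundary, so the first "no" is skipped
var secondNo = "nono".search(/\Bno/);               // 2

// \d and \W
var digits = "Now in 123".match(/\d{3}/)[0];        // "123"
var nonword = "100%".match(/\W/)[0];                // "%"

console.log(endsWithEnd, endInMiddle, greedyPs, secondNo, digits, nonword);
```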

RegExp As Object

The RegExp is a JavaScript object as well as a literal, so it can also be created using a constructor, as follows:

var re = new RegExp("Shelley\\s+Powers"); // backslash is doubled inside a string

When to use which? The RegExp literal is compiled when the script is evaluated, so you should use a RegExp literal when you know the expression won’t change. A compiled version is more efficient. Use the constructor when the expression changes or is going to be built or provided at runtime.
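One thing to watch for with the constructor is that the pattern is an ordinary string, so every backslash in the pattern has to be doubled. The following sketch builds a pattern at runtime and shows what happens when the backslash isn't doubled. The makeNameMatcher function is a hypothetical helper, and it assumes the two names contain no special regular expression characters.

```javascript
// Hypothetical helper: builds a pattern from two names supplied at runtime
function makeNameMatcher(first, last) {
  // "\\s+" in a string literal produces the characters \s+ in the pattern
  return new RegExp(first + "\\s+" + last);
}

var re = makeNameMatcher("Shelley", "Powers");
console.log(re.source);                      // Shelley\s+Powers
console.log(re.test("Shelley   Powers"));    // true

// With a single backslash, the string literal silently drops it
var broken = new RegExp("Shelley\s+Powers");
console.log(broken.source);                  // Shelleys+Powers
console.log(broken.test("Shelley Powers"));  // false
```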

As with other JavaScript objects, RegExp has several properties and methods, the most common of which are demonstrated throughout this chapter.

Note

Regular expressions are powerful but can be tricky. This chapter is more an introduction to how regular expressions work in JavaScript than to regular expressions in general. If you want to learn more about regular expressions, I recommend the excellent Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan (O’Reilly).

See Also

The jQuery function shown in the first section is a conversion of a jQuery internal function incorporated into a custom jQuery plug-in. jQuery is covered in more detail in Chapter 17, and a jQuery plug-in is covered in Recipe 17.7.

2.1. Testing Whether a Substring Exists

Problem

You want to test whether a string is contained in another string.

Solution

Use a JavaScript regular expression to define a search pattern, and then apply the pattern against the string to be searched, using the RegExp test method. In the following, we want to match with any string that has the two words, Cook and Book, in that order:

var cookbookString = new Array();

cookbookString[0] = "Joe's Cooking Book";
cookbookString[1] = "Sam's Cookbook";
cookbookString[2] = "JavaScript CookBook";
cookbookString[3] = "JavaScript BookCook";

// search pattern
var pattern = /Cook.*Book/;
for (var i = 0; i < cookbookString.length; i++)
  alert(cookbookString[i] + " " + pattern.test(cookbookString[i]));

The first and third strings have a positive match, while the second and fourth do not.

Discussion

The RegExp test method takes one parameter: the string to test. It applies the regular expression against the string and returns true if there’s a match, false if there is no match.

In the example, the pattern is the word Cook appearing somewhere in the string, and the word Book appearing anywhere in the string after Cook. There can be any number of characters between the two words, including no characters, as designated in the pattern by the two regular expression characters: the period (.), and the asterisk (*).

The period in regular expressions is a special character that matches any character except the newline character. In the example pattern, the period is followed by the asterisk, which matches the preceding character zero or more times. Combined, they generate a pattern matching zero or more of any character, except newline.

In the example, the first and third strings match, because they both match the pattern of Cook and Book with anything in between. The fourth does not, because the Book comes before Cook in the string. The second also doesn’t match, because the first letter of book is lowercase rather than uppercase, and the matching pattern is case-dependent.
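The "except newline" restriction is easy to overlook. The following sketch shows the same pattern failing when a line feed sits between the two words, along with one common workaround: a character class that accepts both whitespace and nonwhitespace characters. The sample strings are assumptions for the demonstration.

```javascript
var pattern = /Cook.*Book/;

// .* can match zero characters, so the two words may be adjacent
var adjacent = pattern.test("CookBook");         // true

// . does not match a line feed, so the words can't be on separate lines
var splitLines = pattern.test("Cook\nBook");     // false

// [\s\S] matches any character at all, including newline
var anyChar = /Cook[\s\S]*Book/;
var splitLinesOk = anyChar.test("Cook\nBook");   // true

console.log(adjacent, splitLines, splitLinesOk);
```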

2.2. Testing for Case-Insensitive Substring Matches

Problem

You want to test whether a string is contained in another string, but you don’t care about the case of the characters in either string.

Solution

When creating the regular expression, use the ignore case flag (i):

var cookbookString = new Array();

cookbookString[0] = "Joe's Cooking Book";
cookbookString[1] = "Sam's Cookbook";
cookbookString[2] = "JavaScript CookBook";
cookbookString[3] = "JavaScript cookbook";

// search pattern
var pattern = /Cook.*Book/i;
for (var i = 0; i < cookbookString.length; i++) {
  alert(cookbookString[i] + " " + pattern.test(cookbookString[i]));
}

All four strings match the pattern.

Discussion

The solution uses a regular expression flag (i) to modify the constraints on the pattern-matching. In this case, the flag removes the constraint that the pattern-matching has to match by case. Using this flag, values of book and Book would both match.

There are only a few regular expression flags, as shown in Table 2-2. They can be used with RegExp literals:

var pattern = /Cook.*Book/i; // the 'i' is the ignore case flag

They can also be used when creating a RegExp object, via the optional second parameter:

var pattern = new RegExp("Cook.*Book","i");

Table 2-2. Regular expression flags

Flag  Meaning
g     Global match: matches across an entire string, rather than stopping at first match
i     Ignores case
m     Applies begin and end line special characters (^ and $, respectively) to each line in a multiline string
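The flags can be combined. The following is a brief sketch of how each one changes the result of the String match method, using a made-up three-line string.

```javascript
var text = "cook one\nCook two\ncook three";

// without g, match stops at the first match
var first = text.match(/cook/i);
console.log(first[0]);            // "cook"

// g finds every match; i ignores the difference between cook and Cook
var all = text.match(/cook/gi);
console.log(all.length);          // 3

// without m, ^ only matches at the very beginning of the string
var withoutM = text.match(/^cook/gi);
console.log(withoutM.length);     // 1

// with m, ^ matches at the beginning of each line
var lineStarts = text.match(/^cook/gim);
console.log(lineStarts.length);   // 3
```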

2.3. Validating a Social Security Number

Problem

You need to validate whether a text string is a valid U.S.-based Social Security number (the identifier the tax people use to find us, here in the States).

Solution

Use the String match method and a regular expression to validate that a string is a Social Security number:

var ssn = document.getElementById("pattern").value;
var pattern = /^\d{3}-\d{2}-\d{4}$/;
if (ssn.match(pattern))
  alert("OK");
else
  alert("Not OK");

Discussion

A U.S.-based Social Security number is a combination of nine digits, typically in a sequence of three digits, two digits, and four digits, with or without dashes in between.

The numbers in a Social Security number can be matched with the digit special character (\d). To look for a set number of digits, you can use the curly brackets surrounding the number of expected digits. In the example, the first three digits are matched with:

\d{3}

The second two sets of numbers can be defined using the same criteria. Since there’s only one dash between the sequences of digits, it can be given without any special character. However, if there’s a possibility the string will have a Social Security number without the dashes, you’d want to change the regular expression pattern to:

var pattern = /^\d{3}-?\d{2}-?\d{4}$/;

The question mark special character (?) matches zero or exactly one of the preceding character—in this case, the dash (-). With this change, the following would match:

444-55-3333

As would the following:

555335555

But not the following, which has too many dashes:

555---60--4444

One other characteristic to check is whether the string consists of the Social Security number, and only the Social Security number. The beginning-of-input special character (^) is used to indicate that the Social Security number begins at the beginning of the string, and the end-of-input special character ($) is used to indicate that the string ends at the end of the Social Security number.

Since we’re only interested in verifying that the string is a validly formatted Social Security number, we’re using the String object’s match method. We could also have used the RegExp test method, but six of one, half dozen of the other; both approaches are acceptable.

There are other approaches to validating a Social Security number that are more complex, based on the principle that Social Security numbers can be given with spaces instead of dashes. That’s why most websites asking for a Social Security number provide three different input fields, in order to eliminate the variations. Regular expressions should not be used in place of good form design.

In addition, there is no way to actually validate that the number given is an actual Social Security number, unless you have more information about the person, and a database with all Social Security numbers. All you’re doing with the regular expression is verifying the format of the number.
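Pulling the discussion together, here is a sketch of a format check that uses the RegExp test method and accepts a dash, a space, or nothing between the groups of digits. The function name isSsnFormat is hypothetical. Note that, as written, the pattern also accepts mixed separators such as a dash followed by a space, and, as just mentioned, it only checks the format.

```javascript
// Hypothetical helper: true if the value is formatted like a Social Security number
function isSsnFormat(value) {
  // [- ]? allows one dash, one space, or nothing between the groups
  return /^\d{3}[- ]?\d{2}[- ]?\d{4}$/.test(value);
}

console.log(isSsnFormat("444-55-3333"));    // true
console.log(isSsnFormat("555335555"));      // true
console.log(isSsnFormat("444 55 3333"));    // true
console.log(isSsnFormat("555---60--4444")); // false
```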

See Also

One site that provides some of the more complex Social Security number regular expressions, in addition to many other interesting regular expression “recipes,” is the Regular Expression Library.

2.4. Finding and Highlighting All Instances of a Pattern

Problem

You want to find all instances of a pattern within a string.

Solution

Use the RegExp exec method and the global flag (g) in a loop to locate all instances of a pattern, such as any word that begins with t and ends with e, with any number of characters in between:

var searchString = "Now is the time and this is the time and that is the time";
var pattern = /t\w*e/g;
var matchArray;

var str = "";
while((matchArray = pattern.exec(searchString)) != null) {
  str+="at " + matchArray.index + " we found " + matchArray[0] + "<br />";
}
document.getElementById("results").innerHTML=str;

Discussion

The RegExp exec method executes the regular expression, returning null if a match is not found, or an array of information if a match is found. Included in the returned array are the actual matched value, the index in the string where the match is found, any parenthetical substring matches, and the original string:

index                            The index of the located match
input                            The original input string
[0] or accessing array directly  The matched value
[1],...,[n]                      Parenthetical substring matches

In the solution, the index where the match was found is printed out in addition to the matched value.

The solution also uses the global flag (g). This triggers the RegExp object to preserve the location of each match, and to begin the search after the previously discovered match. When used in a loop, we can find all instances where the pattern matches the string. In the solution, the following are printed out:

at 7 we found the
at 11 we found time
at 28 we found the
at 32 we found time
at 49 we found the
at 53 we found time

Both time and the match the pattern.

Let’s look at the nature of global searching in action. In Example 2-1, a web page is created with a textarea and an input text box for accessing both a search string and a pattern. The pattern is used to create a RegExp object, which is then applied against the string. A result string is built, consisting of both the unmatched text and the matched text, except the matched text is surrounded by a span element, with a CSS class used to highlight the text. The resulting string is then inserted into the page, using the innerHTML for a div element.

Example 2-1. Using exec and global flag to search and highlight all matches in a text string
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Searching for strings</title>
<style type="text/css">
#searchSubmit
{
   background-color: #ff0;
   width: 200px;
   text-align: center;
   padding: 10px;
   border: 2px inset #ccc;
}
.found
{
   background-color: #ff0;
}
</style>
<script type="text/javascript">
//<![CDATA[

window.onload=function() {
   document.getElementById("searchSubmit").onclick=doSearch;
}

function doSearch() {
   // get pattern
   var pattern = document.getElementById("pattern").value;
   var re = new RegExp(pattern,"g");

   // get string
   var searchString = document.getElementById("incoming").value;

   var matchArray;
   var resultString = "<pre>";
   var first=0; var last=0;

   // find each match
   while((matchArray = re.exec(searchString)) != null) {
     last = matchArray.index;
     // get all of string up to match, concatenate
     resultString += searchString.substring(first, last);

     // add matched, with class
     resultString += "<span class='found'>" + matchArray[0] + "</span>";
     first = re.lastIndex;
   }

   // finish off string
   resultString += searchString.substring(first,searchString.length);
   resultString += "</pre>";

   // insert into page
   document.getElementById("searchResult").innerHTML = resultString;
}

//]]>
</script>
</head>
<body>
<form id="textsearch">
<textarea id="incoming" cols="150" rows="10">
</textarea>
<p>
Search pattern: <input id="pattern" type="text" /></p>
</form>
<p id="searchSubmit">Search for pattern</p>
<div id="searchResult"></div>
</body>
</html>

Figure 2-1 shows the application in action on William Wordsworth’s poem, “The Kitten and the Falling Leaves,” after a search for the following pattern:

lea(f|ve)

The bar (|) is the alternation character: the pattern matches if the text on either side of the bar matches. So a word like leaf matches, as does a word like leave, but not a word like leap.
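The following sketch tries the pattern against a handful of words. The word list and the sample phrase are my own, chosen to show one miss.

```javascript
var pattern = /lea(f|ve)/;
var words = ["leaf", "leaves", "leave", "leap"];

var matched = [];
for (var i = 0; i < words.length; i++) {
  if (pattern.test(words[i])) {
    matched.push(words[i]);
  }
}
console.log(matched); // ["leaf", "leaves", "leave"]

// the parentheses also capture which alternative was found
var captured = pattern.exec("falling leaves")[1];
console.log(captured); // "ve"
```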

Figure 2-1. Application finding and highlighting all matched strings

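When the global flag (g) is used, the RegExp’s lastIndex property holds the index immediately following the most recent match, which is where the next search begins. Example 2-1 uses lastIndex to mark where the unmatched text resumes after each highlighted match.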
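To see how the index and lastIndex values relate, the following sketch records both for each match in a shortened version of the earlier search string.

```javascript
var pattern = /t\w*e/g;
var searchString = "Now is the time";

var positions = [];
var match;
while ((match = pattern.exec(searchString)) !== null) {
  // match.index is where the match starts; lastIndex is just past its end
  positions.push(match[0] + ": " + match.index + "-" + pattern.lastIndex);
}

console.log(positions);          // ["the: 7-10", "time: 11-15"]

// once exec returns null, lastIndex is reset to the start of the string
console.log(pattern.lastIndex);  // 0
```

Because lastIndex is stored on the RegExp object itself, reusing a global pattern on a different string before the loop has finished can give surprising results.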

See Also

Recipe 2.5 describes how to implement standard find-and-replace behavior, and Recipe 2.6 provides a simpler approach to finding and highlighting text in a string.

2.5. Replacing Patterns with New Strings

Problem

You want to replace all matched substrings with a new substring.

Solution

Use the String object’s replace method, with a regular expression:

var searchString = "Now is the time, this is the time";
var re = /t\w{2}e/g;
var replacement = searchString.replace(re, "place");
alert(replacement); // Now is the place, this is the place

Discussion

In Example 2-1 in Recipe 2.4, we used the RegExp global flag (g) in order to track each occurrence of the regular expression. Each match was highlighted using a span element and CSS.

A global search is also handy for a typical find-and-replace behavior. Using the global flag (g) with the regular expression in combination with the String replace method will replace all instances of the matched text with the replacement string.
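The difference the flag makes is easiest to see side by side. The following sketch runs the solution's pattern with and without the global flag.

```javascript
var searchString = "Now is the time, this is the time";

// without g, only the first match is replaced
var firstOnly = searchString.replace(/t\w{2}e/, "place");
console.log(firstOnly); // Now is the place, this is the time

// with g, every match is replaced
var all = searchString.replace(/t\w{2}e/g, "place");
console.log(all);       // Now is the place, this is the place

// replace returns a new string; the original is unchanged
console.log(searchString);
```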

See Also

Recipe 2.6 demonstrates variations of using regular expressions with the String replace method.

2.6. Swap Words in a String Using Capturing Parentheses

Problem

You want to accept an input string with first and last name, and swap the names so the last name is first.

Solution

Use capturing parentheses and a regular expression to find and remember the two names in the string, and reverse them:

var name = "Abe Lincoln";
var re = /^(\w+)\s(\w+)$/;
var newname = name.replace(re,"$2, $1");

Discussion

Capturing parentheses allow us to not only match specific patterns in a string, but to reference the matched substrings at a later time. The matched substrings are referenced numerically, from left to right, as represented by the use of “$1” and “$2” in the String replace method.

In the solution, the regular expression matches two words, separated by a space. Capturing parentheses were used with both words, so the first name is accessible using “$1”, the last name with “$2”.

Capturing parentheses aren’t the only special patterns available with the String replace method. Table 2-3 shows the other special patterns that can be used with regular expressions and replace.

Table 2-3. String.replace special patterns

Pattern  Purpose
$$       Allows a literal dollar sign ($) in replacement
$&       Inserts matched substring
$`       Inserts portion of string before match
$'       Inserts portion of string after match
$n       Inserts nth captured parenthetical value when using RegExp
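Each of these patterns can be tried with a one-line replace. The following sketch uses short made-up strings so the inserted portions are easy to spot.

```javascript
// $$ inserts a literal dollar sign, and $& inserts the matched text
var price = "cost: 25".replace(/\d+/, "$$$&");
console.log(price);   // cost: $25

// $` inserts everything before the match
var before = "a-b-c".replace(/b/, "[$`]");
console.log(before);  // a-[a-]-c

// $' inserts everything after the match
var after = "a-b-c".replace(/b/, "[$']");
console.log(after);   // a-[-c]-c

// $1 and $2 insert the captured values, in order
var swapped = "Abe Lincoln".replace(/^(\w+)\s(\w+)$/, "$2, $1");
console.log(swapped); // Lincoln, Abe
```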

The second table entry, which reinserts the matched substring, can be used to provide a simplified version of the Example 2-1 application in Recipe 2.4. That example found and provided markup and CSS to highlight the matched substring. It used a loop to find and replace all entries, but in Example 2-2 we’ll use the String replace method with the matched substring special pattern ($&):

Example 2-2. Using String.replace and special pattern to find and highlight text in a string
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Searching for strings</title>
<style>
#searchSubmit
{
   background-color: #ff0;
   width: 200px;
   text-align: center;
   padding: 10px;
   border: 2px inset #ccc;
}
.found
{
   background-color: #ff0;
}
</style>
<script>
//<![CDATA[

window.onload=function() {
   document.getElementById("searchSubmit").onclick=doSearch;
}

function doSearch() {
   // get pattern
   var pattern = document.getElementById("pattern").value;
   var re = new RegExp(pattern,"g");

   // get string
   var searchString = document.getElementById("incoming").value;

   // replace
   var resultString = searchString.replace(re,"<span class='found'>$&</span>");

   // insert into page
   document.getElementById("searchResult").innerHTML = resultString;
}

//]]>
</script>
</head>
<body>
<form id="textsearch">
<textarea id="incoming" cols="100" rows="10">
</textarea>
<p>
Search pattern: <input id="pattern" type="text" /></p>
</form>
<p id="searchSubmit">Search for pattern</p>
<div id="searchResult"></div>
</body>
</html>

This is a simpler alternative, but as Figure 2-2 shows, this technique doesn’t quite preserve all aspects of the original string. The line feeds aren’t preserved with Example 2-2, but they are with Example 2-1, which wraps its result in a pre element.

The captured text can also be accessed via the RegExp object when you use the RegExp exec method. Now let’s return to the Recipe 2.6 solution code, but this time using the RegExp’s exec method:

var name = "Shelley Powers";
var re = /^(\w+)\s(\w+)$/;
var result = re.exec(name);
var newname = result[2] + ", " + result[1];

This approach is handy if you want to access the capturing parentheses values, but without having to use them within a string replacement. To see another example of using capturing parentheses, Recipe 1.7 demonstrated a couple of ways to access the list of items in the following sentence, using the String split method:

var sentence = "This is one sentence. This is a sentence with a list of items: " +
               "cherries, oranges, apples, bananas.";

Another approach is the following, using capturing parentheses, and the RegExp exec method:

var re = /:(.*)\./;
var result = re.exec(sentence);
var list = result[1]; // cherries, oranges, apples, bananas
Figure 2-2. Using Example 2-2 to find and highlight text in a string

2.7. Using Regular Expressions to Trim Whitespace

Problem

Before sending a string to the server via an Ajax call, you want to trim whitespace from the beginning and end of the string.

Solution

Prior to the new ECMAScript 5 specification, you could use a regular expression to trim whitespace from the beginning and end of a string:

var testString = "   this is the string    ";

// trim white space from the beginning
testString = testString.replace(/^\s+/,"");

// trim white space from the end
testString = testString.replace(/\s+$/,"");

Beginning with ECMAScript 5, the String object now has a trim method:

var testString = "    this is the string    ";
testString = testString.trim(); // white space trimmed

Discussion

String values retrieved from form elements can sometimes have whitespace before and after the actual form value. You don’t usually want to send the string with the extraneous whitespace, so you’ll use a regular expression to trim the string.

Beginning with ECMAScript 5, there’s now a String trim method. However, until ECMAScript 5 has wider use, you’ll want to check to see if the trim method exists, and if not, use the old regular expression method as a fail-safe method.
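The check described here might look like the following sketch. The function names regexTrim and trimString are hypothetical, and the fallback is simply the regular expression approach from the solution.

```javascript
// the regular expression approach from the solution
function regexTrim(str) {
  return str.replace(/^\s+/, "").replace(/\s+$/, "");
}

// use the built-in trim when it exists, otherwise fall back to the regular expressions
function trimString(str) {
  if (typeof str.trim === "function") {
    return str.trim();
  }
  return regexTrim(str);
}

var testString = "   this is the string    ";
console.log("[" + trimString(testString) + "]"); // [this is the string]
```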

In addition, there is no left or right trim in ECMAScript 5, though there are nonstandard versions of these methods in some browsers, such as Firefox. So if you want left- or right-only trim, you’ll want to create your own functions:

function leftTrim(str) {
   return str.replace(/^\s+/,"");
}
function rightTrim(str) {
   return str.replace(/\s+$/,"");
}

2.8. Replace HTML Tags with Named Entities

Problem

You want to paste example markup into a web page, and escape the markup—have the angle brackets print out rather than have the contents parsed.

Solution

Use regular expressions to convert angle brackets (<>) into the named entities &lt; and &gt;:

var pieceOfHtml = "<p>This is a <span>paragraph</span></p>";
pieceOfHtml = pieceOfHtml.replace(/</g,"&lt;");
pieceOfHtml = pieceOfHtml.replace(/>/g,"&gt;");
document.getElementById("searchResult").innerHTML = pieceOfHtml;

Discussion

It’s not unusual to want to paste samples of markup into another web page. When the text is inserted as markup, the way to have it printed out as is, without having the browser parse it, is to convert all angle brackets into their equivalent named entities.

The process is simple with the use of regular expressions, using the regular expression global flag (g) and the String replace method, as demonstrated in the solution.
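If the markup might already contain ampersands, it's worth converting those too, and converting them first, so the ampersands in the newly inserted entities aren't converted a second time. The following sketch wraps the replacements in a function; the name escapeMarkup is hypothetical, and handling the ampersand is my addition to the solution.

```javascript
// Hypothetical helper: escapes markup so it displays as text
function escapeMarkup(html) {
  return html
    .replace(/&/g, "&amp;")  // must come first
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

var escaped = escapeMarkup("<p>This is a <span>paragraph</span> & more</p>");
console.log(escaped);
// &lt;p&gt;This is a &lt;span&gt;paragraph&lt;/span&gt; &amp; more&lt;/p&gt;
```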

2.9. Searching for Special Characters

Problem

We’ve searched for digits and letters, and for anything that isn’t a digit or a letter, but you also need to search for the special regular expression characters themselves.

Solution

Use the backslash to escape the pattern-matching character:

var re = /\\d/;
var pattern = "\\d{4}";
var pattern2 = pattern.replace(re,"\\D");

Discussion

In the solution, a regular expression is created that matches the literal text of the special character \d, which is used to match any digit. The pattern to be searched is itself escaped, because it’s given as a string. The digit special character is then replaced with \D, the special character that matches anything but a digit.

Sounds a little convoluted, so I’ll demonstrate with a longer application. Example 2-3 shows a small application that first searches for a sequence of four digits in a string, and replaces them with four asterisks (****). Next, the application will modify the search pattern, by replacing the \d with \D, and then running it against the same string.

Example 2-3. Regular expression matching on regular expression characters
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Replacement Insanity</title>
<script>
//<![CDATA[

window.onload=function() {

  // search for \d
  var re = /\\d/;
  var pattern = "\\d{4}";
  var str = "I want 1111 to find 3334 certain 5343 things 8484";
  var re2 = new RegExp(pattern,"g");
  var str1 = str.replace(re2,"****");
  alert(str1);
  var pattern2 = pattern.replace(re,"\\D");
  var re3 = new RegExp(pattern2,"g");
  var str2 = str.replace(re3, "****");
  alert(str2);
}
//]]>
</script>
</head>
<body>
<p>content</p>
</body>
</html>

Here is the original string:

I want 1111 to find 3334 certain 5343 things 8484

The first string printed out is the original string with the numbers converted into asterisks:

I want **** to find **** certain **** things ****

The second string printed out is the same string, but this time each run of four nondigit characters has been converted into asterisks:

****nt 1111******** 3334******** 5343********8484

Though this example is short, it demonstrates some of the challenges when you want to search on regular expression characters themselves.
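A related challenge comes up when the text to search for is provided at runtime and might contain special characters. One common approach, sketched here, is to put a backslash in front of every special character before passing the text to the RegExp constructor. The function name escapeRegExp and the sample strings are assumptions for the demonstration.

```javascript
// Hypothetical helper: escapes the regular expression special characters in text
function escapeRegExp(text) {
  // $& reinserts the matched character, so each one gains a leading backslash
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

var userText = "$5.00";

// unescaped, $ means end of input and . means any character
var unescaped = new RegExp(userText);
console.log(unescaped.test("cost $5.00")); // false

// escaped, the pattern matches the literal text only
var escaped = new RegExp(escapeRegExp(userText));
console.log(escaped.source);               // \$5\.00
console.log(escaped.test("cost $5.00"));   // true
console.log(escaped.test("cost $5x00"));   // false
```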
