Now it’s time to take a brief detour on our trip through Java and
enter the land of regular expressions. A regular
expression, or regex for short, describes a text pattern. Regular
expressions are used with many tools—including the java.util.regex
package,
text editors, and many scripting languages—to provide sophisticated
text-searching and powerful string-manipulation capabilities.
If you are already familiar with the concept of regular expressions and how they are used with other languages, you may wish to skim through this section. At the very least, you’ll need to look at the “The java.util.regex API” section later in this chapter, which covers the Java classes necessary to use them. On the other hand, if you’ve come to this point on your Java journey with a clean slate on this topic and you’re wondering exactly what regular expressions are, then pop open your favorite beverage and get ready. You are about to learn about the most powerful tool in the arsenal of text manipulation and what is, in fact, a tiny language within a language, all in the span of a few pages.
A regular expression describes a pattern in text. By pattern, we mean just about any feature you can imagine identifying in text from the literal characters alone, without actually understanding their meaning. This includes features, such as words, word groupings, lines and paragraphs, punctuation, case, and more generally, strings and numbers with a specific structure to them, such as phone numbers, email addresses, and quoted phrases. With regular expressions, you can search the dictionary for all the words that have the letter “q” without its pal “u” next to it, or words that start and end with the same letter. Once you have constructed a pattern, you can use simple tools to hunt for it in text or to determine if a given string matches it. A regex can also be arranged to help you dismember specific parts of the text it matched, which you could then use as elements of replacement text if you wish.
Before moving on, we should say a few words about regular expression syntax in general. At the beginning of this section, we casually mentioned that we would be discussing a new language. Regular expressions do, in fact, constitute a simple form of programming language. If you think for a moment about the examples we cited earlier, you can see that something like a language is going to be needed to describe even simple patterns—such as email addresses—that have some variation in form.
A computer science textbook would classify regular expressions at the bottom of the hierarchy of computer languages, in terms of both what they can describe and what you can do with them. They are still capable of being quite sophisticated, however. As with most programming languages, the elements of regular expressions are simple, but they can be built up in combination to arbitrary complexity. And that is where things start to get sticky.
Since regexes work on strings, it is convenient to have a very compact notation that can be easily wedged between characters. But compact notation can be very cryptic, and experience shows that it is much easier to write a complex statement than to read it again later. Such is the curse of the regular expression. You may find that in a moment of late-night, caffeine-fueled inspiration, you can write a single glorious pattern to simplify the rest of your program down to one line. When you return to read that line the next day, however, it may look like Egyptian hieroglyphics to you. Simpler is generally better. If you can break your problem down and do it more clearly in several steps, maybe you should.
Now that you’re properly warned, we have to throw one more thing at you before we build you back up. Not only can the regex notation get a little hairy, but it is also somewhat ambiguous with ordinary Java strings. An important part of the notation is the escaped character, a character with a backslash in front of it. For example, the escaped d character, \d (backslash ‘d’), is shorthand that matches any single digit character (0-9). However, you cannot simply write \d as part of a Java string, because Java uses the backslash for its own special characters and to specify Unicode character sequences (\uxxxx). Fortunately, Java gives us a replacement: an escaped backslash, which is two backslashes (\\), means a literal backslash. The rule is, when you want a backslash to appear in your regex, you must escape it with an extra one:
"\\d" // Java string that yields backslash "d"
And just to make things crazier, because regex notation itself uses backslash to denote special characters, it must provide the same “escape hatch” as well, allowing you to double up backslashes if you want a literal backslash. So if you want to specify a regular expression that includes a single literal backslash, it looks like this:
"\\\\" // Java string yields two backslashes; regex yields one
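To make the two layers of escaping concrete, here is a minimal sketch. The class and constant names are ours, chosen only for illustration; the only library call is the standard Pattern.matches().

```java
import java.util.regex.Pattern;

class BackslashDemo {
    // Regex: \d   (any single digit)
    static final String DIGIT = "\\d";
    // Regex: \\   (one literal backslash)
    static final String BACKSLASH = "\\\\";

    public static void main(String[] args) {
        // Each Java literal above collapses to a two-character string.
        System.out.println(DIGIT.length());      // the characters \ and d
        System.out.println(BACKSLASH.length());  // the characters \ and \
        System.out.println(Pattern.matches(DIGIT, "7"));
        // The target "\\" is a Java string holding a single backslash.
        System.out.println(Pattern.matches(BACKSLASH, "\\"));
    }
}
```

The Java compiler removes one level of backslashes and the regex engine removes the other, which is why a single literal backslash costs four characters in source code.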
Most of the “magic” operator characters you read about in this section operate on the character that precedes them, so these also must be escaped if you want their literal meaning. This includes such characters as ., *, +, braces {}, and parentheses ().
If you need to create part of an expression that has lots of literal characters in it, you can use the special delimiters \Q and \E to help you. Any text appearing between \Q and \E is automatically escaped. (You still need the Java String escapes: double backslashes for a backslash, but not quadruple.) There is also a static method Pattern.quote(), which does the same thing, returning a properly escaped version of whatever string you give it.
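As a small sketch of literal quoting, the following compares Pattern.quote() with a hand-written \Q...\E block. The helper method containsLiteral() is our own invention for the example.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // True if text contains the literal string; no character in it is treated as special.
    static boolean containsLiteral(String text, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsLiteral("A rose is $1.99.", "$1.99"));
        // The dot is literal here, so an "x" in its place does not match.
        System.out.println(containsLiteral("A rose is $1x99.", "$1.99"));
        // Regex: \Q$1.99\E
        System.out.println(Pattern.compile("\\Q$1.99\\E").matcher("A rose is $1.99.").find());
    }
}
```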
Beyond that, our only suggestion to help maintain your sanity when working with these examples is to keep two copies: a comment line showing the naked regular expression, and the real Java string, where you must double up all backslashes.
Now, let’s dive into the actual regex syntax. The simplest form of a regular expression is plain, literal text, which has no special meaning and is matched directly (character for character) in the input. This can be a single character or more. For example, in the following string, the pattern “s” can match the character s in the words rose and is:
"A rose is $1.99."
The pattern “rose” can match only the literal word rose. But this isn’t very interesting. Let’s crank things up a notch by introducing some special characters and the notion of character “classes.”
- Any character: dot (.)
  The special character dot (.) matches any single character. The pattern “.ose” matches rose, nose, _ose (space followed by ose) or any other character followed by the sequence ose. Two dots match any two characters, and so on. The dot operator is not discriminating; it normally stops only for an end-of-line character (and, optionally, you can tell it not to; we discuss that later). We can consider “.” to represent the group or class of all characters. And regexes define more interesting character classes as well.
- Whitespace or nonwhitespace character: \s, \S
  The special character \s matches a literal-space character or one of the following characters: \t (tab), \r (carriage return), \n (newline), \f (formfeed), and \x0B (vertical tab). The corresponding special character \S does the inverse, matching any character except whitespace.
- Digit or nondigit character: \d, \D
  \d matches any of the digits 0-9. \D does the inverse, matching all characters except digits.
- Word or nonword character: \w, \W
  \w matches a “word” character, including upper- and lowercase letters A-Z, a-z, the digits 0-9, and the underscore character (_). \W matches everything except those characters.
You can define your own character classes using the notation [...]. For example, the following class matches any of the characters a, b, c, x, y, or z:
[abcxyz]
The special x-y range notation can be used as shorthand for a run of consecutive characters, such as the alphabetic characters. The following example defines a character class containing all upper- and lowercase letters:
[A-Za-z]
Placing a caret (^) as the first character inside the brackets inverts the character class. This example matches any character except uppercase A-F:
[^A-F] // G, H, I, ..., a, b, c, ... etc.
Nesting character classes simply adds them:
[A-F[G-Z]] // A-Z
The && logical AND notation can be used to take the intersection (characters in common):
[a-p&&[l-z]] // l, m, n, o, p
[A-Z&&[^P]] // A through Z except P
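A quick way to convince yourself of how these classes behave is to test single characters against them. The sketch below uses only String.matches(); the class name and the inOrOut() helper are ours.

```java
class CharClassDemo {
    // True if the one-character string c belongs to the character class.
    static boolean inOrOut(String charClass, String c) {
        return c.matches(charClass);
    }

    public static void main(String[] args) {
        System.out.println(inOrOut("[abcxyz]", "x"));      // listed explicitly
        System.out.println(inOrOut("[^A-F]", "B"));        // excluded by the caret
        System.out.println(inOrOut("[A-F[G-Z]]", "Q"));    // union of the two ranges
        System.out.println(inOrOut("[a-p&&[l-z]]", "m"));  // inside the intersection
        System.out.println(inOrOut("[a-p&&[l-z]]", "k"));  // in a-p but not in l-z
        System.out.println(inOrOut("[A-Z&&[^P]]", "P"));   // subtracted from A-Z
    }
}
```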
The pattern “[Aa] rose” (including an upper- or lowercase A) matches three times in the following phrase:
"A rose is a rose is a rose"
Position characters allow you to designate the relative location of a match. The most important are ^ and $, which match the beginning and end of a line, respectively:
^[Aa] rose // matches "A rose" at the beginning of line
[Aa] rose$ // matches "a rose" at end of line
By default, ^ and $ match the beginning and end of “input,” which is often a line. If you are working with multiple lines of text and wish to match the beginnings and endings of lines within a single large string, you can turn on “multiline” mode as described later in this chapter.
The position markers \b and \B match a word boundary or nonword boundary, respectively. For example, the following pattern matches rose and rosemary, but not primrose:
\brose
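The position markers are easy to try out with a tiny helper. This is only a sketch: the found() method is our own shorthand for compiling a pattern and calling find() once.

```java
import java.util.regex.Pattern;

class BoundaryDemo {
    // True if the regex matches anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Regex: \brose
        System.out.println(found("\\brose", "a rosemary bush"));   // word starts with rose
        System.out.println(found("\\brose", "a primrose path"));   // rose is mid-word here
        // Regex: \Brose
        System.out.println(found("\\Brose", "a primrose path"));   // requires a nonboundary
        System.out.println(found("^[Aa] rose", "A rose is a rose"));
        System.out.println(found("[Aa] rose$", "A rose is a rose"));
    }
}
```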
Simply matching fixed character patterns would not get us very far. Next, we look at operators that count the number of occurrences of a character (or more generally, of a pattern, as we’ll see in Capture groups):
- Any (zero or more iterations): asterisk (*)
  Placing an asterisk (*) after a character or character class means “allow any number of that type of character”; in other words, zero or more. For example, the following pattern matches a digit with any number of leading zeros (possibly none):
  0*\d // match a digit with any number of leading zeros
- Some (one or more iterations): plus sign (+)
  The plus sign (+) means “one or more” iterations and is equivalent to XX* (pattern followed by pattern asterisk). For example, the following pattern matches a number with one or more digits, plus optional leading zeros:
  0*\d+ // match a number (one or more digits) with optional leading zeros
  It may seem redundant to match the zeros at the beginning of an expression because zero is a digit and is thus matched by the \d+ portion of the expression anyway. However, we’ll show later how you can pick apart the string using a regex and get at just the pieces you want. In this case, you might want to strip off the leading zeros and keep only the digits.
- Optional (zero or one iteration): question mark (?)
  The question mark operator (?) allows exactly zero or one iteration. For example, the following pattern matches a credit-card expiration date, which may or may not have a slash in the middle:
  \d\d/?\d\d // match four digits with an optional slash in the middle
- Range (between x and y iterations, inclusive): {x,y}
  The {x,y} curly-brace range operator is the most general iteration operator. It specifies a precise range to match. A range takes two arguments: a lower bound and an upper bound, separated by a comma. This regex matches any word with five to seven characters, inclusive:
  \b\w{5,7}\b // match words with at least 5 and at most 7 characters
- At least x or more iterations (y is infinite): {x,}
  If you omit the upper bound, simply leaving a dangling comma in the range, the upper bound becomes infinite. This is a way to specify a minimum of occurrences with no maximum.
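The iteration operators above can be checked one at a time with String.matches(), which requires the whole string to match. The following is a minimal sketch; the class name and the test strings are ours.

```java
class QuantifierDemo {
    // True if the entire input matches the regex.
    static boolean whole(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(whole("0*\\d", "0007"));            // zeros, then one digit
        System.out.println(whole("0*\\d+", "00042"));          // zeros, then digits
        System.out.println(whole("\\d\\d/?\\d\\d", "12/03"));  // slash present
        System.out.println(whole("\\d\\d/?\\d\\d", "1203"));   // slash absent
        System.out.println(whole("\\b\\w{5,7}\\b", "roses"));  // five characters
        System.out.println(whole("\\b\\w{5,7}\\b", "rose"));   // too short
        System.out.println(whole("a{2,}", "aaa"));             // at least two
    }
}
```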
Just as in logical or mathematical operations, parentheses can be used in regular expressions to make subexpressions or to put boundaries on parts of expressions. This power lets us extend the operators we’ve talked about to work not only on characters, but also on words or other regular expressions. For example:
(yada)+
Here we are applying the + (one or more) operator to the whole pattern yada, not just one character. It matches yada, yadayada, yadayadayada, and so on.
Using grouping, we can start building more complex expressions. For example, while many email addresses have a three-part structure (e.g., foo@bar.com), the domain name portion can, in actuality, contain an arbitrary number of dot-separated components. To handle this properly, we can use an expression like this one:
\w+@\w+(\.\w+)+ // Match an email address
This expression matches a word, followed by an @ symbol, followed by another word and then one or more literal dot-separated words, e.g., pat@pat.net, friend@foo.bar.com, or mate@foo.bar.co.uk.
In addition to basic grouping of operations, parentheses have an important, additional role: the text matched by each parenthesized subexpression can be separately retrieved. That is, you can isolate the text that matched each subexpression. There is then a special syntax for referring to each capture group within the regular expression by number. This important feature has two uses.
First, you can construct a regular expression that refers to the text it has already matched and uses this text as a parameter for further matching. This allows you to express some very powerful things. For example, we can show the dictionary example we mentioned in the introduction. Let’s find all the words that start and end with the same letter:
\b(\w)\w*\1\b // match words beginning and ending with the same letter
See the \1 in this expression? It’s a reference to the first capture group in the expression, (\w). References to capture groups take the form \n where n is the number of the capture group, counting from left to right. In this example, the first capture group matches a word character on a word boundary. Then we allow any number of word characters up to the special reference \1 (also followed by a word boundary). The \1 means “the value matched in capture group one.” Because these characters must be the same, this regex matches words that start and end with the same character.
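Here is a short sketch of the backreference at work. The sameEnds() helper and the sample sentence are our own; note that a one-letter word does not match, because \1 needs a second character to compare against.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackreferenceDemo {
    // Regex: \b(\w)\w*\1\b
    static final Pattern SAME_ENDS = Pattern.compile("\\b(\\w)\\w*\\1\\b");

    // Collects every word that starts and ends with the same character.
    static List<String> sameEnds(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = SAME_ENDS.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(sameEnds("dad paid for a level test"));
    }
}
```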
The second use of capture groups is in referring to the matched portions of text while constructing replacement text. We’ll show you how to do that a bit later when we talk about the Regular Expression API.
Capture groups can contain more than one character, of course, and you can have any number of groups. You can even nest capture groups. Next, we discuss exactly how they are numbered.
Capture groups are numbered, starting at 1, and moving from left to right, by counting the number of open parentheses it takes to reach them. The special group number 0 always refers to the entire expression match. For example, consider the following regex:
one ((two) (three (four)))
Matched against the text “one two three four”, it creates the following groups:
Group 0: one two three four
Group 1: two three four
Group 2: two
Group 3: three four
Group 4: four
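The numbering can be confirmed with the Matcher API covered later in this chapter. In this sketch, the groups() helper is our own; it returns the text of group 0 through the last group for a full match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberDemo {
    // Returns the text of groups 0..groupCount(), or an empty array if there is no match.
    static String[] groups(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] g = groups("one ((two) (three (four)))", "one two three four");
        for (int i = 0; i < g.length; i++) {
            System.out.println("Group " + i + ": " + g[i]);
        }
    }
}
```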
Before going on, we should note one more thing. So far in this section we’ve glossed over the fact that parentheses are doing double duty: creating logical groupings for operations and defining capture groups. What if the two roles conflict? Suppose we have a complex regex that uses parentheses to group subexpressions and to create capture groups? In that case, you can use the special noncapturing group operator (?:) to do logical grouping instead of using plain parentheses. You probably won’t need to do this often, but it’s good to know.
The vertical bar (|) operator denotes the logical OR operation, also called alternation or choice. The | operator does not operate on individual characters but instead applies to everything on either side of it. It splits the expression in two unless constrained by parentheses grouping. For example, a slightly naive approach to parsing dates might be the following:
\w+, \w+ \d+, \d+|\d\d/\d\d/\d\d\d\d // pattern 1 or pattern 2
In this expression, the left matches patterns such as Fri, Oct 12, 2001, and the right matches 10/12/2001.
The following regex might be used to match email addresses with one of three domains (net, edu, and gov):
\w+@[\w\.]*\.(net|edu|gov) // email address ending in .net, .edu, or .gov
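Because the alternation sits inside parentheses, it also forms a capture group, so we can ask which alternative matched. The sketch below assumes a hypothetical helper, domainOf(), that returns that group or null.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DomainDemo {
    // Regex: \w+@[\w\.]*\.(net|edu|gov)
    static final Pattern ADDRESS = Pattern.compile("\\w+@[\\w\\.]*\\.(net|edu|gov)");

    // Returns "net", "edu", or "gov" for a matching address, otherwise null.
    static String domainOf(String address) {
        Matcher m = ADDRESS.matcher(address);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(domainOf("pat@pat.net"));
        System.out.println(domainOf("dean@cs.school.edu"));
        System.out.println(domainOf("friend@foo.bar.com"));  // not one of the three domains
    }
}
```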
There are several special options that affect the way the regex engine performs its matching. These options can be applied in two ways:
- You can pass in one or more flags during the Pattern.compile() step (discussed later in this chapter).
- You can include a special block of code in your regex.
We’ll show the latter approach here. To do this, include one or more flags in a special block (?x), where x is the flag for the option we want to turn on. Generally, you do this at the beginning of the regex. You can also turn off flags by adding a minus sign (?-x), which allows you to apply flags to select parts of your pattern.
The following flags are available:
- Case-insensitive: (?i)
  The (?i) flag tells the regex engine to ignore case while matching, for example:
  (?i)yahoo // match Yahoo, yahoo, yahOO, etc.
- Dot all: (?s)
  The (?s) flag turns on “dot all” mode, allowing the dot character to match anything, including end-of-line characters. It is useful if you are matching patterns that span multiple lines. The s stands for “single-line mode,” a somewhat confusing name derived from Perl.
- Multiline: (?m)
  By default, ^ and $ don’t really match the beginning and end of lines (as defined by carriage return or newline combinations); they instead match the beginning or end of the entire input text. Turning on multiline mode with (?m) causes them to match the beginning and end of every line as well as the beginning and end of input. Specifically, this means the spot before the first character, the spot after the last character, and the spots just after and before line terminators inside the string.
- Unix lines: (?d)
  The (?d) flag limits the definition of the line terminator for the ^, $, and . special characters to Unix-style newline only (\n). By default, other terminators such as a lone carriage return (\r) and the carriage return-newline pair (\r\n) are also recognized.
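The effect of these flags is easiest to see side by side. In this sketch the count() helper and the sample strings are ours; the last line shows the equivalent Pattern.compile() flag constant for case-insensitivity.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagDemo {
    // Counts how many times the regex is found in the text.
    static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("YahOO".matches("(?i)yahoo"));   // case ignored
        System.out.println("a\nb".matches("a.b"));          // dot stops at the newline
        System.out.println("a\nb".matches("(?s)a.b"));      // dot-all lets it through
        String lines = "one\ntwo\nthree";
        System.out.println(count("^\\w+$", lines));         // anchors see one big input
        System.out.println(count("(?m)^\\w+$", lines));     // anchors see each line
        // The same option as a compile-time flag instead of an embedded block.
        System.out.println(
            Pattern.compile("yahoo", Pattern.CASE_INSENSITIVE).matcher("YAHOO").matches());
    }
}
```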
We’ve seen hints that regular expressions are capable of sorting some complex patterns. But there are cases where what should be matched is ambiguous (at least to us, though not to the regex engine). Probably the most important example has to do with the number of characters the iterator operators consume before stopping. The .* operation best illustrates this. Consider the following string:
"Now is the time for <bold>action</bold>, not words."
Suppose we want to search for all the HTML-style tags (the parts between the < and > characters), perhaps because we want to remove them.
We might naively start with this regex:
</?.*> // match <, optional /, and then anything up to >
We then get the following match, which is much too long:
<bold>action</bold>
The problem is that the .* operation, like all the iteration operators, is by default “greedy,” meaning that it consumes absolutely everything it can, up until the last match for the terminating character (in this case, >) in the file or line.
There are solutions for this problem. The first is to “say what it is”; that is, to be specific about what is allowed between the brackets. The content of an HTML tag cannot include just anything; for example, it cannot include a closing bracket (>). So we could rewrite our expression as:
</?\w*> // match <, optional /, any number of word characters, then >
But suppose the content is not so easy to describe. For example, we might be looking for quoted strings in text, which could include just about any text. In that case, we can use a second approach and “say what it is not.” We can invert our logic from the previous example and specify that anything except a closing bracket is allowed inside the brackets:
</?[^>]*>
This is probably the most efficient way to tell the regex engine what to do. It then knows exactly what to look for to stop reading. This approach has limitations, however. It is not obvious how to do this if the delimiter is more complex than a single character. It is also not very elegant.
Finally, we come to our general solution: the use of “reluctant” operators. For each of the iteration operators, there is an alternative, nongreedy form that consumes as few characters as possible, while still trying to get a match with what comes after it. This is exactly what we needed in our previous example.
Reluctant operators take the form of the standard operator with a “?” appended. (Yes, we know that’s confusing.) We can now write our regex as:
</?.*?> // match <, optional /, minimum number of any chars, then >
We have appended ? to .* to cause .* to match as few characters as possible while still making the final match of >. The same technique (appending the ?) works with all the iteration operators, as in the two following examples:
.+? // one or more, nongreedy
.{x,y}? // between x and y, nongreedy
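To compare the three approaches on the sample sentence, here is a minimal sketch. The firstMatch() helper is ours, and the final line uses String.replaceAll(), which accepts a regex, to strip the tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    static final String TEXT = "Now is the time for <bold>action</bold>, not words.";

    // Returns the first piece of TEXT matched by the regex, or null if none.
    static String firstMatch(String regex) {
        Matcher m = Pattern.compile(regex).matcher(TEXT);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("</?.*>"));      // greedy: runs to the last >
        System.out.println(firstMatch("</?[^>]*>"));   // "say what it is not"
        System.out.println(firstMatch("</?.*?>"));     // reluctant
        System.out.println(TEXT.replaceAll("</?.*?>", ""));
    }
}
```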
In order to understand our next topic, let’s return for a moment to the position marking characters (^, $, \b, and \B) that we discussed earlier. Think about what exactly these special markers do for us. We say, for example, that the \b marker matches a word boundary. But the word “match” here may be a bit too strong. In reality, it “requires” a word boundary to appear at the specified point in the regex. Suppose we didn’t have \b; how could we construct it? Well, we could try constructing a regex that matches the word boundary. It might seem easy, given the word and nonword character classes (\w and \W):
\w\W|\W\w // match the start or end of a word
But now what? We could try inserting that pattern into our regular expressions wherever we would have used \b, but it’s not really the same. We’re actually matching those characters, not just requiring them. This regular expression matches the two characters composing the word boundary in addition to whatever else matches afterward, whereas the \b operator simply requires the word boundary but doesn’t match any text. The distinction is that \b isn’t a matching pattern but a kind of lookahead. A lookahead is a pattern that is required to match next in the string, but is not consumed by the regex engine. When a lookahead pattern succeeds, the pattern moves on, and the characters are left in the stream for the next part of the pattern to use. If the lookahead fails, the match fails (or it backtracks and tries a different approach).
We can make our own lookaheads with the lookahead operator (?=). For example, to match the letter X at the end of a word, we could use:
(?=\w\W)X // Find X at the end of a word
Here the regex engine requires the \w\W pattern to match but not consume the characters, leaving them for the next part of the pattern. This effectively allows us to write overlapping patterns (like the previous example). For instance, we can match the word “Pat” only when it’s part of the word “Patrick,” like so:
(?=Patrick)Pat // Find Pat only in Patrick
Another operator, (?!), the negative lookahead, requires that the pattern not match. We can find all the occurrences of Pat not inside of a Patrick with this:
(?!Patrick)Pat // Find Pat never in Patrick
It’s worth noting that we could have written all of these examples in other ways, by simply matching a larger amount of text. For instance, in the first example we could have matched the whole word “Patrick.” But that is not as precise, and if we wanted to use capture groups to pull out the matched text or parts of it later, we’d have to play games to get what we want. For example, suppose we wanted to substitute something for Pat (say, change the font). We’d have to use an extra capture group and replace the text with itself. Using lookaheads is easier.
In addition to looking ahead in the stream, we can use the (?<=) and (?<!) lookbehind operators to look backward in the stream. For example, we can find my last name, but only when it refers to me:
(?<=Pat )Niemeyer // Niemeyer, only when preceded by Pat
Or we can find the string “bean” when it is not part of the phrase “Java bean”:
(?<!Java ?)bean // The word bean, not preceded by Java
Lookbehind patterns need a bounded length in many Java versions, so we use ? here for the optional space rather than an open-ended *.
In these cases, the lookbehind and the matched text didn’t overlap because the lookbehind was before the matched text. But you can place a lookahead or lookbehind at either point—before or after the match—for example, we could also match Pat Niemeyer like this:
Niemeyer(?<=Pat Niemeyer)
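The following sketch reports where each lookaround pattern first matches. The firstStart() helper and the sample phrases are ours, and the “bean” pattern uses a bounded optional space (?) inside the lookbehind, which is an assumption on our part to keep it portable across Java versions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Returns the index where the regex first matches, or -1 if it never does.
    static int firstStart(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String names = "Pat and Patrick";
        System.out.println(firstStart("(?=Patrick)Pat", names));   // the Pat inside Patrick
        System.out.println(firstStart("(?!Patrick)Pat", names));   // the standalone Pat
        System.out.println(
            firstStart("(?<=Pat )Niemeyer", "Jane Niemeyer and Pat Niemeyer"));
        System.out.println(firstStart("(?<!Java ?)bean", "Java bean"));    // rejected
        System.out.println(firstStart("(?<!Java ?)bean", "coffee bean"));  // accepted
    }
}
```

In every case the match itself is just the three, four, or eight characters of Pat, bean, or Niemeyer; the lookaround text is required but never becomes part of group().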
Now that we’ve covered the theory of how to construct regular expressions, the hard part is over. All that’s left is to investigate the Java API for applying regexes: searching for them in strings, retrieving captured text, and replacing matches with substitution text.
As we’ve said, the regex patterns that we write as strings are, in actuality, little programs describing how to match text. At runtime, the Java regex package compiles these little programs into a form that it can execute against some target text. Several simple convenience methods accept strings directly to use as patterns. More generally, however, Java allows you to explicitly compile your pattern and encapsulate it in an instance of a Pattern object. This is the most efficient way to handle patterns that are used more than once, because it eliminates needlessly recompiling the string. To compile a pattern, we use the static method Pattern.compile():
Pattern urlPattern = Pattern.compile("\\w+://[\\w/]*");
Once you have a Pattern, you can ask it to create a Matcher object, which associates the pattern with a target string:
Matcher matcher = urlPattern.matcher(myText);
The matcher executes the matches. We’ll talk about that next. But before we do, we’ll just mention one convenience method of Pattern. The static method Pattern.matches() simply takes two strings (a regex and a target string) and determines if the target matches the regex. This is very convenient if you want to do a quick test once in your application. For example:
boolean match = Pattern.matches("\\d+\\.\\d+f?", myText);
This line of code can test if the string myText consists of a Java-style floating-point number such as “42.0f.” Note that the string must match completely in order to be considered a match.
A Matcher associates a pattern with a string and provides tools for testing, finding, and iterating over matches of the pattern against it. The Matcher is “stateful.” For example, the find() method tries to find the next match each time it is called. But you can clear the Matcher and start over by calling its reset() method.
If you’re just interested in “one big match” (that is, you’re expecting your string to either match the pattern or not), you can use matches() or lookingAt(). These correspond roughly to the methods equals() and startsWith() of the String class. The matches() method asks if the string matches the pattern in its entirety (with no string characters left over) and returns true or false. The lookingAt() method does the same, except that it asks only whether the string starts with the pattern and doesn’t care if the pattern uses up all the string’s characters.
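The difference between the three kinds of test can be sketched as follows. The helper method names are ours; each one creates a fresh Matcher so that the stateful behavior of one call does not affect another.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchKindsDemo {
    // Regex: \d+
    static final Pattern NUMBER = Pattern.compile("\\d+");

    static boolean wholeMatch(String s)  { return NUMBER.matcher(s).matches(); }
    static boolean prefixMatch(String s) { return NUMBER.matcher(s).lookingAt(); }
    static boolean anywhere(String s)    { return NUMBER.matcher(s).find(); }

    public static void main(String[] args) {
        System.out.println(wholeMatch("42 items"));    // extra text left over
        System.out.println(prefixMatch("42 items"));   // starts with a number
        System.out.println(prefixMatch("items 42"));   // does not start with one
        System.out.println(anywhere("items 42"));      // but one appears later

        // Statefulness: the second find() has nothing left until we reset().
        Matcher m = NUMBER.matcher("items 42");
        System.out.println(m.find());
        System.out.println(m.find());
        m.reset();
        System.out.println(m.find());
    }
}
```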
More generally, you’ll want to be able to search through the string and find one or more matches. To do this, you can use the find() method. Each call to find() returns true or false for the next match of the pattern and internally notes the position of the matching text. You can get the starting and ending character positions with the Matcher start() and end() methods, or you can simply retrieve the matched text with the group() method. For example:
import java.util.regex.*;

String text = "A horse is a horse, of course of course...";
String pattern = "horse|course";

Matcher matcher = Pattern.compile(pattern).matcher(text);
while (matcher.find())
    System.out.println("Matched: '" + matcher.group() + "' at position " + matcher.start());
The previous snippet prints the starting location of the words “horse” and “course” (four in all):
Matched: 'horse' at position 2
Matched: 'horse' at position 13
Matched: 'course' at position 23
Matched: 'course' at position 33
The method to retrieve the matched text is called group() because it refers to capture group zero (the entire match). You can also retrieve the text of other numbered capture groups by giving the group() method an integer argument. You can determine how many capture groups you have with the groupCount() method. The count does not include group zero, so the loop runs through groupCount() inclusive:
for (int i = 1; i <= matcher.groupCount(); i++)
    System.out.println(matcher.group(i));
A very common need is to parse a string into a bunch of fields based on some delimiter, such as a comma. It’s such a common problem that in Java 1.4, a method was added to the String class for doing just this. The split() method accepts a regular expression and returns an array of substrings broken around that pattern. For example:
String text = "Foo, bar , blah";
String[] fields = text.split("\\s*,\\s*");
yields a String array containing Foo, bar, and blah. You can control the maximum number of fields and also whether trailing “empty” strings (for text that might have appeared after a final delimiter) are kept, using an optional limit argument.
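A brief sketch of how the optional limit changes the result; the input strings are ours. A positive limit caps the number of fields, and a negative limit keeps trailing empty strings that are otherwise dropped.

```java
import java.util.Arrays;

class SplitDemo {
    public static void main(String[] args) {
        // Regex: \s*,\s*
        System.out.println(Arrays.toString("Foo, bar , blah".split("\\s*,\\s*")));

        String csv = "a,b,,c,,";
        System.out.println(csv.split(",").length);      // trailing empties dropped
        System.out.println(csv.split(",", -1).length);  // trailing empties kept
        System.out.println(Arrays.toString(csv.split(",", 2)));  // at most two fields
    }
}
```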
If you are going to use an operation like this more than a few times in your code, you should probably compile the pattern and use its split() method, which is identical to the version in String. The String split() method is equivalent to:
Pattern.compile(pattern).split(string);
As we mentioned when we introduced it, the Scanner class in Java 5.0 can use regular expressions to tokenize strings. You can specify a regular expression to use as the delimiter (instead of the default whitespace) either at construction time or with the useDelimiter() method. The Scanner next(), hasNext(), skip(), and findInLine() methods all take regular expressions as well. You can specify these either as strings or with a compiled Pattern object.
You can use the findInLine() method of Scanner as an improved Matcher. For example:
Scanner scanner = new Scanner("Quantity: 42 items, Price $2.34");
scanner.findInLine("[Qq]uantity[:\\s]*");
int quantity = scanner.nextInt();
scanner.findInLine("[Pp]rice.*\\$");
float price = scanner.nextFloat();
The previous snippet locates the quantity and price values, allowing for variations in capitalization and spacing before the numbers.
Before we move on, we’ll also mention a “Stupid Scanner Trick” that, although we don’t recommend it, you might find amusing. Using the \A boundary marker, which denotes the beginning of input, as a delimiter, we can tell the Scanner to return the whole input as a single string. This is an easy way to read the contents of any stream into one large string:
InputStream source = new URL("http://www.oreilly.com/").openStream();
String text = new Scanner(source).useDelimiter("\\A").next();
This is probably not the most efficient or understandable way to do it, but it may save you a little typing in your experimentation.
A common reason that you’ll find yourself searching for a pattern in a string is to change it to something else. The regex package not only makes it easy to do this but also provides a simple notation to help you construct replacement text using bits of the matched text.
The most convenient form of this API is Matcher’s replaceAll() method, which substitutes a replacement string for each occurrence of the pattern and returns the result. For example:
String text = "Richard Nixon's social security number is: 567-68-0515.";
Matcher matcher = Pattern.compile("\\d\\d\\d-\\d\\d-\\d\\d\\d\\d").matcher(text);
String output = matcher.replaceAll("XXX-XX-XXXX");
This code replaces all occurrences of U.S. government Social Security numbers with “XXX-XX-XXXX” (perhaps for privacy considerations).
. Literal substitution is nice, but we can make this more powerful by using capture groups in our substitution pattern. To do this, we use the simple convention of referring to numbered capture groups with the notation $n, where n is the group number. For example, suppose we wanted to show just a little of the Social Security number in the previous example, so that the user would know if we were talking about him. We could modify our regex to catch, for example, the last four digits like so:
    \d\d\d-\d\d-(\d\d\d\d)
We can then use that in the substitution text:
    String output = matcher.replaceAll("XXX-XX-$1");
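Putting the modified regex and the $1 replacement together, a complete version might look like the following sketch. The class name MaskDemo and the method mask() are hypothetical names chosen for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original example.
class MaskDemo {
    // Group 1 captures the last four digits so the replacement can reuse them.
    static final Pattern SSN = Pattern.compile("\\d\\d\\d-\\d\\d-(\\d\\d\\d\\d)");

    static String mask(String text) {
        // $1 in the replacement text refers to capture group 1 of each match.
        return SSN.matcher(text).replaceAll("XXX-XX-$1");
    }

    public static void main(String[] args) {
        System.out.println(mask("Richard Nixon's social security number is: 567-68-0515."));
        // prints: Richard Nixon's social security number is: XXX-XX-0515.
    }
}
```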
The static method Matcher.quoteReplacement() can be used to escape a literal string (so that it ignores the $ notation) before using it as replacement text.
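As a small sketch of why this matters, consider replacement text that happens to contain a dollar sign. The class QuoteDemo, the method fillPrice(), and the PRICE placeholder below are invented for this illustration. Without quoting, a replacement such as "$1.50" would be read as a reference to group 1, which this pattern does not have, so replaceAll() would fail with an IndexOutOfBoundsException.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original example.
class QuoteDemo {
    // The placeholder word PRICE is an assumption made for this sketch.
    static final Pattern PLACEHOLDER = Pattern.compile("PRICE");

    static String fillPrice(String text, String literalValue) {
        // quoteReplacement() escapes $ and \ so the value is inserted verbatim.
        String safe = Matcher.quoteReplacement(literalValue);
        return PLACEHOLDER.matcher(text).replaceAll(safe);
    }

    public static void main(String[] args) {
        System.out.println(fillPrice("Total: PRICE", "$1.50"));
        // prints: Total: $1.50
    }
}
```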
The replaceAll() method is useful, but you may want more control over each substitution. You may want to change each match to something different or base the change on the match in some programmatic way. To do this, you can use the Matcher appendReplacement() and appendTail() methods. These methods can be used in conjunction with the find() method as you iterate through matches to build a replacement string. appendReplacement() and appendTail() operate on a StringBuffer that you supply. The appendReplacement() method builds a replacement string by keeping track of where you are in the text and appending all nonmatched text to the buffer for you, as well as the substitute text that you supply. After each successful find(), a call to appendReplacement() appends the intervening text since the previous match, followed by your replacement, then skips over all the matched characters to prepare for the next one. Finally, when you have passed the last match, you should call appendTail(), which appends any remaining text after the last match. We’ll show an example of this next, as we build a simple “template engine.”
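Before the fuller example, here is a minimal sketch of the find(), appendReplacement(), and appendTail() loop on its own. The class AppendDemo and the idea of doubling every number are assumptions chosen only to show a replacement that is computed from each match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original example.
class AppendDemo {
    // Replaces every run of digits with twice its numeric value.
    static String doubleNumbers(String text) {
        Matcher matcher = Pattern.compile("\\d+").matcher(text);
        StringBuffer buffer = new StringBuffer();
        while (matcher.find()) {
            int n = Integer.parseInt(matcher.group());
            // Appends the text since the previous match, then our computed value.
            matcher.appendReplacement(buffer, Integer.toString(n * 2));
        }
        // Appends whatever follows the last match.
        matcher.appendTail(buffer);
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleNumbers("3 apples and 12 pears"));
        // prints: 6 apples and 24 pears
    }
}
```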
Let’s tie what we’ve discussed together in a nifty example. A common problem in Java applications is working with bulky, multiline text. In general, you don’t want to store the text of messages in your application code because it makes them difficult to edit or internationalize. But when you move them to external files or resources, you need a way for your application to plug in information at runtime. The best example of this is in Java servlets; a generated HTML page is often 99% static text with only a few “variable” pieces plugged in. Technologies such as JSP and XSL were developed to address this. But these are big tools, and we have a simple problem. So let’s create a simple solution—a template engine.
Our template engine reads text containing special template tags and substitutes values that we provide. And because generating HTML or XML is one of the most important applications of this, we’ll be friendly to those formats by making our tags conform to the style of an XML comment. Specifically, our engine searches the text for tags that look like this:
    <!-- TEMPLATE:name This is the template for the user name -->
XML-style comments start with <!-- and can contain anything up to a closing -->. We’ll add the convention of requiring a TEMPLATE:name field to specify the name of the value we want to use. Aside from that, we’ll still allow any descriptive text the user wants to include. To be friendly (and consistent), we’ll allow any amount of whitespace to appear in the tags, including multiline text in the comments. We’ll also ignore the text case of the “TEMPLATE” identifier, just in case. Now, we could do this all with low-level String commands, looping over whitespace and taking many substrings. But using the power of regexes, we can do it much more cleanly and with only about seven lines of relevant code. (We’ve rounded out the example with a few more to make it more useful.)
    import java.util.*;
    import java.util.regex.*;
    public class Template {
        // Values to substitute, keyed by template name
        Properties values = new Properties();
        Pattern templateComment =
            Pattern.compile("(?si)<!--\\s*TEMPLATE:(\\w+).*?-->");
        public void set(String name, String value) {
            values.setProperty(name, value);
        }
        public String fillIn(String text) {
            Matcher matcher = templateComment.matcher(text);
            StringBuffer buffer = new StringBuffer();
            while (matcher.find()) {
                String name = matcher.group(1);
                String value = values.getProperty(name);
                matcher.appendReplacement(buffer, value);
            }
            matcher.appendTail(buffer);
            return buffer.toString();
        }
    }
You’d use the Template class like this:
    String input = "<!-- TEMPLATE:name --> lives at "
        + "<!-- TEMPLATE:address -->";
    Template template = new Template();
    template.set("name", "Bob");
    template.set("address", "1234 Main St.");
    String output = template.fillIn(input);
In this code, input is a string containing tags for name and address. The set() method provides the values for those tags.
Let’s start by picking apart the regex, templateComment, in the example:
    (?si)<!--\s*TEMPLATE:(\w+).*?-->
It looks scary, but it’s actually very simple. Just start reading from left to right. First, we have the special flags declaration (?si) telling the regex engine that it should be in single-line mode, with the dot matching all characters including newlines (s), and ignoring case (i). Next, there is the literal <!-- followed by any amount of whitespace (\s*) and the TEMPLATE: identifier. After the colon, we have a capture group (\w+), which reads our name identifier and saves it for us to retrieve later. We allow anything (.*) up to the -->, being careful to specify that .* should be nongreedy (.*?). We don’t want .* to consume other opening and closing comment tags all the way to the last one, but instead to find the smallest match (one tag).
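To see the effect of the nongreedy quantifier, the following sketch runs both variants of the pattern over text containing two tags and collects the captured names. The class GreedyDemo, the method tagNames(), and the sample text are our own assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original example.
class GreedyDemo {
    static final String NONGREEDY = "(?si)<!--\\s*TEMPLATE:(\\w+).*?-->";
    static final String GREEDY = "(?si)<!--\\s*TEMPLATE:(\\w+).*-->";

    // Returns the contents of capture group 1 for every match of regex in text.
    static List<String> tagNames(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        List<String> names = new ArrayList<>();
        while (matcher.find()) {
            names.add(matcher.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // The second tag is lowercase and spans two lines on purpose.
        String text = "<!-- TEMPLATE:name --> lives at <!-- template:address\n home -->";
        System.out.println(tagNames(NONGREEDY, text)); // prints: [name, address]
        System.out.println(tagNames(GREEDY, text));    // prints: [name]
    }
}
```

With the greedy form, the single match runs from the first opening tag to the last closing one, so the second tag is swallowed and never reported.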
Our fillIn() method does the work, accepting a template string, searching it, and “replacing” the tag values with the values from set(), which we have stored in a Properties table. Each time fillIn() is called, it creates a Matcher to wrap the input string and get ready to apply the pattern. It then creates a temporary StringBuffer to hold the output and loops, using the Matcher find() method to get each tag. For each match, it retrieves the value of the capture group (group one) that holds the tag name. It looks up the corresponding value and replaces the tag with this value in the output string buffer using the appendReplacement() method. (Remember that appendReplacement() fills in the intervening text on each call, so we don’t have to.) All that remains is to call appendTail() at the end to get the remaining text after the last match and return the string value. That’s it!
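One thing the example glosses over is what happens when a value contains a $ or a backslash, or when a tag names a value that was never set; in the latter case, getProperty() returns null and appendReplacement() would fail. The following is a sketch of one possible hardening, not the only reasonable one. The class name SafeTemplate and the decision to leave unknown tags in place are assumptions of ours.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variant of the Template class, for illustration only.
class SafeTemplate {
    private final Properties values = new Properties();
    private final Pattern templateComment =
        Pattern.compile("(?si)<!--\\s*TEMPLATE:(\\w+).*?-->");

    void set(String name, String value) {
        values.setProperty(name, value);
    }

    String fillIn(String text) {
        Matcher matcher = templateComment.matcher(text);
        StringBuffer buffer = new StringBuffer();
        while (matcher.find()) {
            // Assumption: an unknown name keeps the original tag text as its value.
            String value = values.getProperty(matcher.group(1), matcher.group());
            // Quote the value so $ and \ are inserted literally.
            matcher.appendReplacement(buffer, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(buffer);
        return buffer.toString();
    }

    public static void main(String[] args) {
        SafeTemplate template = new SafeTemplate();
        template.set("price", "$5");
        System.out.println(template.fillIn("Cost: <!-- TEMPLATE:price -->, by <!-- TEMPLATE:who -->"));
        // prints: Cost: $5, by <!-- TEMPLATE:who -->
    }
}
```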
We hope this section has shown you some of the power provided by these tools and whetted your appetite for more. Regexes allow you to work in ways you may not have considered before. Especially now, when the software world is focused on textual representations of almost everything—from data to user interfaces—via XML and HTML, having powerful text-manipulation tools is fundamental. Just remember to keep those regexes simple so you can reuse them again and again.