Chapter 4. Pattern Matching with Regular Expressions
Introduction
Suppose you have been on the Internet for a few years and have been very faithful about saving all your correspondence, just in case you (or your lawyers, or the prosecution) need a copy. The result is that you have a 5 GB disk partition dedicated to saved mail. And let’s further suppose that you remember that somewhere in there is an email message from someone named Angie or Anjie. Or was it Angy? But you don’t remember what you called it or where you stored it. Obviously, you have to look for it.
But while some of you go and try to open up all 15,000,000 documents in a word processor, I’ll just find it with one simple command. Any system that provides regular expression support allows me to search for the pattern in several ways. The simplest to understand is:
Angie|Anjie|Angy
which you can probably guess means just to search for any of the variations. A more concise form (“more thinking, less typing”) is:
An[^ dn]
The syntax will become clear as we go through this chapter. Briefly, the “A” and the “n” match themselves, in effect finding words that begin with “An”, while the cryptic [^ dn] requires the “An” to be followed by a character other than (^ means not in this context) a space (to eliminate the very common English word “an” at the start of a sentence) or “d” (to eliminate the common word “and”) or “n” (to eliminate Anne, Announcing, etc.). Has your word processor gotten past its splash screen yet? Well, it doesn’t matter, because I’ve already found the missing file. To find the answer, I just typed the command:
grep 'An[^ dn]' *
Regular expressions, or regexes for short, provide a concise and precise specification of patterns to be matched in text.
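The same two patterns can be tried from Java. What follows is only a minimal sketch: the class name AngieSearch, the helper found(), and the sample lines are made up for illustration, and find() is used because, like grep, it reports a match anywhere in the line.

```java
import java.util.regex.Pattern;

public class AngieSearch {
    // The two patterns from the text, compiled once for reuse.
    static final Pattern ALTERNATION = Pattern.compile("Angie|Anjie|Angy");
    static final Pattern CONCISE = Pattern.compile("An[^ dn]");

    /** Returns true if the pattern occurs anywhere in the line, as grep would report. */
    static boolean found(Pattern p, String line) {
        return p.matcher(line).find();
    }

    public static void main(String[] args) {
        String[] lines = {   // made-up sample data
            "Lunch with Anjie on Friday",
            "An apple and a pear",
            "Anne is announcing the winners",
            "Answer by Monday, please"
        };
        for (String line : lines) {
            System.out.println(found(ALTERNATION, line) + "\t"
                + found(CONCISE, line) + "\t" + line);
        }
    }
}
```

Note that the concise form casts a wider net than the alternation: it also accepts a line such as “Answer by Monday, please”, which is the price of less typing.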
As another example of the power of regular expressions, consider the problem of bulk-updating hundreds of files. When I started with Java, the syntax for declaring array references was baseType arrayVariableName[]. For example, a method with an array argument, such as every program’s main method, was commonly written as:
public static void main(String args[]) {
But as time went by, it became clear to the stewards of the Java language that it would be better to write it as baseType[] arrayVariableName. For example:
public static void main(String[] args) {
This is better Java style because it associates the “array-ness” of the type with the type itself, rather than with the local argument name, and the compiler now accepts both modes.
I wanted to change all occurrences of main written the old way to the new way. I used the pattern main(String [a-z] with the grep utility described earlier to find the names of all the files containing old-style main declarations (i.e., main(String followed by a space and a name character rather than an open square bracket). I then used another regex-based Unix tool, the stream editor sed, in a little shell script to change all occurrences in those files from main(String *([a-z][a-z]*)[] to main(String[] $1 (the syntax used here is discussed later in this chapter). Again, the regex-based approach was orders of magnitude faster than doing it interactively, even using a reasonably powerful editor such as vi or emacs, let alone trying to use a graphical word processor.
Historically, the syntax of regexes has changed as they get incorporated into more tools and more languages, so the exact syntax in the previous examples is not exactly what you’d use in Java, but it does convey the conciseness and power of the regex mechanism.[17]
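For comparison, here is roughly how the same substitution might look in Java syntax. This is a sketch under stated assumptions, not the original sed script: the class MainDeclFixer and the method modernize() are hypothetical names, and the regex is my adaptation of the pattern above (parentheses and square brackets are escaped, and each backslash is doubled for the Java compiler).

```java
public class MainDeclFixer {
    /**
     * Rewrites old-style "main(String args[]" as "main(String[] args".
     * The regex is an adaptation for Java, not the original sed script.
     */
    static String modernize(String line) {
        // Group 1 captures the argument name; $1 reuses it in the replacement.
        return line.replaceAll(
            "main\\(String\\s+([a-z]\\w*)\\s*\\[\\]",
            "main(String[] $1");
    }

    public static void main(String[] args) {
        System.out.println(modernize("public static void main(String args[]) {"));
        System.out.println(modernize("public static void main(String[] argv) {"));
    }
}
```

A line that already uses the new style is left alone, because the pattern insists on whitespace and a name directly after String.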
As a third example, consider parsing an Apache web server logfile, where some fields are delimited with quotes, others with square brackets, and others with spaces. Writing ad-hoc code to parse this is messy in any language, but a well-crafted regex can break the line into all its constituent fields in one operation (this example is developed in Program: Apache Logfile Parsing).
These same time gains can be had by Java developers. Regular expression support has been in the standard Java runtime for ages and is well integrated (e.g., there are regex methods in the standard class java.lang.String and in the “new I/O” package). There are a few other regex packages for Java, and you may occasionally encounter code using them, but pretty well all code from this century can be expected to use the built-in package. The syntax of Java regexes themselves is discussed in Regular Expression Syntax, and the syntax of the Java API for using regexes is described in Using regexes in Java: Test for a Pattern. The remaining recipes show some applications of regex technology in Java.
See Also
Mastering Regular Expressions by Jeffrey Friedl (O’Reilly) is the definitive guide to all the details of regular expressions. Most introductory books on Unix and Perl include some discussion of regexes; Unix Power Tools devotes a chapter to them.
Regular Expression Syntax
Solution
Consult Table 4-1 for a list of the regular expression characters.
Discussion
These pattern characters let you specify regexes of considerable power. In building patterns, you can use any combination of ordinary text and the metacharacters, or special characters, in Table 4-1. These can all be used in any combination that makes sense. For example, a+ means any number of occurrences of the letter a, from one up to a million or a gazillion. The pattern Mrs?\. matches Mr. or Mrs. And .* means “any character, any number of times,” and is similar in meaning to most command-line interpreters’ meaning of the * alone. The pattern \d+ means one or more numeric digits. \d{2,3} means a two- or three-digit number.
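A quick way to convince yourself of these meanings is to try them with String.matches(), which requires the whole string to match. The class QuantifierBasics and its helper whole() below are illustrative names only; remember that each backslash is doubled inside a Java string literal.

```java
public class QuantifierBasics {
    /** True if the entire input matches the regex. */
    static boolean whole(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(whole("a+", "aaa"));          // one or more a's
        System.out.println(whole("a+", ""));             // zero a's is not enough
        System.out.println(whole("Mrs?\\.", "Mr."));     // the s is optional
        System.out.println(whole("Mrs?\\.", "Mrs."));
        System.out.println(whole("Mrs?\\.", "Mrx"));     // \. demands a literal period
        System.out.println(whole("\\d{2,3}", "42"));     // two or three digits
        System.out.println(whole("\\d{2,3}", "1234"));   // four digits is too many
    }
}
```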
Subexpression | Matches | Notes |
General | ||
^ | Start of line/string | |
$ | End of line/string | |
\b | Word boundary | |
\B | Not a word boundary | |
\A | Beginning of entire string | |
\z | End of entire string | |
\Z | End of entire string (except allowable final line terminator) | |
. | Any one character (except line terminator) | |
[…] | “Character class”; any one character from those listed | |
[^…] | Any one character not from those listed | |
Alternation and Grouping | ||
(…) | Grouping (capture groups) | |
x|y | Alternation | |
(?:re) | Noncapturing parenthesis | |
\G | End of the previous match | |
\n | Back-reference to capture group number "n" | |
Normal (greedy) quantifiers | ||
{m,n} | Quantifier for “from m to n repetitions” | |
{m,} | Quantifier for "m or more repetitions" | |
{m} | Quantifier for “exactly m repetitions” | |
{0,n} | Quantifier for 0 up to n repetitions | |
* | Quantifier for 0 or more repetitions | Short for {0,} |
+ | Quantifier for 1 or more repetitions | Short for {1,} |
? | Quantifier for 0 or 1 repetitions (i.e., present exactly once, or not at all) | Short for {0,1} |
Reluctant (non-greedy) quantifiers | ||
{m,n}? | Reluctant quantifier for “from m to n repetitions” | |
{m,}? | Reluctant quantifier for "m or more repetitions" | |
{0,n}? | Reluctant quantifier for 0 up to n repetitions | |
*? | Reluctant quantifier: 0 or more | |
+? | Reluctant quantifier: 1 or more | |
?? | Reluctant quantifier: 0 or 1 times | |
Possessive (very greedy) quantifiers | ||
{m,n}+ | Possessive quantifier for “from m to n repetitions” | |
{m,}+ | Possessive quantifier for "m or more repetitions" | |
{0,n}+ | Possessive quantifier for 0 up to n repetitions | |
*+ | Possessive quantifier: 0 or more | |
++ | Possessive quantifier: 1 or more | |
?+ | Possessive quantifier: 0 or 1 times | |
Escapes and shorthands | ||
\ | Escape (quote) character: turns most metacharacters off; turns subsequent alphabetic into metacharacters | |
\Q | Escape (quote) all characters up to \E | |
\E | Ends quoting begun with \Q | |
\t | Tab character | |
\r | Return (carriage return) character | |
\n | Newline character | |
\f | Form feed | |
\w | Character in a word | Use \w+ for a word |
\W | A nonword character | |
\d | Numeric digit | Use \d+ for an integer |
\D | A nondigit character | |
\s | Whitespace | Space, tab, etc.; by default the same as [ \t\n\x0B\f\r] |
\S | A nonwhitespace character | |
Unicode blocks (representative samples) | ||
\p{InGreek} | A character in the Greek block | (Simple block) |
\P{InGreek} | Any character not in the Greek block | |
\p{Lu} | An uppercase letter | (Simple category) |
\p{Sc} | A currency symbol | |
POSIX-style character classes (defined only for US-ASCII) | ||
\p{Alnum} | Alphanumeric characters | [A-Za-z0-9] |
\p{Alpha} | Alphabetic characters | [A-Za-z] |
\p{ASCII} | Any ASCII character | [\x00-\x7F] |
\p{Blank} | Space and tab characters | |
\p{Space} | Space characters | [ \t\n\x0B\f\r] |
\p{Cntrl} | Control characters | [\x00-\x1F\x7F] |
\p{Digit} | Numeric digit characters | [0-9] |
\p{Graph} | Printable and visible characters (not spaces or control characters) | |
\p{Print} | Printable characters | |
\p{Punct} | Punctuation characters | One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ |
\p{Lower} | Lowercase characters | [a-z] |
\p{Upper} | Uppercase characters | [A-Z] |
\p{XDigit} | Hexadecimal digit characters | [0-9a-fA-F] |
Regexes match anyplace possible in the string. Patterns followed by greedy quantifiers (the only type that existed in traditional Unix regexes) consume (match) as much as possible without compromising any subexpressions that follow; patterns followed by possessive quantifiers match as much as possible without regard to following subexpressions; patterns followed by reluctant quantifiers consume as few characters as possible to still get a match.
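The difference among the three quantifier families is easiest to see side by side. In this sketch (the class QuantifierKinds, the helper firstMatch(), and the sample input are all invented for illustration), the same input is searched with a greedy, a reluctant, and a possessive version of one pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierKinds {
    /** Returns the text of the first match, or null if the regex is not found. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<b>bold</b> and <i>italic</i>";
        // Greedy: .+ grabs as much as it can, then backs off just enough for the final >
        System.out.println(firstMatch("<.+>", input));
        // Reluctant: .+? stops at the first > it can reach
        System.out.println(firstMatch("<.+?>", input));
        // Possessive: .++ swallows the rest of the line and never gives the > back
        System.out.println(firstMatch("<.++>", input));
    }
}
```

The greedy form returns the whole line, the reluctant form returns just the first tag, and the possessive form finds nothing at all because it leaves no closing angle bracket for the rest of the pattern.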
Also, unlike regex packages in some other languages, the Java regex package was designed to handle Unicode characters from the beginning. And the standard Java escape sequence \unnnn is used to specify a Unicode character in the pattern. We use methods of java.lang.Character to determine Unicode character properties, such as whether a given character is a space. Again, note that the backslash must be doubled if this is in a Java string that is being compiled because the compiler would otherwise parse this as “backslash-u” followed by some numbers.
To help you learn how regexes work, I provide a little program called REDemo.[18] The code for REDemo is too long to include in the book; in the online directory regex of the darwinsys-api repo, you will find REDemo.java, which you can run to explore how regexes work.
In the uppermost text box (see Figure 4-1), type the regex pattern you want to test. Note that as you type each character, the regex is checked for syntax; if the syntax is OK, you see a checkmark beside it. You can then select Match, Find, or Find All. Match means that the entire string must match the regex, and Find means the regex must be found somewhere in the string (Find All counts the number of occurrences that are found). Below that, you type a string that the regex is to match against. Experiment to your heart’s content. When you have the regex the way you want it, you can paste it into your Java program. You’ll need to escape (backslash) any characters that are treated specially by both the Java compiler and the Java regex package, such as the backslash itself, double quotes, and others (see the following sidebar). Once you get a regex the way you want it, there is a “Copy” button (not shown in these screenshots) to export the regex to the clipboard, with or without backslash doubling depending on how you want to use it.
In Figure 4-1, I typed qu into the REDemo program’s Pattern box, which is a syntactically valid regex pattern: any ordinary characters stand as regexes for themselves, so this looks for the letter q followed by u. In the top version, I typed only a q into the string, which is not matched. In the second, I have typed quack and the q of a second quack. Because I have selected Find All, the count shows one match. As soon as I type the second u, the count is updated to two, as shown in the third version.
Regexes can do far more than just character matching. For example, the two-character regex ^T would match beginning of line (^) immediately followed by a capital T; that is, any line beginning with a capital T. It doesn’t matter whether the line begins with Tiny trumpets, Titanic tubas, or Triumphant twisted trombones, as long as the capital T is present in the first position.
But here we’re not very far ahead. Have we really invested all this effort in regex technology just to be able to do what we could already do with the java.lang.String method startsWith()? Hmmm, I can hear some of you getting a bit restless. Stay in your seats! What if you wanted to match not only a letter T in the first position, but also a vowel (a, e, i, o, or u) immediately after it, followed by any number of letters in a word, followed by an exclamation point? Surely you could do this in Java by checking startsWith("T") and charAt(1) == 'a' || charAt(1) == 'e', and so on? Yes, but by the time you did that, you’d have written a lot of very highly specialized code that you couldn’t use in any other application. With regular expressions, you can just give the pattern ^T[aeiou]\w*!. That is, ^ and T as before, followed by a character class listing the vowels, followed by any number of word characters (\w*), followed by the exclamation point.
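Here is that pattern at work, as a small sketch. The class VowelAfterT, the method starts(), and the test phrases are hypothetical; the only thing taken from the text is the pattern itself, with its backslash doubled for the Java compiler.

```java
import java.util.regex.Pattern;

public class VowelAfterT {
    // ^T, then one vowel, then any number of word characters, then '!'
    static final Pattern P = Pattern.compile("^T[aeiou]\\w*!");

    /** True if the line begins with text fitting the pattern. */
    static boolean starts(String line) {
        return P.matcher(line).find();
    }

    public static void main(String[] args) {
        String[] tests = {
            "Tada! It works",                 // T, vowel, word chars, '!'
            "Tubas!",
            "Tiny trumpets!",                 // a space intervenes before the '!'
            "Triumphant twisted trombones!"   // 'r' is not a vowel
        };
        for (String t : tests) {
            System.out.println(starts(t) + "\t" + t);
        }
    }
}
```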
“But wait, there’s more!” as my late, great boss Yuri Rubinsky used to say. What if you want to be able to change the pattern you’re looking for at runtime? Remember all that Java code you just wrote to match T in column 1, plus a vowel, some word characters, and an exclamation point? Well, it’s time to throw it out. Because this morning we need to match Q, followed by a letter other than u, followed by a number of digits, followed by a period. While some of you start writing a new function to do that, the rest of us will just saunter over to the RegEx Bar & Grille, order a ^Q[^u]\d+\. from the bartender, and be on our way.
OK, the [^u] means match any one character that is not the character u. The \d+ means one or more numeric digits. The + is a quantifier meaning one or more occurrences of what it follows, and \d is any one numeric digit. So \d+ means a number with one, two, or more digits. Finally, the \.? Well, . by itself is a metacharacter. Most single metacharacters are switched off by preceding them with an escape character. Not the Esc key on your keyboard, of course. The regex “escape” character is the backslash. Preceding a metacharacter like . with this escape turns off its special meaning, so we look for a literal period rather than “any character.” Preceding a few selected alphabetic characters (e.g., n, r, t, s, w) with escape turns them into metacharacters. Figure 4-2 shows the ^Q[^u]\d+\..+ regex in action. In the first frame, I have typed part of the regex as ^Q[^u and because there is an unclosed square bracket, the Syntax OK flag is turned off; when I complete the regex, it will be turned back on. In the second frame, I have finished typing the regex, and typed the data string as QA577 (which you should expect to match the ^Q[^u]\d+, but not the period since I haven’t typed it). In the third frame, I’ve typed the period so the Matches flag is set to Yes.
One good way to think of regular expressions is as a “little language” for matching patterns of characters in text contained in strings. Give yourself extra points if you’ve already recognized this as the design pattern known as Interpreter. A regular expression API is an interpreter for matching regular expressions.
So now you should have at least a basic grasp of how regexes work in practice. The rest of this chapter gives more examples and explains some of the more powerful topics, such as capture groups. As for how regexes work in theory (and there are a lot of theoretical details and differences among regex flavors), the interested reader is referred to Mastering Regular Expressions. Meanwhile, let’s start learning how to write Java programs that use regular expressions.
Using regexes in Java: Test for a Pattern
Problem
You’re ready to get started using regular expression processing to beef up your Java code by testing to see if a given pattern can match in a given string.
Solution
Use the Java Regular Expressions Package, java.util.regex.
Discussion
The good news is that the Java API for regexes is actually easy to use. If all you need is to find out whether a given regex matches a string, you can use the convenient boolean matches() method of the String class, which accepts a regex pattern in String form as its argument:
if (inputString.matches(stringRegexPattern)) {
    // it matched... do something with it...
}
This is, however, a convenience routine, and convenience always comes at a price. If the regex is going to be used more than once or twice in a program, it is more efficient to construct and use a Pattern and its Matcher(s). A complete program constructing a Pattern and using it to match is shown here:
import java.util.regex.Pattern;

public class RESimple {
    public static void main(String[] argv) {
        String pattern = "^Q[^u]\\d+\\.";
        String[] input = {
            "QA777. is the next flight. It is on time.",
            "Quack, Quack, Quack!"
        };

        // Compile once, then reuse the Pattern for every input line
        Pattern p = Pattern.compile(pattern);

        for (String in : input) {
            // lookingAt() tests for a match anchored at the start of the input
            boolean found = p.matcher(in).lookingAt();

            System.out.println("'" + pattern + "'" +
                (found ? " matches '" : " doesn't match '") + in + "'");
        }
    }
}
The java.util.regex package consists mainly of two classes, Pattern and Matcher, which provide the public API shown in Example 4-1.
/** The main public API of the java.util.regex package.
 * Prepared by javap and Ian Darwin.
 */

package java.util.regex;

public final class Pattern {
    // Flags values ('or' together)
    public static final int
        UNIX_LINES, CASE_INSENSITIVE, COMMENTS, MULTILINE,
        DOTALL, UNICODE_CASE, CANON_EQ;
    // No public constructors; use these Factory methods
    public static Pattern compile(String patt);
    public static Pattern compile(String patt, int flags);
    // Method to get a Matcher for this Pattern
    public Matcher matcher(CharSequence input);
    // Information methods
    public String pattern();
    public int flags();
    // Convenience methods
    public static boolean matches(String pattern, CharSequence input);
    public String[] split(CharSequence input);
    public String[] split(CharSequence input, int max);
}

public final class Matcher {
    // Action: find or match methods
    public boolean matches();
    public boolean find();
    public boolean find(int start);
    public boolean lookingAt();
    // "Information about the previous match" methods
    public int start();
    public int start(int whichGroup);
    public int end();
    public int end(int whichGroup);
    public int groupCount();
    public String group();
    public String group(int whichGroup);
    // Reset methods
    public Matcher reset();
    public Matcher reset(CharSequence newInput);
    // Replacement methods
    public Matcher appendReplacement(StringBuffer where, String newText);
    public StringBuffer appendTail(StringBuffer where);
    public String replaceAll(String newText);
    public String replaceFirst(String newText);
    // information methods
    public Pattern pattern();
}

/* String, showing only the RE-related methods */
public final class String {
    public boolean matches(String regex);
    public String replaceFirst(String regex, String newStr);
    public String replaceAll(String regex, String newStr);
    public String[] split(String regex);
    public String[] split(String regex, int max);
}
This API is large enough to require some explanation. The normal steps for regex matching in a production program are:
- Create a Pattern by calling the static method Pattern.compile().
- Request a Matcher from the pattern by calling pattern.matcher(CharSequence) for each String (or other CharSequence) you wish to look through.
- Call (once or more) one of the finder methods (discussed later in this section) in the resulting Matcher.
The java.lang.CharSequence interface provides simple read-only access to objects containing a collection of characters. The standard implementations are String and StringBuffer/StringBuilder (described in Chapter 3), and the “new I/O” class java.nio.CharBuffer.
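Because matcher() accepts any CharSequence, one helper can search all of these types without conversion. The following is a minimal sketch; the class CharSequenceDemo and the method countMatches() are names I made up for the illustration.

```java
import java.nio.CharBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharSequenceDemo {
    /** Counts how many times the pattern is found in any kind of CharSequence. */
    static int countMatches(Pattern p, CharSequence input) {
        Matcher m = p.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("qu");
        System.out.println(countMatches(p, "quack quack"));                   // a String
        System.out.println(countMatches(p, new StringBuilder("quick quiz"))); // a StringBuilder
        System.out.println(countMatches(p, CharBuffer.wrap("antique")));      // an NIO CharBuffer
    }
}
```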
Of course, you can perform regex matching in other ways, such as using the convenience methods in Pattern or even in java.lang.String. For example:
public class StringConvenience {
    public static void main(String[] argv) {

        String pattern = ".*Q[^u]\\d+\\..*";
        String line = "Order QT300. Now!";
        if (line.matches(pattern)) {
            System.out.println(line + " matches \"" + pattern + "\"");
        } else {
            System.out.println("NO MATCH");
        }
    }
}
But the three-step list just described is the “standard” pattern for matching. You’d likely use the String convenience routine in a program that only used the regex once; if the regex were being used more than once, it is worth taking the time to “compile” it because the compiled version runs faster.
In addition, the Matcher has several finder methods, which provide more flexibility than the String convenience routine matches(). The Matcher methods are:
matches()
- Used to compare the entire string against the pattern; this is the same as the routine in java.lang.String. Because it matches the entire String, I had to put .* before and after the pattern.
lookingAt()
- Used to match the pattern only at the beginning of the string.
find()
- Used to match the pattern in the string (not necessarily at the first character of the string), starting at the beginning of the string or, if the method was previously called and succeeded, at the first character not matched by the previous match.
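The three finder methods can be compared directly by applying one pattern to inputs that differ only in where the matching text sits. This is just a sketch; the class FinderMethods and its kinds() helper are hypothetical, and a fresh Matcher is created for each call so that no state carries over between them.

```java
import java.util.regex.Pattern;

public class FinderMethods {
    static final Pattern P = Pattern.compile("Q[^u]\\d+\\.");

    /** Reports the result of each finder method for one input string. */
    static String kinds(String input) {
        return "matches=" + P.matcher(input).matches()
            + " lookingAt=" + P.matcher(input).lookingAt()
            + " find=" + P.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(kinds("QT300."));             // the pattern is the whole input
        System.out.println(kinds("QT300. Now!"));        // the pattern is at the start only
        System.out.println(kinds("Order QT300. Now!"));  // the pattern is in the middle
    }
}
```

Only the first input satisfies matches(); the second also satisfies lookingAt(); all three satisfy find().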
Each of these methods returns boolean, with true meaning a match and false meaning no match. To check whether a given string matches a given pattern, you need only type something like the following:
Matcher m = Pattern.compile(patt).matcher(line);
if (m.find()) {
    System.out.println(line + " matches " + patt);
}
But you may also want to extract the text that matched, which is the subject of the next recipe.
The following recipes cover uses of this API. Initially, the examples just use arguments of type String as the input source. Use of other CharSequence types is covered in Printing All Occurrences of a Pattern.
Finding the Matching Text
Solution
Sometimes you need to know more than just whether a regex matched a string. In editors and many other tools, you want to know exactly what characters were matched. Remember that with quantifiers such as *, the length of the text that was matched may have no relationship to the length of the pattern that matched it. Do not underestimate the mighty .*, which happily matches thousands or millions of characters if allowed to. As you saw in the previous recipe, you can find out whether a given match succeeds just by using find() or matches(). But in other applications, you will want to get the characters that the pattern matched.
After a successful call to one of the preceding methods, you can use these “information” methods to get information on the match:
start(), end()
- Return the position in the string of the first character that matched and the position just past the last character that matched, respectively.
groupCount()
- Returns the number of parenthesized capture groups, if any; returns 0 if no groups were used.
group(int i)
- Returns the characters matched by group i of the current match, if i is greater than or equal to zero and less than or equal to the return value of groupCount(). Group 0 is the entire match, so group(0) (or just group()) returns the entire portion of the input that matched.
The notion of parentheses or “capture groups” is central to regex processing. Regexes may be nested to any level of complexity. The group(int) method lets you retrieve the characters that matched a given parenthesis group. If you haven’t used any explicit parens, you can just treat whatever matched as “level zero.” Example 4-2 shows part of REMatch.java.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class REmatch {
    public static void main(String[] argv) {

        String patt = "Q[^u]\\d+\\.";
        Pattern r = Pattern.compile(patt);
        String line = "Order QT300. Now!";
        Matcher m = r.matcher(line);
        if (m.find()) {
            // group(0) is the entire text that the pattern matched
            System.out.println(patt + " matches \"" +
                m.group(0) +
                "\" in \"" + line + "\"");
        } else {
            System.out.println("NO MATCH");
        }
    }
}
When run, this prints:
Q[^u]\d+\. matches "QT300." in "Order QT300. Now!"
An extended version of the REDemo program presented in Using regexes in Java: Test for a Pattern, called REDemo2, provides a display of all the capture groups in a given regex; one example is shown in Figure 4-3.
It is also possible to get the starting and ending indices and the length of the text that the pattern matched (remember that terms with quantifiers, such as the \d+ in this example, can match an arbitrary number of characters in the string). You can use these in conjunction with the String.substring() methods as follows:
String patt = "Q[^u]\\d+\\.";
Pattern r = Pattern.compile(patt);
String line = "Order QT300. Now!";
Matcher m = r.matcher(line);
if (m.find()) {
    System.out.println(patt + " matches \"" +
        line.substring(m.start(0), m.end(0)) +
        "\" in \"" + line + "\"");
} else {
    System.out.println("NO MATCH");
}
Suppose you need to extract several items from a string. If the input is:
Smith, John
Adams, John Quincy
and you want to get out:
John Smith
John Quincy Adams
you can use parenthesized capture groups, as in this program, which handles one such input line:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class REmatchTwoFields {
    public static void main(String[] args) {
        String inputLine = "Adams, John Quincy";
        // Construct an RE with parens to "grab" both field1 and field2
        Pattern r = Pattern.compile("(.*), (.*)");
        Matcher m = r.matcher(inputLine);
        if (!m.matches())
            throw new IllegalArgumentException("Bad input");
        // Print group 2 (given names) before group 1 (family name)
        System.out.println(m.group(2) + ' ' + m.group(1));
    }
}
Replacing the Matched Text
As we saw in the previous recipe, regex patterns involving quantifiers can match a lot of characters with very few metacharacters. We need a way to replace the text that the regex matched without changing other text before or after it. We could do this manually using the String method substring(). However, because it’s such a common requirement, the Java Regular Expression API provides some substitution methods. In all these methods, you pass in the replacement text or “righthand side” of the substitution (this term is historical: in a command-line text editor’s substitute command, the lefthand side is the pattern and the righthand side is the replacement text). The replacement methods are:
replaceAll(newString)
- Replaces all occurrences that matched with the new string.
appendReplacement(StringBuffer, newString)
- Copies up to before the first match, plus the given newString.
appendTail(StringBuffer)
- Appends text after the last match (normally used after appendReplacement).
Example 4-3 shows use of these three methods.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Quick demo of RE substitution: correct U.S. 'favor'
 * to Canadian/British 'favour', but not in "favorite"
 * @author Ian F. Darwin, http://www.darwinsys.com/
 */
public class ReplaceDemo {
    public static void main(String[] argv) {

        // Make an RE pattern to match as a word only (\b=word boundary)
        String patt = "\\bfavor\\b";

        // A test input.
        String input = "Do me a favor? Fetch my favorite.";
        System.out.println("Input: " + input);

        // Run it from a RE instance and see that it works
        Pattern r = Pattern.compile(patt);
        Matcher m = r.matcher(input);
        System.out.println("ReplaceAll: " + m.replaceAll("favour"));

        // Show the appendReplacement method
        m.reset();
        StringBuffer sb = new StringBuffer();
        System.out.print("Append methods: ");
        while (m.find()) {
            // Copy to before first match,
            // plus the replacement word "favour"
            m.appendReplacement(sb, "favour");
        }
        m.appendTail(sb);        // copy remainder
        System.out.println(sb.toString());
    }
}
Sure enough, when you run it, it does what we expect:
Input: Do me a favor? Fetch my favorite.
ReplaceAll: Do me a favour? Fetch my favorite.
Append methods: Do me a favour? Fetch my favorite.
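Two related conveniences round this out: replaceFirst(), which appears in the API listing for both Matcher and String, and the $n notation, which lets the replacement text refer back to a capture group. The sketch below is only an illustration; the class ReplaceVariants and its two helper methods are invented names, and the name-swapping regex is borrowed from the two-field example in the previous recipe.

```java
import java.util.regex.Pattern;

public class ReplaceVariants {
    static final Pattern FAVOR = Pattern.compile("\\bfavor\\b");

    /** Replaces only the first whole-word "favor". */
    static String fixFirst(String input) {
        return FAVOR.matcher(input).replaceFirst("favour");
    }

    /** Turns "Family, Given" into "Given Family" using group references. */
    static String swapNames(String line) {
        // $2 and $1 in the replacement stand for the text of groups 2 and 1
        return line.replaceAll("(.*), (.*)", "$2 $1");
    }

    public static void main(String[] args) {
        System.out.println(fixFirst("Do me a favor? One more favor."));
        System.out.println(swapNames("Adams, John Quincy"));
    }
}
```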
Printing All Occurrences of a Pattern
Problem
You need to find all the strings that match a given regex in one or more files or other sources.
Solution
This example reads through a file one line at a time. Whenever a match is found, I extract it from the line and print it.
This code takes the group() methods from Finding the Matching Text, the substring() method from java.lang.String, and the find() method from Matcher, and simply puts them all together. I coded it to extract all the “names” from a given file; in running the program through itself, it prints the words import, java, util, regex, and so on, each on its own line:
C:\>javac -d . ReaderIter.java
C:\>java regex.ReaderIter ReaderIter.java
import
java
util
regex
import
java
io
Print
all
the
strings
that
match
given
pattern
from
file
public
...
C:\>
I interrupted it here to save paper. This can be written two ways: a traditional “line at a time” pattern shown in Example 4-4 and a more compact form using “new I/O” shown in Example 4-5 (the “new I/O” package is described in Chapter 10).
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReaderIter {
    public static void main(String[] args) throws IOException {
        // The RE pattern
        Pattern patt = Pattern.compile("[A-Za-z][a-z]+");
        // A FileReader (see the I/O chapter)
        BufferedReader r = new BufferedReader(new FileReader(args[0]));

        // For each line of input, try matching in it.
        String line;
        while ((line = r.readLine()) != null) {
            // For each match in the line, extract and print it.
            Matcher m = patt.matcher(line);
            while (m.find()) {
                // Simplest method:
                // System.out.println(m.group(0));

                // Get the starting position of the text
                int start = m.start(0);
                // Get ending position
                int end = m.end(0);
                // Print whatever matched.
                // Use String.substring(start, end);
                System.out.println(line.substring(start, end));
            }
        }
    }
}
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GrepNIO {
    public static void main(String[] args) throws IOException {

        if (args.length < 2) {
            System.err.println("Usage: GrepNIO patt file [...]");
            System.exit(1);
        }

        Pattern p = Pattern.compile(args[0]);
        for (int i = 1; i < args.length; i++)
            process(p, args[i]);
    }

    static void process(Pattern pattern, String fileName) throws IOException {

        // Get a FileChannel from the given file.
        FileChannel fc = new FileInputStream(fileName).getChannel();

        // Map the file's content
        ByteBuffer buf = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());

        // Decode ByteBuffer into CharBuffer
        CharBuffer cbuf =
            Charset.forName("ISO-8859-1").newDecoder().decode(buf);

        Matcher m = pattern.matcher(cbuf);
        while (m.find()) {
            System.out.println(m.group(0));
        }
    }
}
The NIO version shown in Example 4-5 relies on the fact that an NIO CharBuffer can be used as a CharSequence. This program is more general in that the pattern argument is taken from the command-line argument. It prints the same output as the previous example if invoked with the pattern argument from the previous program on the command line:
java regex.GrepNIO "[A-Za-z][a-z]+" ReaderIter.java
You might think of using \w+ as the pattern; the only difference is that my pattern looks for well-formed capitalized words, whereas \w+ would include Java-centric oddities like theVariableName, which have capitals in nonstandard positions.
Also note that the NIO version will probably be more efficient because it doesn’t reset the Matcher to a new input source on each line of input as ReaderIter does.
Printing Lines Containing a Pattern
Solution
Write a simple grep-like program.
Discussion
As I’ve mentioned, once you have a regex package, you can write a grep-like program. I gave an example of the Unix grep program earlier. grep is called with some optional arguments, followed by one required regular expression pattern, followed by an arbitrary number of filenames. It prints any line that contains the pattern, differing from Printing All Occurrences of a Pattern, which prints only the matching text itself. For example:
grep "[dD]arwin" *.txt
The preceding code searches for lines containing either darwin or Darwin in every file whose name ends in .txt.[19] Example 4-6 is the source for the first version of a program to do this, called Grep0. It reads lines from the standard input and doesn’t take any optional arguments, but it handles the full set of regular expressions that the Pattern class implements (it is, therefore, not identical to the Unix programs of the same name). We haven’t covered the java.io package for input and output yet (see Chapter 10), but our use of it here is simple enough that you can probably intuit it. The online source includes Grep1, which does the same thing but is better structured (and therefore longer). Later in this chapter, Program: Full Grep presents a JGrep program that uses my GetOpt (see Parsing Command-Line Arguments) to parse command-line options.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Grep0 {
    public static void main(String[] args) throws IOException {
        BufferedReader is =
            new BufferedReader(new InputStreamReader(System.in));
        if (args.length != 1) {
            System.err.println("Usage: Grep0 pattern");
            System.exit(1);
        }
        Pattern patt = Pattern.compile(args[0]);
        // Create one Matcher and reset it for each input line
        Matcher matcher = patt.matcher("");
        String line = null;
        while ((line = is.readLine()) != null) {
            matcher.reset(line);
            // find() succeeds if the pattern occurs anywhere in the line
            if (matcher.find()) {
                System.out.println("MATCH: " + line);
            }
        }
    }
}
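Because Grep0 reads System.in directly, it is awkward to try out from a test. As a minimal sketch, the same loop can be put in a method that accepts any Reader; the class name GrepLines and the method matchingLines are hypothetical names chosen for this illustration, not part of the online source:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GrepLines {
    /** Return the lines read from in that contain a match for regex. */
    static List<String> matchingLines(String regex, Reader in) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher("");
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Reuse the one Matcher; find() looks anywhere in the line
                if (matcher.reset(line).find()) {
                    result.add(line);
                }
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers need no throws clause
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Charles Darwin\nAlfred Wallace\nthe darwin awards\n";
        for (String line : matchingLines("[dD]arwin", new StringReader(text))) {
            System.out.println("MATCH: " + line);
        }
    }
}
```

With the sample text, the first and third lines are printed and the second is skipped.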
Controlling Case in Regular Expressions
Solution
Compile the Pattern
passing in the flags
argument Pattern.CASE_INSENSITIVE
to indicate that matching should be case-independent (“fold” or ignore differences in case). If your code might run in different locales (see Chapter 15) then you should add Pattern.UNICODE_CASE
. Without these flags, the default is normal, case-sensitive matching. These flags (and others) are passed to the Pattern.compile() method, as in:
// CaseMatch.java
Pattern reCaseInsens = Pattern.compile(pattern,
    Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
reCaseInsens.matcher(input).matches(); // will match case-insensitively
The flags must be passed when you create the Pattern; because Pattern objects are immutable, they cannot be changed once constructed.
The full source code for this example is online as CaseMatch.java.
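The effect of the two flags can be checked with a small sketch such as the following. The class name CaseFlags and its helper are invented for this illustration; the sample word is written with Unicode escapes so the result does not depend on the source file's encoding:

```java
import java.util.regex.Pattern;

class CaseFlags {
    /** True if the whole input matches regex under the given flags. */
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String regex = "\u00e9cole";   // lowercase, starts with e-acute
        String input = "\u00c9COLE";   // uppercase, starts with E-acute

        // No flags: case-sensitive, prints false
        System.out.println(matches(regex, 0, input));
        // CASE_INSENSITIVE alone folds only US-ASCII letters, prints false
        System.out.println(matches(regex, Pattern.CASE_INSENSITIVE, input));
        // Adding UNICODE_CASE folds the accented letter too, prints true
        System.out.println(matches(regex,
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, input));
        // Pure ASCII text needs only CASE_INSENSITIVE, prints true
        System.out.println(matches("angie", Pattern.CASE_INSENSITIVE, "ANGIE"));
    }
}
```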
Matching “Accented” or Composite Characters
Solution
Compile the Pattern
with the flags
argument Pattern.CANON_EQ
for “canonical equality.”
Discussion
Composite characters can be entered in various forms. Consider, as a single example, the letter e
with an acute accent. This character may be found in various forms in Unicode text, such as the single character é
(Unicode character \u00e9
) or as the two-character sequence e´
(e followed by the Unicode combining acute accent, \u0301
). To allow you to match such characters regardless of which of possibly multiple “fully decomposed” forms are used to enter them, the regex package has an option for “canonical matching,” which treats any of the forms as equivalent. This option is enabled by passing CANON_EQ
as (one of) the flags in the second argument to Pattern.compile()
. This program shows CANON_EQ
being used to match several forms:
import java.util.regex.Pattern;

public class CanonEqDemo {
    public static void main(String[] args) {
        String pattStr = "\u00e9gal"; // egal
        String[] input = {
            "\u00e9gal",  // egal - this one had better match :-)
            "e\u0301gal", // e + "Combining acute accent"
            "e\u02cagal", // e + "modifier letter acute accent"
            "e'gal",      // e + single quote
            "e\u00b4gal", // e + Latin-1 "acute"
        };
        Pattern pattern = Pattern.compile(pattStr, Pattern.CANON_EQ);
        for (int i = 0; i < input.length; i++) {
            if (pattern.matcher(input[i]).matches()) {
                System.out.println(pattStr + " matches input " + input[i]);
            } else {
                System.out.println(pattStr + " does not match input " + input[i]);
            }
        }
    }
}
This program correctly matches the “combining accent” and rejects the other characters, some of which, unfortunately, look like the accent on a printer, but are not considered “combining accent” characters:
égal matches input égal
égal matches input e?gal
égal does not match input e?gal
égal does not match input e'gal
égal does not match input e´gal
For more details, see the character charts.
Matching Newlines in Text
Solution
Use \n
or \r
.
See also the flags constant Pattern.MULTILINE
, which makes newlines match as beginning-of-line and end-of-line (^ and $).
Discussion
Though line-oriented tools from Unix such as sed and grep match regular expressions one line at a time, not all tools do. The sam text editor from Bell Laboratories was the first interactive tool I know of to allow multiline regular expressions; the Perl scripting language followed shortly after. In the Java API, the newline character by default has no special significance. The BufferedReader
method readLine()
normally strips out whichever newline characters it finds. If you read in gobs of characters using some method other than readLine()
, you may have some number of \n
, \r
, or \r\n
sequences in your text string.[20] Normally all of these are treated as equivalent to \n
. If you want only \n
to match, use the UNIX_LINES
flag to the Pattern.compile()
method.
In Unix, ^
and $
are commonly used to match the beginning or end of a line, respectively. In this API, the regex metacharacters ^ and $ ignore line terminators and only match at the beginning and the end, respectively, of the entire string. However, if you pass the MULTILINE flag into Pattern.compile(), these expressions match just after or just before, respectively, a line terminator; $ also matches the very end of the string. Apart from that, the line ending is just an ordinary character: \n or \r in the pattern matches it directly, as do classes such as \s. The one exception is the wildcard . which does not match a line terminator unless you also pass the DOTALL flag. See the sidebar Pattern.compile() Flags. An example of newline matching is shown in Example 4-7.
import java.util.regex.Pattern;

public class NLMatch {
    public static void main(String[] argv) {

        String input = "I dream of engines\nmore engines, all day long";
        System.out.println("INPUT: " + input);
        System.out.println();

        String[] patt = {
            "engines.more engines",
            "ines\nmore",
            "engines$"
        };

        for (int i = 0; i < patt.length; i++) {
            System.out.println("PATTERN " + patt[i]);

            boolean found;
            // First with the default flags
            Pattern p1l = Pattern.compile(patt[i]);
            found = p1l.matcher(input).find();
            System.out.println("DEFAULT match " + found);

            // Then with DOTALL and MULTILINE both set
            Pattern pml = Pattern.compile(patt[i],
                Pattern.DOTALL | Pattern.MULTILINE);
            found = pml.matcher(input).find();
            System.out.println("MultiLine match " + found);
            System.out.println();
        }
    }
}
If you run this code, the first pattern (with the wildcard character .) matches only when DOTALL is set, because by default . does not match a line terminator; the second pattern (with a literal \n) always matches; and the third pattern (with $) matches only when MULTILINE is set:
> java regex.NLMatch
INPUT: I dream of engines
more engines, all day long

PATTERN engines.more engines
DEFAULT match false
MultiLine match true

PATTERN ines
more
DEFAULT match true
MultiLine match true

PATTERN engines$
DEFAULT match false
MultiLine match true
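The two flags are independent of each other, which is easier to see when each is tried on its own. The following is a minimal sketch; the class name LineFlags and the helper count are made up for this illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineFlags {
    /** Count how many times regex is found in input under the given flags. */
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "one\ntwo\nthree";
        // Default: ^ matches only at the start of the input, prints 1
        System.out.println(count("^\\w+", 0, input));
        // MULTILINE: ^ also matches after each line terminator, prints 3
        System.out.println(count("^\\w+", Pattern.MULTILINE, input));
        // Default: . does not match the newline, prints 0
        System.out.println(count("one.two", 0, input));
        // DOTALL: . matches the newline as well, prints 1
        System.out.println(count("one.two", Pattern.DOTALL, input));
    }
}
```

MULTILINE changes only where ^ and $ may match; DOTALL changes only what . may match.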
Program: Apache Logfile Parsing
The Apache web server is the world’s leading web server and has been for most of the Web’s history. It is one of the world’s best-known open source projects, and the first of many fostered by the Apache Software Foundation. The name Apache is often claimed to be a pun on the origins of the server; its developers began with the free NCSA server and kept hacking at it or “patching” it until it did what they wanted. When it was sufficiently different from the original, a new name was needed. Because it was now “a patchy server,” the name Apache was chosen. Officialdom denies the story, but it’s cute anyway. One place actual patchiness does show through is in the logfile format. Consider Example 4-8.
123.45.67.89 - - [27/Oct/2000:09:27:09 -0400] "GET /java/javaResources.html HTTP/1.0" 200 10450 "-" "Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)"
The file format was obviously designed for human inspection but not for easy parsing. The problem is that different delimiters are used: square brackets for the date, quotes for the request line, and spaces sprinkled all through. Consider trying to use a StringTokenizer
; you might be able to get it working, but you’d spend a lot of time fiddling with it. However, this somewhat contorted regular expression[21] makes it easy to parse:
^([\d.]+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\d{3}) (\d+) "([^"]+)" "([^"]+)"
You may find it informative to refer back to Table 4-1 and review the full syntax used here. Note in particular the use of the nongreedy quantifier +? in "(.+?)" to match a quoted string; you can’t just use .+ because that would match too much (up to the quote at the end of the line). Code to extract the various fields such as IP address, request, referrer URL, and browser version is shown in Example 4-9.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogRegExp {

    public static void main(String[] argv) {

        String logEntryPattern =
            "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+-]\\d{4})\\] " +
            "\"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"";

        System.out.println("RE Pattern:");
        System.out.println(logEntryPattern);

        System.out.println("Input line is:");
        // LogExample (in the online source) supplies the sample input
        String logEntryLine = LogExample.logEntryLine;
        System.out.println(logEntryLine);

        Pattern p = Pattern.compile(logEntryPattern);
        Matcher matcher = p.matcher(logEntryLine);
        if (!matcher.matches() ||
            LogExample.NUM_FIELDS != matcher.groupCount()) {
            System.err.println("Bad log entry (or problem with regex):");
            System.err.println(logEntryLine);
            return;
        }
        System.out.println("IP Address: " + matcher.group(1));
        System.out.println("UserName: " + matcher.group(3));
        System.out.println("Date/Time: " + matcher.group(4));
        System.out.println("Request: " + matcher.group(5));
        System.out.println("Response: " + matcher.group(6));
        System.out.println("Bytes Sent: " + matcher.group(7));
        if (!matcher.group(8).equals("-"))
            System.out.println("Referer: " + matcher.group(8));
        System.out.println("User-Agent: " + matcher.group(9));
    }
}
LogExample is an interface that just defines the input string (and, as used here, the expected number of fields); it was used in a demonstration to compare the regular expression mode with the use of a StringTokenizer. The source for both versions is in the online source for this chapter. Running the program against the sample input from Example 4-8 gives this output:
Using regex Pattern:
^([\d.]+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\d{3}) (\d+) "([^"]+)" "([^"]+)"
Input line is:
123.45.67.89 - - [27/Oct/2000:09:27:09 -0400] "GET /java/javaResources.html HTTP/1.0" 200 10450 "-" "Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)"
IP Address: 123.45.67.89
Date&Time: 27/Oct/2000:09:27:09 -0400
Request: GET /java/javaResources.html HTTP/1.0
Response: 200
Bytes Sent: 10450
Browser: Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)
The program successfully parsed the entire logfile format with one call to matcher.matches()
.
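The role of the nongreedy quantifier in the log pattern can be seen in isolation with a short sketch. The class name GreedyVsReluctant, the helper firstGroup, and the shortened log fragment are all invented for this illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    /** Return capture group 1 of the first match of regex in input, or null. */
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line =
            "\"GET /index.html HTTP/1.0\" 200 10450 \"-\" \"Mozilla/4.6\"";
        // Reluctant +? stops at the first closing quote:
        // prints GET /index.html HTTP/1.0
        System.out.println(firstGroup("\"(.+?)\"", line));
        // Greedy + runs on to the last quote on the line:
        // prints GET /index.html HTTP/1.0" 200 10450 "-" "Mozilla/4.6
        System.out.println(firstGroup("\"(.+)\"", line));
    }
}
```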
Program: Data Mining
Suppose that I, as a published author, want to track how my book is selling in comparison to others. I can obtain this information for free just by clicking the page for my book on any of the major bookseller sites, reading the sales rank number off the screen, and typing the number into a file—but that’s too tedious. As I wrote in the book that this example looks for, “computers get paid to extract relevant information from files; people should not have to do such mundane tasks.” This program uses the Regular Expressions API and, in particular, newline matching to extract a value from an HTML page on the hypothetical QuickBookShops.web website. It also reads from a URL object (see REST Web Service Client). The pattern to look for is something like this (bear in mind that the HTML may change at any time, so I want to keep the pattern fairly general):
<b>QuickBookShop.web Sales Rank: </b> 26,252 </font><br>
Because the pattern may extend over more than one line, I read the entire web page from the URL into a single long string using my FileIO.readerToString()
method (see Reading a File into a String) instead of the more traditional line-at-a-time paradigm. I then plot a graph using an external program (see Running an External Program from Java); this could (and should) be changed to use a Java graphics program (see Program: Grapher for some leads). The complete program is shown in Example 4-10.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.URL;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BookRank {
    public final static String DATA_FILE = "book.sales";
    public final static String GRAPH_FILE = "book.png";
    public final static String PLOTTER_PROG = "/usr/local/bin/gnuplot";

    final static String isbn = "0596007019";
    final static String title = "Java Cookbook";

    /** Grab the sales rank off the web page and log it. */
    public static void main(String[] args) throws Exception {

        Properties p = new Properties();
        // Use the first argument as the properties file, if one is given
        p.load(new FileInputStream(
            args.length == 0 ? "bookrank.properties" : args[0]));
        String title = p.getProperty("title", "NO TITLE IN PROPERTIES");
        // The url must have the "isbn=" at the very end, or otherwise
        // be amenable to being string-catted to, like the default.
        String url = p.getProperty("url", "http://test.ing/test.cgi?isbn=");
        // The 10-digit ISBN for the book.
        String isbn = p.getProperty("isbn", "0000000000");
        // The RE pattern (MUST have ONE capture group for the number)
        String pattern = p.getProperty("pattern", "Rank: (\\d+)");

        int rank = getBookRank(isbn);

        System.out.println("Rank is " + rank);

        // Now try to draw the graph, using external
        // plotting program against all historical data.
        // Could use gnuplot, R, any other math/graph program.
        // Better yet: use one of the Java plotting APIs.

        PrintWriter pw = new PrintWriter(new FileWriter(DATA_FILE, true));
        // HH (24-hour clock) so the field agrees with gnuplot's %H below
        String date = new SimpleDateFormat("MM dd HH mm ss yyyy ").format(
            new Date());
        pw.println(date + " " + rank);
        pw.close();

        String gnuplot_cmd =
            "set term png\n" +
            "set output \"" + GRAPH_FILE + "\"\n" +
            "set xdata time\n" +
            "set ylabel \"Book sales rank\"\n" +
            "set bmargin 3\n" +
            "set logscale y\n" +
            "set yrange [1:60000] reverse\n" +
            "set timefmt \"%m %d %H %M %S %Y\"\n" +
            "plot \"" + DATA_FILE +
                "\" using 1:7 title \"" + title + "\" with lines\n";

        if (!new File(PLOTTER_PROG).exists()) {
            System.out.println("Plotting software not installed");
            return;
        }
        Process proc = Runtime.getRuntime().exec(PLOTTER_PROG);
        PrintWriter gp = new PrintWriter(proc.getOutputStream());
        gp.print(gnuplot_cmd);
        gp.close();
    }

    /**
     * Look for something like this in the HTML input:
     *     <b>Sales Rank:</b>
     *     #26,252
     *     </font><br>
     * @throws IOException
     */
    public static int getBookRank(String isbn) throws IOException {

        // The RE pattern - digits and commas allowed
        final String pattern = "Rank:</b> #([\\d,]+)";
        final Pattern r = Pattern.compile(pattern);

        // The url -- must have the "isbn=" at the very end, or otherwise
        // be amenable to being appended to.
        final String url = "http://www.amazon.com/exec/obidos/ASIN/" + isbn;

        // Open the URL and get a Reader from it.
        final BufferedReader is = new BufferedReader(new InputStreamReader(
            new URL(url).openStream()));

        // Read the URL looking for the rank information, as
        // a single long string, so can match RE across multi-lines.
        final String input = readerToString(is);

        // If found, append to sales data file.
        Matcher m = r.matcher(input);
        if (m.find()) {
            // Paren 1 is the digits (and maybe ','s) that matched; remove comma
            return Integer.parseInt(m.group(1).replace(",", ""));
        } else {
            throw new RuntimeException(
                "Pattern not matched in `" + url + "'!");
        }
    }

    private static String readerToString(BufferedReader is) throws IOException {
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = is.readLine()) != null) {
            sb.append(line);
        }
        return sb.toString();
    }
}
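The matching step can be tried without a network connection by feeding it a string. The sketch below is an illustration only: the class name RankExtract, the method extractRank, and the slightly more tolerant pattern (with \s* around the closing tag and an optional #) are assumptions, not the pattern the listing uses:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RankExtract {
    // Assumed pattern: \s* also matches newlines, so the number may sit
    // on a different line from the "Rank:" label.
    private static final Pattern RANK =
        Pattern.compile("Rank:\\s*</b>\\s*#?([\\d,]+)");

    /** Return the rank found in html, or -1 if the pattern does not occur. */
    static int extractRank(String html) {
        Matcher m = RANK.matcher(html);
        if (!m.find()) {
            return -1;
        }
        // Group 1 is the digits, possibly with commas; strip the commas
        return Integer.parseInt(m.group(1).replace(",", ""));
    }

    public static void main(String[] args) {
        String html = "<b>Sales Rank:</b>\n  #26,252\n</font><br>";
        System.out.println(extractRank(html)); // prints 26252
    }
}
```

Because \s matches line terminators, no special flags are needed for the match to span lines once the whole page is in one string.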
Program: Full Grep
Now that we’ve seen how the regular expressions package works, it’s time to write JGrep
, a full-blown version of the line-matching program with option parsing. Table 4-2 lists some typical command-line options that a Unix implementation of grep might include.
Option | Meaning |
-c | Count only: don’t print lines, just count them |
-C | Context; print some lines above and below each line that matches (not implemented in this version; left as an exercise for the reader) |
-f pattern | Take pattern from file named after -f |
-h | Suppress printing filename ahead of lines |
-i | Ignore case |
-l | List filenames only: don’t print lines, just the names they’re found in |
-n | Print line numbers before matching lines |
-s | Suppress printing certain error messages |
-v | Invert: print only lines that do NOT match the pattern |
We discussed the GetOpt
class in Parsing Command-Line Arguments. Here we use it to control the operation of an application program. As usual, because main()
runs in a static context but our application main line does not, we could wind up passing a lot of information into the constructor.
To save space, this version just uses global variables to track the settings from the command line.
Unlike the Unix grep
tool, this one does not yet handle “combined options,” so -l -r -i
is OK,
but -lri
will fail, due to a limitation in the GetOpt
parser used.
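One possible workaround for the combined-options limitation is to expand clustered flags before handing the arguments to the parser. The following is a sketch under stated assumptions: the class OptionExpander and its expand method are hypothetical and are not part of GetOpt, and the caller must list which option letters take no argument:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OptionExpander {
    /**
     * Split clustered flags such as "-lri" into "-l", "-r", "-i".
     * Only arguments made up entirely of letters in noArgFlags are split;
     * everything else is passed through unchanged.
     */
    static String[] expand(String[] args, String noArgFlags) {
        List<String> out = new ArrayList<>();
        for (String arg : args) {
            if (isCluster(arg, noArgFlags)) {
                for (char c : arg.substring(1).toCharArray()) {
                    out.add("-" + c);
                }
            } else {
                out.add(arg);
            }
        }
        return out.toArray(new String[0]);
    }

    /** True if arg is "-" followed by two or more no-argument flag letters. */
    private static boolean isCluster(String arg, String noArgFlags) {
        if (arg.length() < 3 || arg.charAt(0) != '-') {
            return false;
        }
        for (char c : arg.substring(1).toCharArray()) {
            if (noArgFlags.indexOf(c) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] expanded = expand(
            new String[] {"-lri", "darwin", "notes.txt"}, "chilnrRsv");
        // prints [-l, -r, -i, darwin, notes.txt]
        System.out.println(Arrays.toString(expanded));
    }
}
```

This simple version looks at every argument, so a search pattern or filename that happens to look like a cluster (for example, a pattern of -cv) would be split as well; a more careful version would stop at the first non-option argument.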
The program basically just reads lines, matches the pattern in them, and, if a match is found (or not found, with -v
), prints the line (and optionally some other stuff, too). Having said all that, the code is shown in Example 4-11.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;
// GetOpt is the class from Parsing Command-Line Arguments; import it as needed.

/** A command-line grep-like program. Accepts some command-line options,
 * and takes a pattern and a list of text files.
 * N.B. The current implementation of GetOpt does not allow combining short
 * arguments, so put spaces e.g., "JGrep -l -r -i pattern file..." is OK, but
 * "JGrep -lri pattern file..." will fail. Getopt will hopefully be fixed soon.
 */
public class JGrep {
    private static final String USAGE =
        "Usage: JGrep pattern [-chilrsnv][-f pattfile][filename...]";
    /** The pattern we're looking for */
    protected Pattern pattern;
    /** The matcher for this pattern */
    protected Matcher matcher;
    private boolean debug;
    /** Are we to only count lines, instead of printing? */
    protected static boolean countOnly = false;
    /** Are we to ignore case? */
    protected static boolean ignoreCase = false;
    /** Are we to suppress printing of filenames? */
    protected static boolean dontPrintFileName = false;
    /** Are we to only list names of files that match? */
    protected static boolean listOnly = false;
    /** are we to print line numbers? */
    protected static boolean numbered = false;
    /** Are we to be silent about errors? */
    protected static boolean silent = false;
    /** are we to print only lines that DONT match? */
    protected static boolean inVert = false;
    /** Are we to process arguments recursively if directories? */
    protected static boolean recursive = false;

    /** Construct a Grep object for the pattern, and run it
     * on all input files listed in argv.
     * Be aware that a few of the command-line options are not
     * acted upon in this version - left as an exercise for the reader!
     */
    public static void main(String[] argv) {

        if (argv.length < 1) {
            System.err.println(USAGE);
            System.exit(1);
        }
        String patt = null;

        GetOpt go = new GetOpt("cf:hilnrRsv");

        char c;
        while ((c = go.getopt(argv)) != 0) {
            switch (c) {
            case 'c':
                countOnly = true;
                break;
            case 'f': /* External file contains the pattern */
                try (BufferedReader b =
                        new BufferedReader(new FileReader(go.optarg()))) {
                    patt = b.readLine();
                } catch (IOException e) {
                    System.err.println(
                        "Can't read pattern file " + go.optarg());
                    System.exit(1);
                }
                break;
            case 'h':
                dontPrintFileName = true;
                break;
            case 'i':
                ignoreCase = true;
                break;
            case 'l':
                listOnly = true;
                break;
            case 'n':
                numbered = true;
                break;
            case 'r':
            case 'R':
                recursive = true;
                break;
            case 's':
                silent = true;
                break;
            case 'v':
                inVert = true;
                break;
            case '?':
                System.err.println("Getopts was not happy!");
                System.err.println(USAGE);
                break;
            }
        }

        int ix = go.getOptInd();

        if (patt == null)
            patt = argv[ix++];

        JGrep prog = null;
        try {
            prog = new JGrep(patt);
        } catch (PatternSyntaxException ex) {
            System.err.println("RE Syntax error in " + patt);
            return;
        }

        if (argv.length == ix) {
            dontPrintFileName = true; // Don't print filenames if stdin
            if (recursive) {
                System.err.println("Warning: recursive search of stdin!");
            }
            prog.process(new InputStreamReader(System.in), null);
        } else {
            if (!dontPrintFileName)
                dontPrintFileName = ix == argv.length - 1; // Nor if only one file.
            if (recursive)
                dontPrintFileName = false;                 // unless a directory!

            for (int i = ix; i < argv.length; i++) { // note starting index
                try {
                    prog.process(new File(argv[i]));
                } catch (Exception e) {
                    System.err.println(e);
                }
            }
        }
    }

    /** Construct a JGrep object.
     * @param patt The pattern to look for
     */
    public JGrep(String patt) throws PatternSyntaxException {
        if (debug) {
            System.err.printf("JGrep.JGrep(%s)%n", patt);
        }
        // compile the regular expression
        int caseMode = ignoreCase ?
            Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE :
            0;
        pattern = Pattern.compile(patt, caseMode);
        matcher = pattern.matcher("");
    }

    /** Process one command line argument (file or directory)
     * @throws FileNotFoundException
     */
    public void process(File file) throws FileNotFoundException {
        if (!file.exists() || !file.canRead()) {
            System.err.println(
                "ERROR: can't read file " + file.getAbsolutePath());
            return;
        }
        if (file.isFile()) {
            process(new BufferedReader(new FileReader(file)),
                file.getAbsolutePath());
            return;
        }
        if (file.isDirectory()) {
            if (!recursive) {
                System.err.println(
                    "ERROR: -r not specified but directory given " +
                    file.getAbsolutePath());
                return;
            }
            for (File nf : file.listFiles()) {
                process(nf); // "Recursion, n.: See Recursion."
            }
            return;
        }
        System.err.println(
            "WEIRDNESS: neither file nor directory: " + file.getAbsolutePath());
    }

    /** Do the work of scanning one file
     * @param ifile Reader Reader object already open
     * @param fileName String Name of the input file
     */
    public void process(Reader ifile, String fileName) {

        String inputLine;
        int matches = 0;

        try (BufferedReader reader = new BufferedReader(ifile)) {

            while ((inputLine = reader.readLine()) != null) {
                matcher.reset(inputLine);
                if (matcher.find()) {
                    if (listOnly) {
                        // -l, print filename on first match, and we're done
                        System.out.println(fileName);
                        return;
                    }
                    if (countOnly) {
                        matches++;
                    } else {
                        if (!dontPrintFileName) {
                            System.out.print(fileName + ": ");
                        }
                        System.out.println(inputLine);
                    }
                } else if (inVert) {
                    System.out.println(inputLine);
                }
            }
            if (countOnly)
                System.out.println(matches + " matches in " + fileName);
        } catch (IOException e) {
            System.err.println(e);
        }
    }
}
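The selection rule described in the text (print a line if it matches, or if it does not match when -v is given) can be written as a single comparison. This is a minimal sketch; the class InvertSelect and the method select are hypothetical names and are not part of JGrep:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InvertSelect {
    /** Return lines that match regex or, if invert is true, lines that do not. */
    static List<String> select(List<String> lines, String regex, boolean invert) {
        Matcher matcher = Pattern.compile(regex).matcher("");
        List<String> selected = new ArrayList<>();
        for (String line : lines) {
            boolean found = matcher.reset(line).find();
            // A line is selected exactly when found and invert disagree
            if (found != invert) {
                selected.add(line);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("alpha", "beta", "gamma");
        System.out.println(select(lines, "mm", false)); // prints [gamma]
        System.out.println(select(lines, "mm", true));  // prints [alpha, beta]
    }
}
```

The same comparison could also drive the -c count, so that counting and inverting combine the way they do in the Unix tool.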
[17] Non-Unix fans fear not, for you can use tools like grep
on Windows systems using one of several packages. One is an open source package alternately called
CygWin (after Cygnus Software) or GnuWin32. Another is Microsoft’s findstr
command for Windows. Or you can use my Grep
program in Printing Lines Containing a Pattern if you don’t have grep on your system. Incidentally, the name grep comes from an ancient Unix line editor command g/RE/p
, the command to find the regex globally in all lines in the edit buffer and print the lines that match—just what
the grep
program does to lines in files.
[18] REDemo
was inspired by (but does not use any code from) a similar program provided with the now-retired Apache Jakarta Regular Expressions package.
[19] On Unix, the shell or command-line interpreter expands *.txt to all the matching filenames before running the program, but the normal Java interpreter does this for you on systems where the shell isn’t energetic or bright enough to do it.
[20] Or a few related Unicode characters, including the next-line (\u0085
), line-separator (\u2028
), and paragraph-separator (\u2029
) characters.
[21] You might think this would hold some kind of world record for complexity in regex competitions, but I’m sure it’s been outdone many times.