Regular Expression Syntax


You need to learn the syntax of regular expressions.


Consult Table 4-2 for a list of the regular expression characters that the Apache Regular Expression API matches.

Table 4-2. Regular expression syntax

Subexpression   Will match:                                             Notes

a               The letter a (and similarly for any other Unicode
                character not listed in this table)
^               Start of line/string
$               End of line/string
.               Any one character
[...]           “Character class”; any one character from those listed
[^...]          Any one character not from those listed

Normal (greedy) multipliers (“greedy closures”)
{m,n}           Multiplier (closure) for from m to n repetitions
{m,}            Multiplier for from m repetitions on up
{,n}            Multiplier for 0 up to n repetitions
*               Multiplier for 0 or more repetitions                    Short for {0,}
+               Multiplier for 1 or more repetitions                    Short for {1,}
?               Multiplier for 0 or 1 repetitions                       Short for {0,1}

Reluctant (non-greedy) multipliers (“reluctant closures”)
*?              Reluctant multiplier: 0 or more
+?              Reluctant multiplier: 1 or more
??              Reluctant multiplier: 0 or 1 times

Alternation and grouping
|               Alternation
( )             Grouping

Escapes and shorthands
\               Escape character: turns metacharacters off, and turns
                following alphabetics (t, w, d, and s) into
                metacharacters
\t              Tab character
\w              Character in a word                                     Use \w+ for a word
\d              Numeric digit                                           Use \d+ for a number
\s              Space, tab, etc., as determined by
                java.lang.Character.isWhitespace( )
\W, \D, \S      Inverse of above (\W is a non-word character, etc.)

POSIX-style character classes
[:alnum:]       Alphanumeric characters
[:alpha:]       Alphabetic characters
[:blank:]       Space and tab characters
[:space:]       Space characters
[:cntrl:]       Control characters
[:digit:]       Numeric digit characters
[:graph:]       Printable and visible characters (not spaces)
[:print:]       Printable characters
[:punct:]       Punctuation characters
[:lower:]       Lowercase characters
[:upper:]       Uppercase characters
[:xdigit:]      Hexadecimal digit characters
[:javastart:]   Start of a Java language identifier                     Not in POSIX
[:javapart:]    Part of a Java identifier                               Not in POSIX
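
A few of the constructs in the table (character classes, counted multipliers, alternation, and the shorthands) can be tried out with a short program. This is only a sketch: it assumes the JDK's own java.util.regex package as a stand-in for the Apache package, since these particular constructs are written the same way there, and the class name TableSyntaxDemo and helper matchesAll are made up for the illustration. Note that each backslash in a pattern must be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

public class TableSyntaxDemo {

    /** Hypothetical helper: true only if the whole input matches the regex. */
    static boolean matchesAll(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Character class followed by a greedy multiplier
        System.out.println(matchesAll("[A-Z][a-z]+", "Java"));
        // Counted multipliers with the \d shorthand
        System.out.println(matchesAll("\\d{3}-\\d{4}", "555-1234"));
        System.out.println(matchesAll("\\d{3}-\\d{4}", "55-1234"));
        // Negated character class: one or more non-digits
        System.out.println(matchesAll("[^0-9]+", "abc"));
        // Alternation inside a group
        System.out.println(matchesAll("gr(a|e)y", "grey"));
        // Two words separated by one whitespace character
        System.out.println(matchesAll("\\w+\\s\\w+", "hello world"));
    }
}
```

The POSIX-style classes are left out of this sketch because java.util.regex spells them differently (for example, \p{Alnum}).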

These pattern characters can be used in any combination that makes sense. For example, a+ means any number of occurrences of the letter a, from one up to a million or a gazillion. The pattern Mrs?\. matches Mr. or Mrs. (with the period). And .* means “any character, any number of times,” and is similar in meaning to the * wildcard of most command-line interpreters.
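
The three patterns just mentioned can be checked with a few lines of Java. This is a minimal sketch that assumes the JDK's java.util.regex package rather than the Apache package (the patterns themselves are written the same way); the class name BasicPatterns and the helper found are hypothetical.

```java
import java.util.regex.Pattern;

public class BasicPatterns {

    /** Hypothetical helper: true if the regex matches anywhere in the input. */
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // a+ : one or more occurrences of the letter a
        System.out.println(found("a+", "banana"));
        // Mrs?\. : the s is optional, the period is escaped so it is literal
        System.out.println(found("Mrs?\\.", "Dear Mr. Smith"));
        System.out.println(found("Mrs?\\.", "Dear Mrs. Smith"));
        System.out.println(found("Mrs?\\.", "Dear Ms. Smith"));
        // .* : any character, any number of times
        System.out.println(found("a.*z", "a to z"));
    }
}
```

The backslash before the period is doubled because the Java compiler consumes one level of escaping in string literals before the regex package sees the pattern.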

It’s important to remember that REs will match anyplace possible in the input, and that patterns ending in a greedy closure will consume as much as possible without compromising any other subexpressions.
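
The difference between a greedy closure and its reluctant counterpart is easiest to see by printing what each one actually matched. The following sketch again assumes java.util.regex as a stand-in for the Apache package; GreedyVsReluctant and firstMatch are illustrative names, not part of either API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {

    /** Hypothetical helper: the first substring matched, or null if none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<b>bold</b> and <i>italic</i>";
        // Greedy: .* runs to the last > it can still reach
        System.out.println(firstMatch("<.*>", input));
        // Reluctant: .*? stops at the first > that lets the match succeed
        System.out.println(firstMatch("<.*?>", input));
        // The match may start anywhere in the input, not only at the beginning
        System.out.println(firstMatch("\\d+", "Order 12345 shipped"));
    }
}
```

The greedy pattern returns the whole input string, the reluctant one returns only <b>, and the last call returns 12345 even though the digits sit in the middle of the text.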

Also, unlike some RE packages, the Apache package was designed to handle Unicode characters from the beginning. Actually, this came for free, as its basic units are the Java char and String types, which are Unicode-based. In fact, the standard Java escape sequence \unnnn is used to specify a Unicode character in the pattern. And the package uses methods of java.lang.Character to determine Unicode character properties, such as whether or not a given character is a space.
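
As a rough illustration of both points, the sketch below puts a Unicode escape into a pattern and calls java.lang.Character.isWhitespace( ) directly. It assumes java.util.regex in place of the Apache package, and the names UnicodeInPatterns and found are hypothetical. Be aware that java.util.regex, by default, defines \s more narrowly than isWhitespace( ) does, so the whitespace checks here call the Character method itself.

```java
import java.util.regex.Pattern;

public class UnicodeInPatterns {

    /** Hypothetical helper: true if the regex matches anywhere in the input. */
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // The escape below is e with an acute accent; the Java compiler
        // converts it to a single char before the pattern is compiled
        String cafe = "caf\u00e9";
        System.out.println(found("caf\u00e9", cafe));
        // The accented letter is one char, so a single . matches it
        System.out.println(found("caf.", cafe));

        // Character properties come from java.lang.Character
        System.out.println(Character.isWhitespace('\t'));      // tab
        System.out.println(Character.isWhitespace('\u2003'));  // EM SPACE
        System.out.println(Character.isWhitespace('\u00a0'));  // no-break space
    }
}
```

The last call prints false: isWhitespace( ) deliberately excludes the non-breaking space characters.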
