Regular Expressions
If you need more complex searching functionality than the previous methods provide, you can use regular expressions. A regular expression is a string that represents a pattern. The regular expression functions compare that pattern to another string and see if any part of the string matches the pattern. Some functions tell you whether there was a match, while others make changes to the string.
PHP provides support for two different types of regular expressions: POSIX and Perl-compatible. POSIX regular expressions are less powerful, and sometimes slower, than the Perl-compatible functions, but can be easier to read. There are three uses for regular expressions: matching, which can also be used to extract information from a string; substituting new text for matching text; and splitting a string into an array of smaller chunks. PHP has functions for all three behaviors for both Perl and POSIX regular expressions. For instance, ereg() does a POSIX match, while preg_match() does a Perl match. Fortunately, there are a number of similarities between basic POSIX and Perl regular expressions, so we’ll cover those before delving into the details of each library.
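As a rough illustration of the three uses named above, here is a minimal sketch using the Perl-compatible functions preg_match(), preg_replace(), and preg_split(). The Perl-compatible family is used because the ereg() family was removed in PHP 7. The sample strings and variable names are illustrative assumptions, not taken from the text.

```php
<?php
// Matching: preg_match() returns 1 if the pattern is found, 0 if not.
$found = preg_match('/cow/', 'Dave was a cowhand');

// Substituting: preg_replace() returns a copy with each match replaced.
$replaced = preg_replace('/cow/', 'farm', 'Dave was a cowhand');

// Splitting: preg_split() cuts the string wherever the pattern matches
// (here, a comma followed by any number of spaces).
$chunks = preg_split('/, */', 'com, edu,net');

var_dump($found);    // int(1)
var_dump($replaced); // "Dave was a farmhand"
print_r($chunks);    // three elements: com, edu, net
```

Note that Perl-compatible patterns are wrapped in delimiters (slashes here), which POSIX-style patterns do not use.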
The Basics
Most characters in a regular expression are literal characters, meaning that they match only themselves. For instance, if you search for the regular expression "cow" in the string "Dave was a cowhand", you get a match because "cow" occurs in that string.
Some characters have special meanings in regular expressions. For instance, a caret (^) at the beginning of a regular expression indicates that it must match the beginning of the string (or, more precisely, anchors the regular expression to the beginning of the string):
ereg('^cow', 'Dave was a cowhand'); // returns false
ereg('^cow', 'cowabunga!'); // returns true
Similarly, a dollar sign ($) at the end of a regular expression means that it must match the end of the string (i.e., anchors the regular expression to the end of the string):
ereg('cow$', 'Dave was a cowhand'); // returns false
ereg('cow$', "Don't have a cow"); // returns true
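The two anchors can be wrapped in small helpers. The sketch below uses preg_match() rather than ereg(), and the function names startsWithCow() and endsWithCow() are hypothetical, introduced only for illustration.

```php
<?php
// Hypothetical helper: true when the string begins with "cow".
function startsWithCow(string $s): bool
{
    return preg_match('/^cow/', $s) === 1;
}

// Hypothetical helper: true when the string ends with "cow".
function endsWithCow(string $s): bool
{
    return preg_match('/cow$/', $s) === 1;
}

var_dump(startsWithCow('cowabunga!'));         // bool(true)
var_dump(startsWithCow('Dave was a cowhand')); // bool(false)
var_dump(endsWithCow("Don't have a cow"));     // bool(true)
var_dump(endsWithCow('Dave was a cowhand'));   // bool(false)
```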
A period (.) in a regular expression matches any single character:
ereg('c.t', 'cat'); // returns true
ereg('c.t', 'cut'); // returns true
ereg('c.t', 'c t'); // returns true
ereg('c.t', 'bat'); // returns false
ereg('c.t', 'ct'); // returns false
If you want to match one of these special characters (called a metacharacter), you have to escape it with a backslash:
ereg('\$5\.00', 'Your bill is $5.00 exactly'); // returns true
ereg('$5.00', 'Your bill is $5.00 exactly'); // returns false
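When the literal text comes from a variable, escaping by hand is error-prone. One option with the Perl-compatible functions is preg_quote(), which backslash-escapes metacharacters for you. The following is a sketch under that assumption; the variable names and the second test string are illustrative.

```php
<?php
$price = '$5.00';

// preg_quote() escapes "$" and "."; the second argument also escapes
// the delimiter character if it appears in the text.
$pattern = '/' . preg_quote($price, '/') . '/';

$exact = preg_match($pattern, 'Your bill is $5.00 exactly'); // 1
$other = preg_match($pattern, 'Your bill is $5100 exactly'); // 0

// With only the dollar sign escaped, the period still matches any character.
$loose = preg_match('/\$5.00/', 'Your bill is $5100 exactly'); // 1

var_dump($pattern, $exact, $other, $loose);
```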
Regular expressions are case-sensitive by default, so the regular expression "cow" doesn’t match the string "COW". If you want to perform a case-insensitive POSIX-style match, you can use the eregi() function. With Perl-style regular expressions, you still use preg_match(), but specify a flag to indicate a case-insensitive match (as you’ll see when we discuss Perl-style regular expressions in detail later in this chapter).
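For the Perl-style case, the flag in question is the i modifier, written after the closing delimiter. A minimal sketch, with illustrative variable names:

```php
<?php
// Without a modifier the match is case-sensitive.
$sensitive = preg_match('/cow/', 'COW');    // 0

// The trailing "i" makes the whole pattern case-insensitive.
$insensitive = preg_match('/cow/i', 'COW'); // 1

var_dump($sensitive, $insensitive);
```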
So far, we haven’t done anything we couldn’t have done with the string functions we’ve already seen, like strstr(). The real power of regular expressions comes from their ability to specify abstract patterns that can match many different character sequences. You can specify three basic types of abstract patterns in a regular expression:
A set of acceptable characters that can appear in the string (e.g., alphabetic characters, numeric characters, specific punctuation characters)
A set of alternatives for the string (e.g., "com", "edu", "net", or "org")
A repeating sequence in the string (e.g., at least one but no more than five numeric characters)
These three kinds of patterns can be combined in countless ways to create regular expressions that match such things as valid phone numbers and URLs.
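To give a feel for how the three kinds of patterns combine, the sketch below checks for a very simplified host name: a run of lowercase letters or digits, a literal dot, and one of four suffixes. The function name looksLikeSimpleHost() and the 20-character limit are assumptions made for illustration; this is not a real URL or host name validator, and it uses preg_match() rather than ereg().

```php
<?php
// Hypothetical helper combining a character class ([a-z0-9]),
// a repetition ({1,20}), and a set of alternatives (com|edu|net|org).
function looksLikeSimpleHost(string $s): bool
{
    return preg_match('/^[a-z0-9]{1,20}\.(com|edu|net|org)$/', $s) === 1;
}

var_dump(looksLikeSimpleHost('example.com')); // bool(true)
var_dump(looksLikeSimpleHost('example.xyz')); // bool(false)
var_dump(looksLikeSimpleHost('.com'));        // bool(false)
```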
Character Classes
To specify a set of acceptable characters in your pattern, you can either build a character class yourself or use a predefined one. You can build your own character class by enclosing the acceptable characters in square brackets:
ereg('c[aeiou]t', 'I cut my hand'); // returns true
ereg('c[aeiou]t', 'This crusty cat'); // returns true
ereg('c[aeiou]t', 'What cart?'); // returns false
ereg('c[aeiou]t', '14ct gold'); // returns false
The regular expression engine finds a "c", then checks that the next character is one of "a", "e", "i", "o", or "u". If it isn’t a vowel, the match fails and the engine goes back to looking for another "c". If a vowel is found, the engine checks that the next character is a "t". If it is, the engine is at the end of the match and returns true. If the next character isn’t a "t", the engine goes back to looking for another "c".
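The search described above can be observed by asking for the offset of the match. The sketch below uses preg_match() with the PREG_OFFSET_CAPTURE flag rather than ereg(); the variable names are illustrative.

```php
<?php
// "This crusty cat" has a "c" at offset 5 ("crusty") and at offset 12 ("cat").
// The first "c" is followed by "r", not a vowel, so the engine moves on.
$found = preg_match('/c[aeiou]t/', 'This crusty cat', $m, PREG_OFFSET_CAPTURE);

// With PREG_OFFSET_CAPTURE, $m[0] is a pair: [matched text, byte offset].
$text   = $m[0][0];
$offset = $m[0][1];

var_dump($found, $text, $offset); // int(1), "cat", int(12)
```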
You can negate a character class with a caret (^) at the start:
ereg('c[^aeiou]t', 'I cut my hand'); // returns false
ereg('c[^aeiou]t', 'Reboot chthon'); // returns true
ereg('c[^aeiou]t', '14ct gold'); // returns false
In this case, the regular expression engine is looking for a "c" followed by a character that isn’t a vowel, followed by a "t".
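A point worth seeing in code is that a negated class still has to consume one character, which is why "14ct gold" fails. A minimal sketch using preg_match() rather than ereg(), with illustrative variable names:

```php
<?php
// "chthon" contains "c", then "h" (not a vowel), then "t".
$hit = preg_match('/c[^aeiou]t/', 'Reboot chthon', $m);
$matched = $m[0];

// In "14ct gold" nothing sits between "c" and "t", so there is
// no character for [^aeiou] to match.
$miss = preg_match('/c[^aeiou]t/', '14ct gold');

var_dump($hit, $matched, $miss); // int(1), "cht", int(0)
```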
You can define a range of characters with a hyphen (-). This simplifies character classes like “all letters” and “all digits”:
ereg('[0-9]%', 'we are 25% complete'); // returns true
ereg('[0123456789]%', 'we are 25% complete'); // returns true
ereg('[a-z]t', '11th'); // returns false
ereg('[a-z]t', 'cat'); // returns true
ereg('[a-z]t', 'PIT'); // returns false
ereg('[a-zA-Z]!', '11!'); // returns false
ereg('[a-zA-Z]!', 'stop!'); // returns true
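Several ranges can sit side by side in one class. The sketch below, written with preg_match() rather than ereg(), contrasts a letters-only class with one that also accepts digits; the variable names are illustrative.

```php
<?php
// Letters only: the "1" before "!" is not in the class.
$lettersOnly = preg_match('/[a-zA-Z]!/', '11!');       // 0

// Letters or digits: now the "1" before "!" qualifies.
$lettersOrDigits = preg_match('/[a-zA-Z0-9]!/', '11!'); // 1

// A range matches exactly one character, so only "5%" is matched here.
preg_match('/[0-9]%/', 'we are 25% complete', $m);
$percent = $m[0];

var_dump($lettersOnly, $lettersOrDigits, $percent);
```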
When you are specifying a character class, some special characters lose their meaning, while others take on new meanings. In particular, the $ anchor and the period lose their meaning in a character class, while the ^ character is no longer an anchor but negates the character class if it is the first character after the open bracket. For instance, [^\]] matches any character that is not a closing bracket, while [$.^] matches any dollar sign, period, or caret.
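Both classes from the paragraph above can be tried directly. This sketch uses preg_match_all() and preg_replace() from the Perl-compatible functions; the sample strings are illustrative assumptions.

```php
<?php
// [$.^] matches a literal dollar sign, period, or caret.
$count = preg_match_all('/[$.^]/', 'cost: $5.00 ^_^', $m);
$symbols = $m[0];

// [^\]] matches anything except a closing bracket, so deleting
// every match leaves only the closing brackets behind.
$onlyBrackets = preg_replace('/[^\]]/', '', 'a]b]c');

var_dump($count);        // int(4)
print_r($symbols);       // "$", ".", "^", "^"
var_dump($onlyBrackets); // "]]"
```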
The various regular expression libraries define shortcuts for character classes, including digits, alphabetic characters, and whitespace. The actual syntax for these shortcuts differs between POSIX-style and Perl-style regular expressions. For instance, with POSIX, the whitespace character class is "[[:space:]]", while with Perl it is "\s".
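As a small sketch of the whitespace shortcut, the following splits a string on runs of whitespace with preg_split(). The Perl-compatible functions also accept the POSIX bracket form inside a character class, so both spellings are shown; the sample string is illustrative.

```php
<?php
// Two spaces and a tab separate the words.
$input = "one  two\tthree";

// Perl-style shortcut for whitespace.
$perlStyle = preg_split('/\s+/', $input);

// POSIX-style class name, also understood by the preg_* functions.
$posixStyle = preg_split('/[[:space:]]+/', $input);

print_r($perlStyle);  // one, two, three
print_r($posixStyle); // one, two, three
```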
Alternatives
You can use the vertical pipe (|) character to specify alternatives in a regular expression:
ereg('cat|dog', 'the cat rubbed my legs'); // returns true
ereg('cat|dog', 'the dog rubbed my legs'); // returns true
ereg('cat|dog', 'the rabbit rubbed my legs'); // returns false
The precedence of alternation can be a surprise: '^cat|dog$' selects from '^cat' and 'dog$', meaning that it matches a line that either starts with "cat" or ends with "dog". If you want a line that contains just "cat" or "dog", you need to use the regular expression '^(cat|dog)$'.
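The difference between the two patterns shows up with strings such as "catalog" and "hotdog". A minimal sketch using preg_match() rather than ereg(); the test strings and variable names are illustrative.

```php
<?php
// Without parentheses: "starts with cat" OR "ends with dog".
$a = preg_match('/^cat|dog$/', 'catalog'); // 1 (starts with "cat")
$b = preg_match('/^cat|dog$/', 'hotdog');  // 1 (ends with "dog")

// With parentheses: the whole string must be "cat" or "dog".
$c = preg_match('/^(cat|dog)$/', 'catalog'); // 0
$d = preg_match('/^(cat|dog)$/', 'dog');     // 1

var_dump($a, $b, $c, $d);
```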
You can combine character classes and alternation to, for example, check that a string starts with a lowercase letter or a digit rather than a capital letter:
ereg('^([a-z]|[0-9])', 'The quick brown fox'); // returns false
ereg('^([a-z]|[0-9])', 'jumped over'); // returns true
ereg('^([a-z]|[0-9])', '10 lazy dogs'); // returns true
Repeating Sequences
To specify a repeating pattern, you use something called a quantifier. The quantifier goes after the pattern that’s repeated and says how many times to repeat that pattern. Table 4-6 shows the quantifiers that are supported by both POSIX and Perl regular expressions.
Quantifier | Meaning |
? | 0 or 1 |
* | 0 or more |
+ | 1 or more |
{n} | Exactly n times |
{n,m} | At least n, no more than m times |
{n,} | At least n times |
To repeat a single character, simply put the quantifier after the character:
ereg('ca+t', 'caaaaaaat'); // returns true
ereg('ca+t', 'ct'); // returns false
ereg('ca?t', 'caaaaaaat'); // returns false
ereg('ca*t', 'ct'); // returns true
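The brace quantifiers work the same way. The sketch below uses preg_match() rather than ereg() and anchors each pattern so that the count applies to the whole string; the helper name countMatches() and the sample strings are assumptions for illustration.

```php
<?php
// Hypothetical helper: true when the whole string fits the pattern.
function countMatches(string $pattern, string $s): bool
{
    return preg_match($pattern, $s) === 1;
}

// {1,5}: at least one but no more than five digits.
var_dump(countMatches('/^[0-9]{1,5}$/', '12345'));  // bool(true)
var_dump(countMatches('/^[0-9]{1,5}$/', '123456')); // bool(false)
var_dump(countMatches('/^[0-9]{1,5}$/', ''));       // bool(false)

// {3}: exactly three; {2,}: two or more.
var_dump(countMatches('/^ca{3}t$/', 'caaat'));  // bool(true)
var_dump(countMatches('/^ca{3}t$/', 'caat'));   // bool(false)
var_dump(countMatches('/^ca{2,}t$/', 'caaaat')); // bool(true)
var_dump(countMatches('/^ca{2,}t$/', 'cat'));    // bool(false)
```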
With quantifiers and character classes, we can actually do something useful, like matching valid U.S. telephone numbers:
ereg('[0-9]{3}-[0-9]{3}-[0-9]{4}', '303-555-1212'); // returns true
ereg('[0-9]{3}-[0-9]{3}-[0-9]{4}', '64-9-555-1234'); // returns false
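Because the pattern above is not anchored, it also succeeds when a phone number appears inside a longer string. If the goal is to validate that the entire string is a phone number, one approach is to add the ^ and $ anchors. This is a sketch using preg_match() rather than ereg(); the variable names and extra test strings are illustrative.

```php
<?php
$loose  = '/[0-9]{3}-[0-9]{3}-[0-9]{4}/';
$strict = '/^[0-9]{3}-[0-9]{3}-[0-9]{4}$/';

// Extra digits on either side still contain a valid-looking number.
$looseExtra  = preg_match($loose, '1303-555-12129');  // 1
$strictExtra = preg_match($strict, '1303-555-12129'); // 0

// A bare number passes both checks.
$looseBare  = preg_match($loose, '303-555-1212');  // 1
$strictBare = preg_match($strict, '303-555-1212'); // 1

var_dump($looseExtra, $strictExtra, $looseBare, $strictBare);
```

The unanchored form remains the right choice when you want to find a number somewhere within surrounding text.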
Subpatterns
You can use parentheses to group bits of a regular expression together to be treated as a single unit called a subpattern:
ereg('a (very )+big dog', 'it was a very very big dog'); // returns true
ereg('^(cat|dog)$', 'cat'); // returns true
ereg('^(cat|dog)$', 'dog'); // returns true
The parentheses also cause the substring that matches the subpattern to be captured. If you pass an array as the third argument to a match function, the array is populated with any captured substrings:
ereg('([0-9]+)', 'You have 42 magic beans', $captured); // returns true and populates $captured
The zeroth element of the array is set to the portion of the string that matched the entire pattern (here, "42"), not to the whole string being searched. The first element is the substring that matched the first subpattern (if there is one), the second element is the substring that matched the second subpattern, and so on.
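To see how the captured array is laid out, here is a minimal sketch with two subpatterns, using preg_match() rather than ereg(). The pattern is an illustrative extension of the example above, not taken from the text.

```php
<?php
// Two subpatterns: a run of digits, and one of two alternatives.
$found = preg_match(
    '/([0-9]+) magic (beans|peas)/',
    'You have 42 magic beans',
    $captured
);

// $captured[0] is the text matched by the whole pattern;
// $captured[1] and $captured[2] hold the two subpattern matches.
var_dump($found);   // int(1)
print_r($captured); // "42 magic beans", "42", "beans"
```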