If you need more complex searching functionality than the previous methods provide, you can use regular expressions. A regular expression is a string that represents a pattern. The regular expression functions compare that pattern to another string and see if any of the string matches the pattern. Some functions tell you whether there was a match, while others make changes to the string.
PHP provides support for two different types of regular expressions:
POSIX and Perl-compatible. POSIX regular expressions are less
powerful, and sometimes slower, than the Perl-compatible functions,
but can be easier to read. There are three uses for regular
expressions: matching, which can also be used to
extract information from a string; substituting new text
for matching text; and
splitting a string into an
array of smaller chunks. PHP has functions for all three behaviors
for both Perl and POSIX regular expressions. For instance, ereg( )
does a POSIX match, while preg_match( )
does a Perl match.
Fortunately, there are a number of similarities between basic POSIX
and Perl regular expressions, so we’ll cover those
before delving into the details of each library.
Most characters in a regular expression are literal characters, meaning that they match only themselves. For instance, if you search for the regular expression "cow" in the string "Dave was a cowhand", you get a match because "cow" occurs in that string.
Some characters, though, have special meanings in regular expressions. For instance, a caret (^) at the beginning of a regular expression indicates that it must match the beginning of the string (or, more precisely, anchors the regular expression to the beginning of the string):
ereg('^cow', 'Dave was a cowhand');  // returns false
ereg('^cow', 'cowabunga!');          // returns true
Similarly, a dollar sign ($) at the end of a regular expression means that it must match the end of the string (i.e., anchors the regular expression to the end of the string):
ereg('cow$', 'Dave was a cowhand');  // returns false
ereg('cow$', "Don't have a cow");    // returns true
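The ereg( ) family was deprecated in PHP 5.3 and removed in PHP 7, so on a current interpreter the anchor examples can be tried with preg_match( ) instead. The following is a minimal sketch under that assumption: the pattern is wrapped in / delimiters, and preg_match( ) returns 1 for a match and 0 for no match. The variable names are ours.

```php
<?php
// Same anchors as the ereg() examples, written for preg_match().
$startMiss = preg_match('/^cow/', 'Dave was a cowhand'); // 0: "cow" is not at the start
$startHit  = preg_match('/^cow/', 'cowabunga!');         // 1
$endMiss   = preg_match('/cow$/', 'Dave was a cowhand'); // 0: "cow" is not at the end
$endHit    = preg_match('/cow$/', "Don't have a cow");   // 1

var_dump($startMiss, $startHit, $endMiss, $endHit);
```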
A period (.) in a regular expression matches any single character:
ereg('c.t', 'cat');  // returns true
ereg('c.t', 'cut');  // returns true
ereg('c.t', 'c t');  // returns true
ereg('c.t', 'bat');  // returns false
ereg('c.t', 'ct');   // returns false
If you want to match one of these special characters (called a metacharacter), you have to escape it with a backslash:
ereg('\$5\.00', 'Your bill is $5.00 exactly');  // returns true
ereg('$5.00', 'Your bill is $5.00 exactly');    // returns false
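When the text to match comes from somewhere else, adding the backslashes by hand is error-prone. As a sketch of one alternative, the standard preg_quote( ) function escapes metacharacters for use with the Perl-compatible functions; its second argument names the delimiter so that it is escaped too. The variable names here are illustrative.

```php
<?php
$bill = 'Your bill is $5.00 exactly';

// Escaped by hand: \$ and \. are literal characters.
$byHand = preg_match('/\$5\.00/', $bill);           // 1

// preg_quote() inserts the backslashes for us.
$quoted   = preg_quote('$5.00', '/');               // \$5\.00
$viaQuote = preg_match('/' . $quoted . '/', $bill); // 1

// Unescaped: $ anchors to the end of the string, so "5" can never follow it.
$unescaped = preg_match('/$5.00/', $bill);          // 0
```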
Regular expressions are case-sensitive by default, so the regular expression "cow" doesn’t match the string "COW". If you want to perform a case-insensitive POSIX-style match, you can use the eregi( ) function. With Perl-style regular expressions, you still use preg_match( ), but specify a flag to indicate a case-insensitive match (as you’ll see when we discuss Perl-style regular expressions in detail later in this chapter).
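As a brief preview, and only a sketch, the flag in question is the i modifier, written after the closing delimiter of a Perl-style pattern:

```php
<?php
$sensitive   = preg_match('/cow/', 'COW');  // 0: case matters by default
$insensitive = preg_match('/cow/i', 'COW'); // 1: the trailing i ignores case

var_dump($sensitive, $insensitive);
```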
So far, we haven’t done anything we couldn’t have done with the string functions we’ve already seen, like strstr( ). The real power of regular expressions comes from their ability to specify abstract patterns that can match many different character sequences. You can specify three basic types of abstract patterns in a regular expression:
A set of acceptable characters that can appear in the string (e.g., alphabetic characters, numeric characters, specific punctuation characters)
A set of alternatives for the string (e.g., "com", "edu", "net", or "org")
A repeating sequence in the string (e.g., at least one but no more than five numeric characters)
These three kinds of patterns can be combined in countless ways, to create regular expressions that match such things as valid phone numbers and URLs.
To specify a set of acceptable characters in your pattern, you can either build a character class yourself or use a predefined one. You can build your own character class by enclosing the acceptable characters in square brackets:
ereg('c[aeiou]t', 'I cut my hand');    // returns true
ereg('c[aeiou]t', 'This crusty cat');  // returns true
ereg('c[aeiou]t', 'What cart?');       // returns false
ereg('c[aeiou]t', '14ct gold');        // returns false
The regular expression engine finds a "c", then checks that the next character is one of "a", "e", "i", "o", or "u". If it isn’t a vowel, the match fails and the engine goes back to looking for another "c". If a vowel is found, though, the engine then checks that the next character is a "t". If it is, the engine is at the end of the match and so returns true. If the next character isn’t a "t", the engine goes back to looking for another "c".
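To watch that search in action, the sketch below asks preg_match( ) where the match was found, using the PREG_OFFSET_CAPTURE flag. In "This crusty cat", the first "c" (offset 5, in "crusty") is followed by "r", so the engine gives up there and succeeds at the later "c". The variable names are ours.

```php
<?php
// Each entry of $m is a [matched text, byte offset] pair with this flag.
preg_match('/c[aeiou]t/', 'This crusty cat', $m, PREG_OFFSET_CAPTURE);
[$text, $offset] = $m[0];

echo "$text at offset $offset\n"; // cat at offset 12
```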
You can negate a character class with a caret (^) at the start:
ereg('c[^aeiou]t', 'I cut my hand');  // returns false
ereg('c[^aeiou]t', 'Reboot chthon');  // returns true
ereg('c[^aeiou]t', '14ct gold');      // returns false
In this case, the regular expression engine is looking for a "c", followed by a character that isn’t a vowel, followed by a "t".
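A negated class still has to consume exactly one character, which is why "14ct gold" fails: the "t" is used up as the non-vowel, and the space that follows is not a "t". A small sketch with preg_match( ), using an illustrative $results array:

```php
<?php
$results = [];
foreach (['I cut my hand', 'Reboot chthon', '14ct gold'] as $subject) {
    // true when "c", one non-vowel character, then "t" appear in sequence
    $results[$subject] = preg_match('/c[^aeiou]t/', $subject) === 1;
}

var_dump($results); // false, true, false
```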
You can define a range of characters with a hyphen (-). This simplifies character classes like “all letters” and “all digits”:
ereg('[0-9]%', 'we are 25% complete');         // returns true
ereg('[0123456789]%', 'we are 25% complete');  // returns true
ereg('[a-z]t', '11th');                        // returns false
ereg('[a-z]t', 'cat');                         // returns true
ereg('[a-z]t', 'PIT');                         // returns false
ereg('[a-zA-Z]!', '11!');                      // returns false
ereg('[a-zA-Z]!', 'stop!');                    // returns true
When you are specifying a character class, some special characters lose their meaning, while others take on new meaning. In particular, the $ anchor and the period lose their meaning in a character class, while the ^ character is no longer an anchor but negates the character class if it is the first character after the open bracket. For instance, [^\]] matches any character that is not a closing bracket, while [$.^] matches any dollar sign, period, or caret.
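Both classes can be checked with the Perl-compatible functions. This sketch uses preg_match_all( ) to count the literal characters found by [$.^], and preg_match( ) to grab everything before the first closing bracket; the sample strings and variable names are made up for illustration.

```php
<?php
// Inside brackets, $ and . are literal, and ^ is literal when it is not first.
$count = preg_match_all('/[$.^]/', 'cost: $5.00 ^_^', $found);
// $count is 4; $found[0] is ['$', '.', '^', '^']

// One or more characters from the start that are not a closing bracket.
preg_match('/^[^\]]+/', 'abc]def', $beforeBracket);
// $beforeBracket[0] is 'abc'
```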
The various regular expression libraries define shortcuts for character classes, including digits, alphabetic characters, and whitespace. The actual syntax for these shortcuts differs between POSIX-style and Perl-style regular expressions. For instance, with POSIX, the whitespace character class is "[[:space:]]", while with Perl it is "\s".
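The Perl-compatible library also accepts the POSIX class names inside brackets, so the two spellings can be compared side by side with preg_match( ). A quick sketch, with an assumed tab-separated sample string:

```php
<?php
$text = "tab\tseparated";

$posixStyle = preg_match('/[[:space:]]/', $text); // 1: the tab is whitespace
$perlStyle  = preg_match('/\s/', $text);          // 1: same test, Perl shortcut
$none       = preg_match('/\s/', 'nospaces');     // 0

var_dump($posixStyle, $perlStyle, $none);
```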
You can use the vertical pipe (|) character to specify alternatives in a regular expression:
ereg('cat|dog', 'the cat rubbed my legs');     // returns true
ereg('cat|dog', 'the dog rubbed my legs');     // returns true
ereg('cat|dog', 'the rabbit rubbed my legs');  // returns false
The precedence of alternation can be a surprise: '^cat|dog$' selects from '^cat' and 'dog$', meaning that it matches a line that either starts with "cat" or ends with "dog". If you want a line that contains just "cat" or "dog", you need to use the regular expression '^(cat|dog)$'.
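The difference is easy to see with a string such as "catalog", which starts with "cat" but is not just "cat". A sketch using preg_match( ), with $loose and $strict as illustrative names for the two patterns:

```php
<?php
$loose  = '/^cat|dog$/';   // starts with "cat" OR ends with "dog"
$strict = '/^(cat|dog)$/'; // the whole string is "cat" or "dog"

$looseCatalog  = preg_match($loose, 'catalog');  // 1: starts with "cat"
$strictCatalog = preg_match($strict, 'catalog'); // 0
$looseHotDog   = preg_match($loose, 'hot dog');  // 1: ends with "dog"
$strictHotDog  = preg_match($strict, 'hot dog'); // 0
$strictDog     = preg_match($strict, 'dog');     // 1
```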
You can combine character classes and alternation to, for example, check for strings that don’t start with a capital letter:
ereg('^([a-z]|[0-9])', 'The quick brown fox');  // returns false
ereg('^([a-z]|[0-9])', 'jumped over');          // returns true
ereg('^([a-z]|[0-9])', '10 lazy dogs');         // returns true
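When every alternative is a single character class, the same test can usually be written as one class holding both ranges. The sketch below runs both spellings through preg_match( ) and records the pair of results for each string; $agree is our own name.

```php
<?php
$agree = [];
foreach (['The quick brown fox', 'jumped over', '10 lazy dogs'] as $s) {
    $viaAlternation = preg_match('/^([a-z]|[0-9])/', $s);
    $viaSingleClass = preg_match('/^[a-z0-9]/', $s);
    $agree[$s] = [$viaAlternation, $viaSingleClass];
}

var_dump($agree); // [0, 0], then [1, 1], then [1, 1]
```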
To specify a repeating pattern, you use something called a quantifier. The quantifier goes after the pattern that’s repeated and says how many times to repeat that pattern. Table 4-6 shows the quantifiers that are supported by both POSIX and Perl regular expressions.
Table 4-6. Regular expression quantifiers
Quantifier | Meaning |
---|---|
? | 0 or 1 |
* | 0 or more |
+ | 1 or more |
{n} | Exactly n times |
{n,m} | At least n, no more than m times |
{n,} | At least n times |
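The brace forms of the quantifiers, {n}, {n,m}, and {n,}, are easiest to grasp by counting. This sketch applies each one to the letter "a" with preg_match( ); the test strings are invented for illustration.

```php
<?php
$exact       = preg_match('/ca{2}t/', 'caat');      // 1: exactly two a's
$exactMiss   = preg_match('/ca{2}t/', 'caaat');     // 0: three is too many
$range       = preg_match('/ca{2,3}t/', 'caaat');   // 1: two or three a's
$rangeMiss   = preg_match('/ca{2,3}t/', 'caaaat');  // 0: four is too many
$atLeast     = preg_match('/ca{2,}t/', 'caaaaaat'); // 1: two or more a's
$atLeastMiss = preg_match('/ca{2,}t/', 'cat');      // 0: only one a
```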
To repeat a single character, simply put the quantifier after the character:
ereg('ca+t', 'caaaaaaat');  // returns true
ereg('ca+t', 'ct');         // returns false
ereg('ca?t', 'caaaaaaat');  // returns false
ereg('ca*t', 'ct');         // returns true
With quantifiers and character classes, we can actually do something useful, like matching valid U.S. telephone numbers:
ereg('[0-9]{3}-[0-9]{3}-[0-9]{4}', '303-555-1212');   // returns true
ereg('[0-9]{3}-[0-9]{3}-[0-9]{4}', '64-9-555-1234');  // returns false
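Because the pattern is not anchored, it also succeeds when a phone-number-shaped run is buried inside a longer string of digits. If the whole string is supposed to be the number (an assumption about the intended use), adding the ^ and $ anchors tightens the check. A sketch with preg_match( ) and made-up inputs:

```php
<?php
$unanchored = '/[0-9]{3}-[0-9]{3}-[0-9]{4}/';
$anchored   = '/^[0-9]{3}-[0-9]{3}-[0-9]{4}$/';

$embeddedLoose  = preg_match($unanchored, '9303-555-12125'); // 1: finds 303-555-1212 inside
$embeddedStrict = preg_match($anchored, '9303-555-12125');   // 0: extra digits at both ends
$plain          = preg_match($anchored, '303-555-1212');     // 1
$international  = preg_match($anchored, '64-9-555-1234');    // 0
```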
You can use parentheses to group bits of a regular expression together to be treated as a single unit called a subpattern:
ereg('a (very )+big dog', 'it was a very very big dog');  // returns true
ereg('^(cat|dog)$', 'cat');                               // returns true
ereg('^(cat|dog)$', 'dog');                               // returns true
The parentheses also cause the substring that matches the subpattern to be captured. If you pass an array as the third argument to a match function, the array is populated with any captured substrings:
ereg('([0-9]+)', 'You have 42 magic beans', $captured); // returns true and populates $captured
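The zeroth element of the array is set to the portion of the string that matched the entire pattern. The first element is the substring that matched the first subpattern (if there is one), the second element is the substring that matched the second subpattern, and so on. In the example above, both $captured[0] and $captured[1] are "42", because the subpattern spans the whole pattern.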
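The Perl-compatible preg_match( ) fills its third argument the same way: element zero holds the text matched by the whole pattern, and the following elements hold the subpattern captures. The sketch below extends the pattern with a second, hypothetical subpattern so that the elements differ.

```php
<?php
preg_match(
    '/([0-9]+) magic (beans|peas)/',
    'You have 42 magic beans',
    $captured
);

print_r($captured);
// $captured[0] is '42 magic beans' (the whole match)
// $captured[1] is '42'             (first subpattern)
// $captured[2] is 'beans'          (second subpattern)
```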