Now that you understand the basics of regular expressions, we can explore the details. POSIX-style regular expressions use the Unix locale system. The locale system provides functions for sorting and identifying characters that let you intelligently work with text from languages other than English. In particular, what constitutes a “letter” varies from language to language (think of à and ç), and there are character classes in POSIX regular expressions that take this into account.
However, POSIX regular expressions are designed for use with only textual data. If your data has a NUL-byte (\x00) in it, the regular expression functions will interpret it as the end of the string, and matching will not take place beyond that point. To do matches against arbitrary binary data, you’ll need to use Perl-compatible regular expressions, which are discussed later in this chapter. Also, as we already mentioned, the Perl-style regular expression functions are often faster than the equivalent POSIX-style ones.
As shown in Table 4-7, POSIX defines a number of named sets of characters that you can use in character classes. The expansions given in Table 4-7 are for English. The actual letters vary from locale to locale.
Table 4-7. POSIX character classes
Each [:something:] class can be used in place of a character in a character class. For instance, to find any character that’s a digit, an uppercase letter, or an at sign (@), use the following regular expression:
[@[:digit:][:upper:]]
However, you can’t use a character class as the endpoint of a range:
ereg('[A-[:lower:]]', 'string'); // invalid regular expression
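As a quick illustration, the [@[:digit:][:upper:]] class shown earlier can be tried out directly. This is a minimal sketch under one assumption: it runs on a current PHP, where the ereg( ) family is no longer available (it was removed in PHP 7.0), so it uses preg_match( ), which accepts the same [:name:] classes inside a bracket expression. The helper name hasDigitUpperOrAt is hypothetical.

```php
<?php
// Hypothetical helper: true if $text contains a digit, an uppercase
// letter, or an at sign anywhere in it.
function hasDigitUpperOrAt(string $text): bool
{
    // Assumption: preg_match() stands in for ereg(); the surrounding
    // slashes are PCRE delimiters, not part of the character class.
    return preg_match('/[@[:digit:][:upper:]]/', $text) === 1;
}

var_dump(hasDigitUpperOrAt('user@example'));  // bool(true): contains @
var_dump(hasDigitUpperOrAt('route 66'));      // bool(true): contains digits
var_dump(hasDigitUpperOrAt('no match here')); // bool(false)
```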
Some locales consider certain character sequences as if they were a single character; these are called collating sequences. To match one of these multicharacter sequences in a character class, enclose it with [. and .]. For example, if your locale has the collating sequence ch, you can match s, t, or ch with this character class:
[st[.ch.]]
The final POSIX extension to character classes is the equivalence class, specified by enclosing the character in [= and =]. Equivalence classes match characters that have the same collating order, as defined in the current locale. For example, a locale may define a, á, and ä as having the same sorting precedence. To match any one of them, the equivalence class is [=a=].
An anchor limits a match to a particular location in the string (anchors do not match actual characters in the target string). Table 4-8 lists the anchors supported by POSIX regular expressions.
A word boundary is defined as the point between a whitespace character and an identifier (alphanumeric or underscore) character:
ereg('[[:<:]]gun[[:>:]]', 'the Burgundy exploded'); // returns false
ereg('gun', 'the Burgundy exploded'); // returns true
Note that the beginning and end of a string also qualify as word boundaries.
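The effect of word-boundary anchors can be sketched on the same sentence. This is only an approximation under stated assumptions: it uses preg_match( ) and PCRE's \b anchor instead of ereg( ) with [[:<:]] and [[:>:]], because the ereg( ) functions were removed in PHP 7.0. Note that \b matches between any word character and any non-word character, which is slightly broader than the whitespace-based definition given above.

```php
<?php
$text = 'the Burgundy exploded';

// "gun" occurs inside "Burgundy", so an unanchored search succeeds...
$anywhere = preg_match('/gun/', $text) === 1;

// ...but requiring a word boundary on both sides rejects it.
$wholeWord = preg_match('/\bgun\b/', $text) === 1;

// The end of the string counts as a boundary, so a trailing word matches.
$atEnd = preg_match('/\bgun\b/', 'he dropped the gun') === 1;

var_dump($anywhere, $wholeWord, $atEnd); // bool(true) bool(false) bool(true)
```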
There are three categories of functions for POSIX-style regular expressions: matching, replacing, and splitting.
The ereg( ) function takes a pattern, a string, and an optional array. It populates the array, if given, and returns true or false depending on whether a match for the pattern was found in the string:
$found = ereg(pattern, string [, captured ]);
For example:
ereg('y.*e$', 'Sylvie'); // returns true
ereg('y(.*)e$', 'Sylvie', $a); // returns true, $a is array('ylvie', 'lvi')
The zeroth element of the array is set to the portion of the string that matched the entire pattern. The first element is the substring that matched the first subpattern, the second element is the substring that matched the second subpattern, and so on.
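The capture behavior can be checked with a short sketch. It assumes a current PHP, where ereg( ) no longer exists (it was removed in PHP 7.0), and therefore uses preg_match( ), which fills its third argument with the overall match in element 0 and the subpattern matches after it. Unlike the description above, preg_match( ) returns 1 or 0 rather than true or false.

```php
<?php
// Assumption: preg_match() is used as a stand-in for ereg().
$found = preg_match('/y(.*)e$/', 'Sylvie', $captured);

var_dump($found);   // int(1)
print_r($captured); // element 0 is "ylvie", element 1 is "lvi"
```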
The eregi( ) function is a case-insensitive form of ereg( ). Its arguments and return values are the same as those for ereg( ).
Example 4-1 uses pattern matching to determine whether a credit-card number passes the Luhn checksum and whether the digits are appropriate for a card of a specific type.
Example 4-1. Credit-card validator
// The Luhn checksum determines whether a credit-card number is syntactically
// correct; it cannot, however, tell if a card with the number has been issued,
// is currently active, or has enough space left to accept a charge.
function IsValidCreditCard($inCardNumber, $inCardType) {
  // Assume it's okay
  $isValid = true;

  // Strip all non-numbers from the string
  $inCardNumber = ereg_replace('[^[:digit:]]', '', $inCardNumber);

  // Make sure the card number and type match
  switch ($inCardType) {
    case 'mastercard':
      $isValid = ereg('^5[1-5].{14}$', $inCardNumber);
      break;
    case 'visa':
      $isValid = ereg('^4.{15}$|^4.{12}$', $inCardNumber);
      break;
    case 'amex':
      $isValid = ereg('^3[47].{13}$', $inCardNumber);
      break;
    case 'discover':
      $isValid = ereg('^6011.{12}$', $inCardNumber);
      break;
    case 'diners':
      $isValid = ereg('^30[0-5].{11}$|^3[68].{12}$', $inCardNumber);
      break;
    case 'jcb':
      // Group the two four-digit prefixes so both are anchored and sized
      $isValid = ereg('^3.{15}$|^(2131|1800).{11}$', $inCardNumber);
      break;
  }

  // It passed the rudimentary test; let's check it against the Luhn this time
  if ($isValid) {
    // Work in reverse
    $inCardNumber = strrev($inCardNumber);

    // Total the digits in the number, doubling those in odd-numbered positions
    $theTotal = 0;
    for ($i = 0; $i < strlen($inCardNumber); $i++) {
      $theAdder = (int) $inCardNumber[$i];

      // Double the numbers in odd-numbered positions
      if ($i % 2) {
        $theAdder <<= 1; // shift-and-assign; a bare << would discard the result
        if ($theAdder > 9) { $theAdder -= 9; }
      }
      $theTotal += $theAdder;
    }

    // Valid cards will divide evenly by 10
    $isValid = (($theTotal % 10) == 0);
  }
  return $isValid;
}
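The Luhn step of Example 4-1 can also be exercised on its own. The following is a minimal sketch, not the book's code: the helper name passesLuhn is hypothetical, it skips the per-card-type prefix checks, and it uses preg_replace( ) in place of ereg_replace( ) on the assumption that it runs on PHP 7 or later, where the ereg( ) family is unavailable. The number 4111 1111 1111 1111 is a commonly used test number, not a real card.

```php
<?php
// Hypothetical helper: Luhn check on a string that may contain
// spaces or dashes between the digits.
function passesLuhn(string $number): bool
{
    // Keep digits only (assumption: preg_replace() replaces ereg_replace()).
    $digits = preg_replace('/[^[:digit:]]/', '', $number);
    if ($digits === '') {
        return false; // nothing to check
    }

    $total = 0;
    // Walk from the rightmost digit; every second digit is doubled.
    foreach (str_split(strrev($digits)) as $i => $char) {
        $value = (int) $char;
        if ($i % 2 === 1) {
            $value *= 2;
            if ($value > 9) {
                $value -= 9; // same as adding the two digits of the product
            }
        }
        $total += $value;
    }

    // A valid number's total is a multiple of 10.
    return $total % 10 === 0;
}

var_dump(passesLuhn('4111 1111 1111 1111')); // bool(true)
var_dump(passesLuhn('4111 1111 1111 1112')); // bool(false)
```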
The ereg_replace( ) function takes a pattern, a replacement string, and a string in which to search. It returns a copy of the search string, with text that matched the pattern replaced with the replacement string:
$changed = ereg_replace(pattern, replacement, string);
If the pattern has any grouped subpatterns, the matches are accessible by putting the characters \1 through \9 in the replacement string. For example, we can use ereg_replace( ) to replace characters wrapped with [b] and [/b] tags with equivalent HTML tags:
$string = 'It is [b]not[/b] a matter of diplomacy.';
echo ereg_replace ('\[b]([^]]*)\[/b]', '<b>\1</b>', $string);
It is <b>not</b> a matter of diplomacy.
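For readers on a PHP version without ereg_replace( ) (it was removed in PHP 7.0), the same substitution can be sketched with preg_replace( ). The choice of # as the pattern delimiter is an assumption made here so that the slash in [/b] needs no escaping; the \1 backreference in the replacement works the same way.

```php
<?php
$string = 'It is [b]not[/b] a matter of diplomacy.';

// \[b\] and \[/b\] match the literal tags; ([^\]]*) captures the text
// between them, which \1 reinserts between the HTML tags.
$html = preg_replace('#\[b\]([^\]]*)\[/b\]#', '<b>\1</b>', $string);

echo $html, "\n"; // It is <b>not</b> a matter of diplomacy.
```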
The eregi_replace( ) function is a case-insensitive form of ereg_replace( ). Its arguments and return values are the same as those for ereg_replace( ).
The split( ) function uses a regular expression to divide a string into smaller chunks, which are returned as an array. If an error occurs, split( ) returns false. Optionally, you can say how many chunks to return:
$chunks = split(pattern, string [, limit ]);
The pattern matches the text that separates the chunks. For instance, to split out the terms from an arithmetic expression:
$expression = '3*5+i/6-12';
$terms = split('[/+*-]', $expression);
// $terms is array('3', '5', 'i', '6', '12')
If you specify a limit, the last element of the array holds the rest of the string:
$expression = '3*5+i/6-12';
$terms = split('[/+*-]', $expression, 3);
// $terms is array('3', '5', 'i/6-12')
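Both splitting examples can be reproduced on a current PHP with preg_split( ), shown here as a sketch under the assumption that split( ) itself is unavailable (it was removed in PHP 7.0 along with the ereg( ) functions). The # delimiter is chosen so that the slash inside the character class needs no escaping.

```php
<?php
$expression = '3*5+i/6-12';

// Split on any arithmetic operator; the trailing - in the class is literal.
$terms = preg_split('#[/+*-]#', $expression);
print_r($terms);      // 3, 5, i, 6, 12

// With a limit of 3, the last element holds the rest of the string.
$firstThree = preg_split('#[/+*-]#', $expression, 3);
print_r($firstThree); // 3, 5, i/6-12
```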