POSIX-Style Regular Expressions

Now that you understand the basics of regular expressions, we can explore the details. POSIX-style regular expressions use the Unix locale system. The locale system provides functions for sorting and identifying characters that let you intelligently work with text from languages other than English. In particular, what constitutes a “letter” varies from language to language (think of à and ç), and there are character classes in POSIX regular expressions that take this into account.

However, POSIX regular expressions are designed for use with only textual data. If your data has a NUL-byte (\x00) in it, the regular expression functions will interpret it as the end of the string, and matching will not take place beyond that point. To do matches against arbitrary binary data, you’ll need to use Perl-compatible regular expressions, which are discussed later in this chapter. Also, as we already mentioned, the Perl-style regular expression functions are often faster than the equivalent POSIX-style ones.

Character Classes

As shown in Table 4-7, POSIX defines a number of named sets of characters that you can use in character classes. The expansions given in Table 4-7 are for English. The actual letters vary from locale to locale.

Table 4-7. POSIX character classes

Class       Description                                                              Expansion
[:alnum:]   Alphanumeric characters                                                  [0-9a-zA-Z]
[:alpha:]   Alphabetic characters (letters)                                          [a-zA-Z]
[:ascii:]   7-bit ASCII                                                              [\x01-\x7F]
[:blank:]   Horizontal whitespace (space, tab)                                       [ \t]
[:cntrl:]   Control characters                                                       [\x01-\x1F]
[:digit:]   Digits                                                                   [0-9]
[:graph:]   Characters that use ink to print (non-space, non-control)                [^\x01-\x20]
[:lower:]   Lowercase letter                                                         [a-z]
[:print:]   Printable character (graph class plus space and tab)                     [\t\x20-\xFF]
[:punct:]   Any punctuation character, such as the period (.) and the semicolon (;)  [-!"#$%&'()*+,./:;<=>?@[\\]^_`{|}~]
[:space:]   Whitespace (newline, carriage return, tab, space, vertical tab)          [\n\r\t \x0B]
[:upper:]   Uppercase letter                                                         [A-Z]
[:xdigit:]  Hexadecimal digit                                                        [0-9a-fA-F]

Each [:something:] class can be used in place of a character in a character class. For instance, to find any character that’s a digit, an uppercase letter, or an at sign (@), use the following regular expression:

[@[:digit:][:upper:]]

However, you can’t use a character class as the endpoint of a range:

ereg('[A-[:lower:]]', 'string');        // invalid regular expression
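
To try out the named classes on a current PHP installation, the following minimal sketch may help. It assumes preg_match( ) as a stand-in, because the ereg( ) family was removed in PHP 7.0; PCRE accepts the same [:name:] classes inside a bracket expression. The sample strings are made up for illustration.

```php
<?php
// Assumption: preg_match() stands in for ereg(), which no longer exists in PHP 7+.
// The pattern matches any single character that is an at sign, a digit,
// or an uppercase letter.
$pattern = '/[@[:digit:][:upper:]]/';

// Illustrative sample strings (not from the original text)
$samples = array('user@example', 'abc7', 'Hello', 'lowercase only');

$results = array();
foreach ($samples as $sample) {
  // preg_match() returns 1 when the pattern matches and 0 when it does not
  $results[$sample] = preg_match($pattern, $sample);
}

print_r($results);
```

Only the last sample contains none of the three kinds of character, so it is the only one that fails to match.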

Some locales consider certain character sequences as if they were a single character; these are called collating sequences. To match one of these multicharacter sequences in a character class, enclose it with [. and .]. For example, if your locale has the collating sequence ch, you can match s, t, or ch with this character class:

[st[.ch.]]

The final POSIX extension to character classes is the equivalence class, specified by enclosing the character in [= and =]. Equivalence classes match characters that have the same collating order, as defined in the current locale. For example, a locale may define a, á, and ä as having the same sorting precedence. To match any one of them, the equivalence class is [=a=].

Anchors

An anchor limits a match to a particular location in the string (anchors do not match actual characters in the target string). Table 4-8 lists the anchors supported by POSIX regular expressions.

Table 4-8. POSIX anchors

Anchor    Matches
^         Start of string
$         End of string
[[:<:]]   Start of word
[[:>:]]   End of word

A word boundary is defined as the point between an identifier (alphanumeric or underscore) character and a character that is not one, such as whitespace or punctuation:

ereg('[[:<:]]gun[[:>:]]', 'the Burgundy exploded');    // returns false
ereg('gun', 'the Burgundy exploded');                  // returns true

Note that the beginning and end of a string also qualify as word boundaries.
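
The same word-boundary test can be sketched with the Perl-compatible functions. This example assumes PCRE's \b assertion, which plays the role of both [[:<:]] and [[:>:]], and uses preg_match( ) because ereg( ) is not available in PHP 7.0 and later.

```php
<?php
// Assumption: \b (PCRE word boundary) replaces [[:<:]] and [[:>:]].
$subject = 'the Burgundy exploded';

// "gun" appears only inside "Burgundy", so the whole-word test fails
$wholeWord = preg_match('/\bgun\b/', $subject);

// Without anchors, "gun" is found inside "Burgundy"
$anywhere = preg_match('/gun/', $subject);

// The start of the string counts as a word boundary, so "the" matches
$atStart = preg_match('/\bthe\b/', $subject);

var_dump($wholeWord, $anywhere, $atStart);
```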

Functions

There are three categories of functions for POSIX-style regular expressions: matching, replacing, and splitting.

Matching

The ereg( ) function takes a pattern, a string, and an optional array. It populates the array, if given, and returns true or false depending on whether a match for the pattern was found in the string:

$found = ereg(pattern, string [, captured ]);

For example:

ereg('y.*e$', 'Sylvie');       // returns true
ereg('y(.*)e$', 'Sylvie', $a); // returns true, $a is array('ylvie', 'lvi')

The zeroth element of the array is set to the portion of the string that matched the entire pattern. The first element is the substring that matched the first subpattern, the second element is the substring that matched the second subpattern, and so on.

The eregi( ) function is a case-insensitive form of ereg( ). Its arguments and return values are the same as those for ereg( ).
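
For readers on PHP 7.0 or later, where ereg( ) and eregi( ) no longer exist, the matching examples can be approximated as follows. This is a sketch that assumes preg_match( ) as the replacement; the /i modifier stands in for eregi( ), and the variable names are illustrative.

```php
<?php
// Assumption: preg_match() replaces ereg(); patterns need delimiters (here, slashes).
$found = preg_match('/y(.*)e$/', 'Sylvie', $captured);
// $captured[0] holds the text matched by the whole pattern,
// $captured[1] holds the text matched by the first subpattern.

// Case-sensitive: an uppercase Y does not occur in 'Sylvie'
$foundUpper = preg_match('/Y(.*)E$/', 'Sylvie');

// The /i modifier makes the match case-insensitive, like eregi()
$foundNoCase = preg_match('/Y(.*)E$/i', 'Sylvie', $capturedNoCase);

print_r($captured);
var_dump($foundUpper, $foundNoCase);
```

Unlike the description of ereg( ) above, preg_match( ) returns the integer 1 or 0 rather than true or false; both values behave as expected in an if test.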

Example 4-1 uses pattern matching to determine whether a credit-card number passes the Luhn checksum and whether the digits are appropriate for a card of a specific type.

Example 4-1. Credit-card validator

// The Luhn checksum determines whether a credit-card number is syntactically
// correct; it cannot, however, tell if a card with the number has been issued,
// is currently active, or has enough space left to accept a charge.

function IsValidCreditCard($inCardNumber, $inCardType) {
  // Assume it's okay
  $isValid = true;

  // Strip all non-numbers from the string
  $inCardNumber = ereg_replace('[^[:digit:]]','', $inCardNumber); 

  // Make sure the card number and type match
  switch($inCardType) { 
    case 'mastercard':
      $isValid = ereg('^5[1-5].{14}$', $inCardNumber); 
      break; 

    case 'visa':
      $isValid = ereg('^4.{15}$|^4.{12}$', $inCardNumber); 
      break; 

    case 'amex':
      $isValid = ereg('^3[47].{13}$', $inCardNumber); 
      break; 

    case 'discover':
      $isValid = ereg('^6011.{12}$', $inCardNumber); 
      break; 

    case 'diners':
      $isValid = ereg('^30[0-5].{11}$|^3[68].{12}$', $inCardNumber); 
      break; 

    case 'jcb':
      $isValid = ereg('^3.{15}$|^(2131|1800).{11}$', $inCardNumber);
      break; 
  }

  // It passed the rudimentary test; let's check it against the Luhn this time
  if($isValid) {
    // Work in reverse
    $inCardNumber = strrev($inCardNumber);

    // Total the digits in the number, doubling those in odd-numbered positions
    $theTotal = 0;
    for ($i = 0; $i < strlen($inCardNumber); $i++) {
      $theAdder = (int) $inCardNumber[$i];

      // Double the numbers in odd-numbered positions
      if($i % 2) {
        $theAdder <<= 1;
        if($theAdder > 9) { $theAdder -= 9; }
      }

      $theTotal += $theAdder;
    }

    // Valid cards will divide evenly by 10
    $isValid = (($theTotal % 10) == 0);
  }

  return $isValid;
}
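
The Luhn portion of the validator can be exercised on its own. The following is a minimal sketch, not the example's original code: passesLuhn( ) is a hypothetical helper name, preg_replace( ) is assumed in place of ereg_replace( ), and 4111 1111 1111 1111 is a widely used test number rather than a real card.

```php
<?php
// Hypothetical helper (not part of Example 4-1): performs only the Luhn check.
function passesLuhn($cardNumber) {
  // Strip all non-digits; preg_replace() is assumed in place of ereg_replace()
  $digits = preg_replace('/[^[:digit:]]/', '', $cardNumber);
  if ($digits === '') {
    return false;
  }

  // Work from the rightmost digit
  $reversed = strrev($digits);
  $total = 0;
  for ($i = 0; $i < strlen($reversed); $i++) {
    $digit = (int) $reversed[$i];

    // Double every second digit, counting from the right
    if ($i % 2) {
      $digit *= 2;
      if ($digit > 9) { $digit -= 9; }
    }

    $total += $digit;
  }

  // Valid numbers produce a total that divides evenly by 10
  return ($total % 10) == 0;
}

$good = passesLuhn('4111 1111 1111 1111');  // well-known test number
$bad  = passesLuhn('4111 1111 1111 1112');  // last digit altered
var_dump($good, $bad);
```

Changing a single digit is enough to break the checksum, which is why the second call fails.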

Replacing

The ereg_replace( ) function takes a pattern, a replacement string, and a string in which to search. It returns a copy of the search string, with text that matched the pattern replaced with the replacement string:

$changed = ereg_replace(pattern, replacement, string);

If the pattern has any grouped subpatterns, the matches are accessible by putting the characters \1 through \9 in the replacement string. For example, we can use ereg_replace( ) to replace characters wrapped with [b] and [/b] tags with equivalent HTML tags:

$string = 'It is [b]not[/b] a matter of diplomacy.';
echo ereg_replace ('\[b]([^]]*)\[/b]', '<b>\1</b>', $string);
It is <b>not</b> a matter of diplomacy.

The eregi_replace( ) function is a case-insensitive form of ereg_replace( ). Its arguments and return values are the same as those for ereg_replace( ).
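
On PHP 7.0 and later, the same replacement can be sketched with preg_replace( ), assumed here as the substitute for ereg_replace( ) and eregi_replace( ). The pattern uses # as its delimiter so that the slash in [/b] needs no escaping, and the subpattern is referenced as $1 in the replacement.

```php
<?php
// Assumption: preg_replace() replaces ereg_replace(); '#' delimits the pattern.
$pattern = '#\[b\]([^\]]*)\[/b\]#';

$string  = 'It is [b]not[/b] a matter of diplomacy.';
$changed = preg_replace($pattern, '<b>$1</b>', $string);
echo $changed, "\n";

// Adding the i modifier gives the case-insensitive behavior of eregi_replace()
$changedNoCase = preg_replace($pattern . 'i', '<b>$1</b>', 'It is [B]not[/B] a matter.');
echo $changedNoCase, "\n";
```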

Splitting

The split( ) function uses a regular expression to divide a string into smaller chunks, which are returned as an array. If an error occurs, split( ) returns false. Optionally, you can say how many chunks to return:

$chunks = split(pattern, string [, limit ]);

The pattern matches the text that separates the chunks. For instance, to split out the terms from an arithmetic expression:

$expression = '3*5+i/6-12';
$terms = split('[/+*-]', $expression);
// $terms is array('3', '5', 'i', '6', '12')

If you specify a limit, the last element of the array holds the rest of the string:

$expression = '3*5+i/6-12';
$terms = split('[/+*-]', $expression, 3);
// $terms is array('3', '5', 'i/6-12')
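
Because split( ) was removed in PHP 7.0 together with the ereg( ) functions, the two examples above can be approximated with preg_split( ). This is a sketch under that assumption; the variable names are illustrative.

```php
<?php
// Assumption: preg_split() replaces split(); '#' delimits the pattern.
$expression = '3*5+i/6-12';

// Split on any of the four arithmetic operators
$terms = preg_split('#[/+*-]#', $expression);

// With a limit of 3, the last element holds the rest of the string
$firstThree = preg_split('#[/+*-]#', $expression, 3);

print_r($terms);
print_r($firstThree);
```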
