All the programming languages covered by this book provide a
simple, efficient way to check the length of text. For example,
JavaScript strings have a length
property that holds an integer indicating the string’s length.
However, using regular expressions to check text length can be useful
in some situations, particularly when length is only one of multiple
rules that determine whether the subject text fits the desired
pattern. The following regular expression ensures that text is between
1 and 10 characters long, and additionally limits the text to the
uppercase letters A–Z. You can modify the regular expression to allow
any minimum or maximum text length, or allow characters other than
A–Z.
^[A-Z]{1,10}$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
See Recipe 3.5 for help with implementing this regular expression with other programming languages.
Here’s the breakdown for this very straightforward regex:
^         # Assert position at the beginning of the string.
[A-Z]     # Match one letter from "A" to "Z"...
  {1,10}  #   between 1 and 10 times.
$         # Assert position at the end of the string.
Regex options: Free-spacing
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
The ‹^› and ‹$› anchors ensure that the regex matches the entire subject string; otherwise, it could match 10 characters within longer text. The ‹[A-Z]› character class matches any single uppercase character from A to Z, and the interval quantifier ‹{1,10}› repeats the character class from 1 to 10 times. By combining the interval quantifier with the surrounding start- and end-of-string anchors, the regex will fail to match if the subject text’s length falls outside the desired range.
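As a rough illustration of the anchored pattern in use, here is a minimal Python sketch. The names `UPPER_1_TO_10` and `is_valid` are made up for this example. One Python-specific caveat: `$` can also match just before a single trailing newline, so `\Z` may be a better fit when that matters.

```python
import re

# The regex discussed above: 1 to 10 uppercase ASCII letters and nothing else.
UPPER_1_TO_10 = re.compile(r"^[A-Z]{1,10}$")


def is_valid(text: str) -> bool:
    """Return True when text is 1 to 10 characters long and all A-Z."""
    return UPPER_1_TO_10.search(text) is not None


print(is_valid("HELLO"))        # True
print(is_valid(""))             # False: too short
print(is_valid("ABCDEFGHIJK"))  # False: 11 characters
print(is_valid("Hello"))        # False: lowercase letters not allowed
```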
Note that the character class ‹[A-Z]› explicitly allows only uppercase letters. If you want to also allow the lowercase letters a to z, you can either change the character class to ‹[A-Za-z]› or apply the case-insensitive option. Recipe 3.4 shows how to do this.
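The two ways of admitting lowercase letters might look like this in Python, where the case-insensitive option is the `re.IGNORECASE` flag. The variable names and sample strings are illustrative only.

```python
import re

# Option 1: widen the character class explicitly.
explicit = re.compile(r"^[A-Za-z]{1,10}$")

# Option 2: keep [A-Z] and apply the case-insensitive option.
ignore_case = re.compile(r"^[A-Z]{1,10}$", re.IGNORECASE)

for sample in ("HELLO", "Hello", "hello", "hello1"):
    print(sample, bool(explicit.search(sample)), bool(ignore_case.search(sample)))
```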
A mistake commonly made by new regular expression users is to try to save a few characters by using the character class range ‹[A-z]›. At first glance, this might seem like a clever trick to allow all uppercase and lowercase letters. However, the ASCII character table includes several punctuation characters in positions between the A to Z and a to z ranges. Hence, ‹[A-z]› is actually equivalent to ‹[A-Z[\]^_`a-z]›.
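The pitfall is easy to confirm. The short Python sketch below lists the characters that sit between "Z" and "a" and shows that a hypothetical identifier containing an underscore slips through ‹[A-z]›. The names `between`, `sloppy`, and `strict` are illustrative.

```python
import re

# Characters whose code points fall strictly between "Z" (0x5A) and "a" (0x61).
between = "".join(chr(cp) for cp in range(ord("Z") + 1, ord("a")))
print(between)  # [\]^_`

sloppy = re.compile(r"^[A-z]+$")     # also admits the six characters above
strict = re.compile(r"^[A-Za-z]+$")  # letters only

print(bool(sloppy.search("snake_case")))  # True
print(bool(strict.search("snake_case")))  # False
```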
Because quantifiers such as ‹{1,10}› apply only to the immediately preceding element, limiting the number of characters that can be matched by patterns that include more than a single token requires a different approach.
As explained in Recipe 2.16, lookaheads (and their counterpart, lookbehinds) are a special kind of assertion that, like ‹^› and ‹$›, match a position within the subject string and do not consume any characters. Lookaheads can be either positive or negative, which means they can check if a pattern follows or does not follow the current position in the match. A positive lookahead, written as ‹(?=⋯)›, can be used at the beginning of the pattern to ensure that the string is within the target length range. The remainder of the regex can then validate the desired pattern without worrying about text length. Here’s a simple example:
^(?=.{1,10}$).*
Regex options: Dot matches line breaks
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^(?=[\S\s]{1,10}$)[\S\s]*
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
It is important that the ‹$› anchor appears inside the lookahead because the maximum length test works only if we ensure that there are no more characters after we’ve reached the limit. Because the lookahead at the beginning of the regex enforces the length range, the following pattern can then apply any additional validation rules. In this case, the pattern ‹.*› (or ‹[\S\s]*› in the version that adds JavaScript support) is used to simply match the entire subject text with no added constraints.
The first regex uses the “dot matches line breaks” option so that it will work correctly when your subject string contains line breaks. See Recipe 3.4 for details about how to apply this modifier with your programming language. JavaScript doesn’t have a “dot matches line breaks” option, so the second regex uses a character class that matches any character. See “Any character including line breaks” in Recipe 2.4 for more information.
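Here is a minimal Python sketch of the lookahead technique, where `re.DOTALL` is the “dot matches line breaks” option. The third pattern, `code_1_to_10`, is a hypothetical extra rule (letters, a hyphen, then digits) added only to show the lookahead combined with further validation. As before, Python's `$` also matches just before a final newline, so swap in `\Z` if that edge case matters.

```python
import re

# Length check only: 1 to 10 characters of any kind, line breaks included.
any_1_to_10 = re.compile(r"^(?=.{1,10}$).*", re.DOTALL)

# Same check without the flag, using a class that matches any character.
any_1_to_10_noflag = re.compile(r"^(?=[\S\s]{1,10}$)[\S\s]*")

# Hypothetical additional rule layered on the same length lookahead.
code_1_to_10 = re.compile(r"^(?=.{1,10}$)[A-Z]+-[0-9]+$", re.DOTALL)

print(bool(any_1_to_10.search("ab\ncd")))         # True: 5 characters
print(bool(any_1_to_10.search("x" * 11)))         # False: too long
print(bool(code_1_to_10.search("AB-123")))        # True
print(bool(code_1_to_10.search("ABCDEFGH-123")))  # False: 12 characters
```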
The following regex matches any string that contains between 10 and 100 nonwhitespace characters:
^\s*(?:\S\s*){10,100}$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
In Java, PCRE, Python 2, and Ruby, ‹\s› matches only ASCII whitespace characters, and ‹\S› matches everything else. In Python 2, you can make ‹\s› match all Unicode whitespace by passing the UNICODE or U flag when creating the regex; in Python 3, string patterns are Unicode-aware by default, and the ASCII flag restores ASCII-only matching. Developers using Java, PCRE, and Ruby 1.9 who want to avoid having any Unicode whitespace count against their character limit can switch to the following version that takes advantage of Unicode properties (described in Recipe 2.7):
^[\p{Z}\s]*(?:[^\p{Z}\s][\p{Z}\s]*){10,100}$
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Ruby 1.9
PCRE must be compiled with UTF-8 support for this to work. In PHP, turn on UTF-8 support with the /u pattern modifier.
This latter regex combines the Unicode ‹\p{Z}› Separator property with the ‹\s› shorthand for whitespace. That’s because the characters matched by ‹\p{Z}› and ‹\s› do not completely overlap. ‹\s› includes the characters at positions 0x09 through 0x0D (tab, line feed, vertical tab, form feed, and carriage return), which are not assigned the Separator property by the Unicode standard. By combining ‹\p{Z}› and ‹\s› in a character class, you ensure that all whitespace characters are matched.
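The gap between the two sets can be inspected from Python. The standard `re` module has no `\p{Z}` support, so this sketch uses `unicodedata.category` as a stand-in: a character carries the Separator property when its general category starts with "Z". The helper `is_separator` is a name invented for this example.

```python
import re
import unicodedata


def is_separator(ch: str) -> bool:
    """True when ch has a Unicode general category of Zs, Zl, or Zp."""
    return unicodedata.category(ch).startswith("Z")


# Tab, line feed, vertical tab, form feed, carriage return.
for cp in range(0x09, 0x0E):
    ch = chr(cp)
    print(f"U+{cp:04X}", unicodedata.category(ch),
          "separator:", is_separator(ch),
          r"\s:", bool(re.match(r"\s", ch)))
```

Each of the five control characters is matched by `\s` but reports a category outside the Separator group, which is why the regex above needs both tokens in its character class.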
In both regexes, the interval quantifier ‹{10,100}› is applied to the noncapturing group that precedes it, rather than a single token. The group matches any single nonwhitespace character followed by zero or more whitespace characters. The interval quantifier can reliably track how many nonwhitespace characters are matched because exactly one nonwhitespace character is matched during each iteration.
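To see the one-nonwhitespace-character-per-iteration idea at work, here is a small Python sketch of the first (‹\s›-based) regex. `count_nonspace` is a hypothetical helper that serves only as an independent cross-check on ASCII samples.

```python
import re

# Between 10 and 100 nonwhitespace characters, with whitespace allowed anywhere.
NONSPACE_10_TO_100 = re.compile(r"^\s*(?:\S\s*){10,100}$")


def count_nonspace(text: str) -> int:
    """Count nonwhitespace characters directly, without a regex."""
    return sum(1 for ch in text if not ch.isspace())


for sample in ("a b c d e f g h i j", "a b c", "  abcdefghij  \n"):
    print(repr(sample), count_nonspace(sample),
          bool(NONSPACE_10_TO_100.search(sample)))
```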
The following regex is very similar to the previous example of limiting the number of nonwhitespace characters, except that each repetition matches an entire word rather than a single, nonwhitespace character. It matches between 10 and 100 words, skipping past any nonword characters, including punctuation and whitespace:
^\W*(?:\w+\b\W*){10,100}$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
In Java, JavaScript, PCRE, and Ruby, the word character token ‹\w› in this regex will only match the ASCII characters A–Z, a–z, 0–9, and _, and therefore this cannot correctly count words that contain non-ASCII letters and numbers. In .NET and Perl, ‹\w› is based on the Unicode table (as is its inverse, ‹\W›, and the word boundary ‹\b›) and will match letters and digits from all Unicode scripts. In Python, you can choose whether these tokens should be Unicode-based or not. In Python 2, pass the UNICODE or U flag when creating the regex to make them Unicode-based; in Python 3, string patterns are Unicode-based by default, and the ASCII flag makes them ASCII-only.
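The following Python 3 sketch wraps the word-counting regex in a hypothetical helper, `word_count_regex`, with adjustable limits. The `ascii_only` parameter is an assumption of this example. It maps to `re.ASCII` so that the ASCII-only behavior described above can be compared with Unicode-aware matching.

```python
import re


def word_count_regex(minimum: int, maximum: int, ascii_only: bool = False):
    """Build the word-count regex for the given inclusive range."""
    flags = re.ASCII if ascii_only else 0
    return re.compile(r"^\W*(?:\w+\b\W*){%d,%d}$" % (minimum, maximum), flags)


ten_words = "one two three four five six seven eight nine ten"
print(bool(word_count_regex(10, 100).search(ten_words)))  # True

# With ASCII-only \w, the non-ASCII letter splits "naïve" into two "words".
word = "na\u00efve"
print(bool(word_count_regex(1, 1).search(word)))                   # True
print(bool(word_count_regex(1, 1, ascii_only=True).search(word)))  # False
print(bool(word_count_regex(2, 2, ascii_only=True).search(word)))  # True
```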
If you want to count words that contain non-ASCII letters and numbers, the following regexes provide this capability for additional regex flavors:
^[^\p{L}\p{N}_]*(?:[\p{L}\p{N}_]+\b[^\p{L}\p{N}_]*){10,100}$
Regex options: None
Regex flavors: .NET, Java, Perl
^[^\p{L}\p{N}_]*(?:[\p{L}\p{N}_]+(?:[^\p{L}\p{N}_]+|$)){10,100}$
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Ruby 1.9
PCRE must be compiled with UTF-8 support for this to work. In PHP, turn on UTF-8 support with the /u pattern modifier.
As noted, the reason for these different (but equivalent) regexes is the varying handling of the word character and word boundary tokens explained in Word Characters in Recipe 2.6.
The last two regexes use character classes that include the separate Unicode properties for letters and numbers (‹\p{L}› and ‹\p{N}›), and manually add the underscore character to each class to make them equivalent to the earlier regex that relied on ‹\w› and ‹\W›.
Each repetition of the noncapturing group in the first two of these three regexes matches an entire word followed by zero or more nonword characters. The ‹\W› (or ‹[^\p{L}\p{N}_]›) token inside the group is allowed to repeat zero times in case the string ends with a word character. However, since this effectively makes the nonword character sequence optional throughout the matching process, the word boundary assertion ‹\b› is needed between ‹\w› and ‹\W› (or ‹[\p{L}\p{N}_]› and ‹[^\p{L}\p{N}_]›), to ensure that each repetition of the group really matches an entire word. Without the word boundary, a single repetition would be allowed to match any part of a word, with subsequent repetitions matching additional pieces.
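The effect of dropping the word boundary can be checked with a single ten-letter string. In this Python sketch (variable names are illustrative), the version without ‹\b› wrongly accepts one word as ten, because each repetition is free to match a single letter.

```python
import re

with_boundary = re.compile(r"^\W*(?:\w+\b\W*){10,100}$")
without_boundary = re.compile(r"^\W*(?:\w+\W*){10,100}$")

single_word = "abcdefghij"  # one word, ten letters

print(bool(with_boundary.search(single_word)))     # False: only one word
print(bool(without_boundary.search(single_word)))  # True: ten one-letter "words"
```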
The third version of the regex (which adds support for PCRE and Ruby 1.9) works a bit differently. It uses a plus (one or more) instead of an asterisk (zero or more) quantifier, and explicitly allows matching zero characters only if the matching process has reached the end of the string. This allows us to avoid the word boundary token. Avoiding it is necessary to ensure accuracy, since ‹\b› is not Unicode-enabled in PCRE or Ruby. ‹\b› is Unicode-enabled in Java, even though Java’s ‹\w› is not.
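The structure of the third version can be sketched in Python as well. Python's standard `re` module does not support `\p{L}` or `\p{N}`, so this example substitutes ‹\w› and ‹\W› for the Unicode property classes. That substitution is an assumption made purely to show how "one or more nonword characters, or the end of the string" replaces the word boundary.

```python
import re

# \w and \W stand in for [\p{L}\p{N}_] and [^\p{L}\p{N}_] in this sketch.
words_10_to_100 = re.compile(r"^\W*(?:\w+(?:\W+|$)){10,100}$")

punctuated = "One, two; three: four five six seven eight nine ten!"
single_word = "abcdefghij"

print(bool(words_10_to_100.search(punctuated)))   # True: ten words
print(bool(words_10_to_100.search(single_word)))  # False: one word, no \b needed
```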
Unfortunately, none of these options allow JavaScript or Ruby 1.8 to correctly handle words that use non-ASCII characters. A possible workaround is to reframe the regex to count whitespace rather than word character sequences, as shown here:
^\s*(?:\S+(?:\s+|$)){10,100}$
Regex options: None
Regex flavors: .NET, Java, JavaScript, Perl, PCRE, Python, Ruby
In many cases, this will work the same as the previous solutions, although it’s not exactly equivalent. For example, one difference is that compounds joined by a hyphen (such as “far-reaching”) will now be counted as one word instead of two.
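The hyphenation difference is easy to reproduce. In this Python sketch, `token_count_regex` and `word_count_regex` are hypothetical helpers that build the whitespace-delimited and ‹\w›-based regexes with adjustable limits, so the two approaches can be compared on the same input.

```python
import re


def token_count_regex(minimum: int, maximum: int):
    """Count runs of nonwhitespace separated by whitespace."""
    return re.compile(r"^\s*(?:\S+(?:\s+|$)){%d,%d}$" % (minimum, maximum))


def word_count_regex(minimum: int, maximum: int):
    """Count runs of word characters, as in the earlier \\w-based regex."""
    return re.compile(r"^\W*(?:\w+\b\W*){%d,%d}$" % (minimum, maximum))


compound = "far-reaching"
print(bool(token_count_regex(1, 1).search(compound)))  # True: one token
print(bool(word_count_regex(1, 1).search(compound)))   # False
print(bool(word_count_regex(2, 2).search(compound)))   # True: two words
```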