4.9. Limit the Length of Text

Problem

You want to test whether a string is composed of between 1 and 10 letters from A to Z.

Solution

All the programming languages covered by this book provide a simple, efficient way to check the length of text. For example, JavaScript strings have a length property that holds an integer indicating the string’s length. However, using regular expressions to check text length can be useful in some situations, particularly when length is only one of multiple rules that determine whether the subject text fits the desired pattern. The following regular expression ensures that text is between 1 and 10 characters long, and additionally limits the text to the uppercase letters A–Z. You can modify the regular expression to allow any minimum or maximum text length, or to allow characters other than A–Z.

Regular expression

^[A-Z]{1,10}$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Perl

if ($ARGV[0] =~ /^[A-Z]{1,10}$/) {
    print "Input is valid\n";
} else {
    print "Input is invalid\n";
}

Other programming languages

See Recipe 3.5 for help with implementing this regular expression with other programming languages.

Discussion

Here’s the breakdown for this very straightforward regex:

^         # Assert position at the beginning of the string.
[A-Z]     # Match one letter from "A" to "Z"...
  {1,10}  #   between 1 and 10 times.
$         # Assert position at the end of the string.
Regex options: Free-spacing
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

The ^ and $ anchors ensure that the regex matches the entire subject string; without them, it could match a run of 1 to 10 letters inside longer text. The [A-Z] character class matches any single uppercase character from A to Z, and the interval quantifier {1,10} repeats the character class from 1 to 10 times. By combining the interval quantifier with the surrounding start- and end-of-string anchors, the regex will fail to match if the subject text’s length falls outside the desired range.
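The effect of the anchors can be checked with a short Python sketch. The names is_valid and is_valid_unanchored are hypothetical, introduced only for illustration; re.search is used so that the anchors, rather than the API call, decide where a match may start and end.

```python
import re

ANCHORED = re.compile(r"^[A-Z]{1,10}$")
UNANCHORED = re.compile(r"[A-Z]{1,10}")  # same pattern without ^ and $


def is_valid(text):
    """Return True if text is made up of 1 to 10 letters A-Z and nothing else."""
    return ANCHORED.search(text) is not None


def is_valid_unanchored(text):
    """Return True if text merely contains a run of 1 to 10 letters A-Z."""
    return UNANCHORED.search(text) is not None


print(is_valid("REGEX"))                   # True
print(is_valid(""))                        # False: fewer than 1 letter
print(is_valid("ABCDEFGHIJK"))             # False: 11 letters
print(is_valid_unanchored("ABCDEFGHIJK"))  # True: matches the first 10 letters
```

One Python-specific caveat: $ also matches just before a line break at the very end of the string, so is_valid("REGEX\n") returns True. Using re.fullmatch, or \Z in place of $, avoids that.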

Note that the character class [A-Z] explicitly allows only uppercase letters. If you want to also allow the lowercase letters a to z, you can either change the character class to [A-Za-z] or apply the case-insensitive option. Recipe 3.4 shows how to do this.

A mistake commonly made by new regular expression users is to try to save a few characters by using the character class range [A-z]. At first glance, this might seem like a clever trick to allow all uppercase and lowercase letters. However, the ASCII character table includes six punctuation characters in the positions between the A to Z and a to z ranges: [, \, ], ^, _, and `. Hence, [A-z] is actually equivalent to [A-Z\[\\\]^_`a-z].
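A quick way to see the problem is to ask the regex engine which ASCII characters [A-z] accepts beyond the letters. The following Python sketch does that; the variable name extras is just an illustrative choice.

```python
import re

# Collect every ASCII character that [A-z] matches but that is not a letter.
extras = [
    chr(code)
    for code in range(128)
    if re.fullmatch(r"[A-z]", chr(code)) and not chr(code).isalpha()
]
print(extras)  # ['[', '\\', ']', '^', '_', '`']

# As a consequence, input that is clearly not all letters slips through.
print(bool(re.fullmatch(r"[A-z]+", "user_name")))     # True
print(bool(re.fullmatch(r"[A-Za-z]+", "user_name")))  # False
```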

Variations

Limit the length of an arbitrary pattern

Because quantifiers such as {1,10} apply only to the immediately preceding element, limiting the number of characters that can be matched by patterns that include more than a single token requires a different approach.

As explained in Recipe 2.16, lookaheads (and their counterpart, lookbehinds) are a special kind of assertion that, like ^ and $, match a position within the subject string and do not consume any characters. Lookaheads can be either positive or negative, which means they can check if a pattern follows or does not follow the current position in the match. A positive lookahead, written as (?=...), can be used at the beginning of the pattern to ensure that the string is within the target length range. The remainder of the regex can then validate the desired pattern without worrying about text length. Here’s a simple example:

^(?=.{1,10}$).*
Regex options: Dot matches line breaks
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

^(?=[\S\s]{1,10}$)[\S\s]*
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

It is important that the $ anchor appears inside the lookahead because the maximum length test works only if we ensure that there are no more characters after we’ve reached the limit. Because the lookahead at the beginning of the regex enforces the length range, the following pattern can then apply any additional validation rules. In this case, the pattern .* (or [\S\s]* in the version that adds JavaScript support) is used to simply match the entire subject text with no added constraints.

The first regex uses the “dot matches line breaks” option so that it will work correctly when your subject string contains line breaks. See Recipe 3.4 for details about how to apply this modifier with your programming language. JavaScript doesn’t have a “dot matches line breaks” option, so the second regex uses a character class that matches any character. See “Any character including line breaks” in Recipe 2.4 for more information.
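To illustrate how the lookahead separates the length check from the rest of the pattern, here is a minimal Python sketch. The validation rule (uppercase letters, a hyphen, then digits) and the name CODE are hypothetical, chosen only to give the lookahead a multi-token pattern to guard.

```python
import re

# Hypothetical rule: uppercase letters, a hyphen, then digits (e.g. "AB-123"),
# with the whole string limited to between 1 and 10 characters.
CODE = re.compile(r"^(?=[\S\s]{1,10}$)[A-Z]+-[0-9]+$")

for sample in ["AB-123", "ABCDEF-789", "ABCDEF-7890", "AB123"]:
    print(sample, bool(CODE.search(sample)))
# AB-123 True
# ABCDEF-789 True       (exactly 10 characters)
# ABCDEF-7890 False     (right shape, but 11 characters)
# AB123 False           (short enough, but wrong shape)
```

A single quantifier could not express this limit, because the ten characters are shared among three separate tokens.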

Limit the number of nonwhitespace characters

The following regex matches any string that contains between 10 and 100 nonwhitespace characters:

^\s*(?:\S\s*){10,100}$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

In Java, PCRE, Python, and Ruby, \s matches only ASCII whitespace characters, and \S matches everything else. In Python, you can make \s match all Unicode whitespace by passing the UNICODE or U flag when creating the regex. Developers using Java, PCRE, and Ruby 1.9 who want to avoid having any Unicode whitespace count against their character limit can switch to the following version that takes advantage of Unicode properties (described in Recipe 2.7):

^[\p{Z}\s]*(?:[^\p{Z}\s][\p{Z}\s]*){10,100}$
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Ruby 1.9

PCRE must be compiled with UTF-8 support for this to work. In PHP, turn on UTF-8 support with the /u pattern modifier.

This latter regex combines the Unicode \p{Z} Separator property with the \s shorthand for whitespace. That’s because the characters matched by \p{Z} and \s do not completely overlap. \s includes the characters at positions 0x09 through 0x0D (tab, line feed, vertical tab, form feed, and carriage return), which are not assigned the Separator property by the Unicode standard. By combining \p{Z} and \s in a character class, you ensure that all whitespace characters are matched.
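The gap between \p{Z} and \s can be inspected directly. Python's built-in re module does not support \p{Z}, so the sketch below uses the standard unicodedata module as a stand-in: a character has the Separator property when its general category starts with "Z". The helper name is_separator is hypothetical.

```python
import unicodedata


def is_separator(ch):
    """Return True if ch is in a Unicode Separator category (Zs, Zl, or Zp)."""
    return unicodedata.category(ch).startswith("Z")


for code in [0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20, 0xA0, 0x2028, 0x2029]:
    ch = chr(code)
    print(f"U+{code:04X}", unicodedata.category(ch), is_separator(ch))
# U+0009 through U+000D report category Cc (control), so \p{Z} alone
# would miss them; the space and no-break space report Zs.
```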

In both regexes, the interval quantifier {10,100} is applied to the noncapturing group that precedes it, rather than a single token. The group matches any single nonwhitespace character followed by zero or more whitespace characters. The interval quantifier can reliably track how many nonwhitespace characters are matched because exactly one nonwhitespace character is matched during each iteration.
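The counting behavior can be tried out with the first (\s-based) regex in Python. This is only a sketch: the helper name count_ok is hypothetical, and the example assumes Python 3, where \s and \S applied to a str pattern are Unicode-aware by default.

```python
import re

NONWS_10_TO_100 = re.compile(r"^\s*(?:\S\s*){10,100}$")


def count_ok(text):
    """Return True if text holds between 10 and 100 nonwhitespace characters."""
    return NONWS_10_TO_100.search(text) is not None


print(count_ok("a b c d e f g h i j"))  # True: 10 nonwhitespace characters
print(count_ok("  abc   def  "))        # False: only 6
print(count_ok("x" * 100))              # True: exactly 100
print(count_ok("x " * 101))             # False: 101
```

Whitespace between, before, or after the counted characters never affects the result, because each iteration of the group consumes exactly one nonwhitespace character.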

Limit the number of words

The following regex is very similar to the previous example of limiting the number of nonwhitespace characters, except that each repetition matches an entire word rather than a single nonwhitespace character. It matches between 10 and 100 words, skipping past any nonword characters, including punctuation and whitespace:

^\W*(?:\w+\b\W*){10,100}$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

In Java, JavaScript, PCRE, and Ruby, the word character token \w in this regex will only match the ASCII characters A–Z, a–z, 0–9, and _, and therefore this regex cannot correctly count words that contain non-ASCII letters and numbers. In .NET and Perl, \w is based on the Unicode table (as is its inverse, \W, and the word boundary \b) and will match letters and digits from all Unicode scripts. In Python, you can choose whether these tokens should be Unicode-based or not, based on whether you pass the UNICODE or U flag when creating the regex.

If you want to count words that contain non-ASCII letters and numbers, the following regexes provide this capability for additional regex flavors:

^[^\p{L}\p{N}_]*(?:[\p{L}\p{N}_]+\b[^\p{L}\p{N}_]*){10,100}$
Regex options: None
Regex flavors: .NET, Java, Perl

^[^\p{L}\p{N}_]*(?:[\p{L}\p{N}_]+(?:[^\p{L}\p{N}_]+|$)){10,100}$
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Ruby 1.9

PCRE must be compiled with UTF-8 support for this to work. In PHP, turn on UTF-8 support with the /u pattern modifier.

As noted, the reason for these different (but equivalent) regexes is the varying handling of the word character and word boundary tokens explained in Word Characters in Recipe 2.6.

The last two regexes use character classes that include the separate Unicode properties for letters and numbers (\p{L} and \p{N}), and manually add the underscore character to each class to make them equivalent to the earlier regex that relied on \w and \W.

Each repetition of the noncapturing group in the first two of these three regexes matches an entire word followed by zero or more nonword characters. The \W (or [^\p{L}\p{N}_]) token inside the group is allowed to repeat zero times in case the string ends with a word character. However, since this effectively makes the nonword character sequence optional throughout the matching process, the word boundary assertion \b is needed between \w and \W (or [\p{L}\p{N}_] and [^\p{L}\p{N}_]), to ensure that each repetition of the group really matches an entire word. Without the word boundary, a single repetition would be allowed to match any part of a word, with subsequent repetitions matching additional pieces.
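The role of the word boundary is easy to demonstrate by removing it. The Python sketch below compares the \w-based regex with and without \b; the constant names are hypothetical, and the sample text is plain ASCII so the result does not depend on Unicode settings.

```python
import re

WITH_BOUNDARY = re.compile(r"^\W*(?:\w+\b\W*){10,100}$")
WITHOUT_BOUNDARY = re.compile(r"^\W*(?:\w+\W*){10,100}$")

single_word = "abcdefghij"  # one word made of ten letters
ten_words = "one two three four five six seven eight nine ten"

print(bool(WITH_BOUNDARY.search(single_word)))     # False: only 1 word
print(bool(WITHOUT_BOUNDARY.search(single_word)))  # True: counted as 10 pieces
print(bool(WITH_BOUNDARY.search(ten_words)))       # True: 10 words
```

Without \b, the engine backtracks until each of the ten repetitions matches a single letter, so one ten-letter word is wrongly accepted as ten words.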

The third version of the regex (which adds support for PCRE and Ruby 1.9) works a bit differently. It uses a plus (one or more) instead of an asterisk (zero or more) quantifier, and explicitly allows matching zero characters only if the matching process has reached the end of the string. This makes the word boundary token unnecessary. Avoiding \b is required for accuracy here, since \b is not Unicode-enabled in PCRE or Ruby. \b is Unicode-enabled in Java, even though Java’s \w is not.

Unfortunately, none of these options allow JavaScript or Ruby 1.8 to correctly handle words that use non-ASCII characters. A possible workaround is to reframe the regex to count whitespace rather than word character sequences, as shown here:

^\s*(?:\S+(?:\s+|$)){10,100}$
Regex options: None
Regex flavors: .NET, Java, JavaScript, Perl, PCRE, Python, Ruby

In many cases, this will work the same as the previous solutions, although it’s not exactly equivalent. For example, one difference is that compounds joined by a hyphen (such as “far-reaching”) will now be counted as one word instead of two.
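The hyphen difference can be reproduced with a short Python sketch that runs both approaches on the same text. The constant names are hypothetical, and the sample sentence was chosen so that the two counts fall on opposite sides of the 10-word minimum.

```python
import re

BY_WORD_CHARS = re.compile(r"^\W*(?:\w+\b\W*){10,100}$")
BY_WHITESPACE = re.compile(r"^\s*(?:\S+(?:\s+|$)){10,100}$")

# 9 whitespace-separated tokens, but 10 runs of word characters,
# because "far-reaching" splits into "far" and "reaching".
text = "a far-reaching plan can take one two three four"

print(bool(BY_WORD_CHARS.search(text)))  # True: counts 10 words
print(bool(BY_WHITESPACE.search(text)))  # False: counts 9 words

# Adding one more token satisfies both versions.
print(bool(BY_WHITESPACE.search(text + " five")))  # True: 10 tokens
```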

See Also

Recipes 4.8 and 4.10
