4.10. Limit the Number of Lines in Text

Problem

You need to check whether a string consists of five or fewer lines, regardless of how many total characters appear in the string.

Solution

The exact characters or character sequences used as line separators can vary depending on your operating system’s convention, application or user preferences, and so on. Crafting an ideal solution therefore raises questions about what conventions should be supported to indicate the start of a new line. The following solutions support the standard MS-DOS/Windows (\r\n), legacy Mac OS (\r), and Unix/Linux/OS X (\n) line break conventions.

Regular expression

The following three flavor-specific regexes differ in two ways. The first regex uses atomic groups, written as (?>...), instead of noncapturing groups, written as (?:...), because they have the potential to provide a minor efficiency improvement here for the regex flavors that support them. Python (before version 3.11) and JavaScript do not support atomic groups, so they are not used with those flavors. The other difference is the tokens used to assert position at the beginning and end of the string (\A or ^ for the beginning of the string, and \z, \Z, or $ for the end). The reasons for this variation are discussed in depth later in this recipe. All three flavor-specific regexes match exactly the same strings:

\A(?>(?>\r\n?|\n)?[^\r\n]*){0,5}\z
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Ruby
\A(?:(?:\r\n?|\n)?[^\r\n]*){0,5}\Z
Regex options: None
Regex flavor: Python
^(?:(?:\r\n?|\n)?[^\r\n]*){0,5}$
Regex options: None
Regex flavor: JavaScript

PHP (PCRE)

if (preg_match('/\A(?>(?>\r\n?|\n)?[^\r\n]*){0,5}\z/', $_POST['subject'])) {
    print 'Subject contains five or fewer lines';
} else {
    print 'Subject contains more than five lines';
}

Other programming languages

See Recipe 3.5 for help implementing these regular expressions with other programming languages.
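
As one illustration, the following is a minimal Python sketch of the same check, using the Python-flavor regex shown earlier. The names FIVE_LINES_MAX and has_at_most_five_lines are hypothetical choices made for this example, not identifiers used elsewhere in this book.

```python
import re

# Python-flavor regex from this recipe; \Z asserts the absolute end of the string.
FIVE_LINES_MAX = re.compile(r'\A(?:(?:\r\n?|\n)?[^\r\n]*){0,5}\Z')

def has_at_most_five_lines(text):
    """Return True if text has five or fewer lines (hypothetical helper)."""
    return FIVE_LINES_MAX.search(text) is not None

# Five lines that mix CRLF, CR, and LF line breaks.
print(has_at_most_five_lines('one\r\ntwo\rthree\nfour\nfive'))
# Six lines separated by LF.
print(has_at_most_five_lines('1\n2\n3\n4\n5\n6'))
```

The first call should report a match and the second should not, since the sixth line would require a sixth repetition of the group.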

Discussion

All of the regular expressions shown so far in this recipe use a grouping that matches an MS-DOS/Windows, legacy Mac OS, or Unix/Linux/OS X line break sequence followed by any number of non-line-break characters. The grouping is repeated between zero and five times, since we’re matching up to five lines.

In the following example, we’ve broken up the JavaScript version of the regex into its individual parts. We’ve used the JavaScript version here because its elements are probably familiar to the widest range of readers. We’ll explain the variations for alternative regex flavors afterward:

^          # Assert position at the beginning of the string.
(?:        # Group but don't capture...
  (?:      #   Group but don't capture...
    \r     #     Match a carriage return (CR, ASCII position 0x0D).
    \n     #     Match a line feed (LF, ASCII position 0x0A)...
      ?    #       between zero and one time.
   |       #    or...
    \n     #     Match a line feed character.
  )        #   End the noncapturing group.
    ?      #     Repeat the preceding group between zero and one time.
  [^\r\n]  #   Match any single character except CR or LF...
    *      #     between zero and unlimited times.
)          # End the noncapturing group.
  {0,5}    #   Repeat the preceding group between zero and five times.
$          # Assert position at the end of the string.
Regex options: Free-spacing
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

The leading ^ matches the position at the beginning of the string. This helps to ensure that the entire string contains no more than five lines, because unless the regex is forced to start at the beginning of the string, it can match any five lines within a longer string.
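
The effect of the leading anchor can be sketched in Python by comparing the Python-flavor regex with a copy that omits \A. The variable names here are illustrative only.

```python
import re

seven_lines = '1\n2\n3\n4\n5\n6\n7'

# The same pattern with and without the start-of-string anchor.
anchored = re.compile(r'\A(?:(?:\r\n?|\n)?[^\r\n]*){0,5}\Z')
unanchored = re.compile(r'(?:(?:\r\n?|\n)?[^\r\n]*){0,5}\Z')

# Anchored: the whole string must fit in five repetitions, so there is no match.
print(anchored.search(seven_lines))

# Unanchored: the engine is free to start partway through the string
# and match only the final lines.
match = unanchored.search(seven_lines)
print(match is not None and match.start() > 0)
```

Without \A, the search succeeds by skipping the first lines, which is exactly the false positive the anchor prevents.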

Next, a noncapturing group encloses the combination of a line break sequence and any number of non-line-break characters. The immediately following quantifier allows this group to repeat between zero and five times (zero repetitions would match a completely empty string). Within the outer group, an optional subgroup matches a line break sequence. Next is the character class that matches any number of non-line-break characters.

Take a close look at the order of the outer group’s elements (first a line break, then a non-line-break sequence). If we reversed the order so that the group was instead written as (?:[^\r\n]*(?:\r\n?|\n)?), a fifth repetition would allow a trailing line break. Effectively, such a change would allow an empty, sixth line.
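
To see the difference, here is a small Python sketch that tests both orderings against five lines followed by a trailing line break. The pattern names are made up for this example.

```python
import re

# Order used in this recipe: line break first, then non-line-break characters.
original = re.compile(r'\A(?:(?:\r\n?|\n)?[^\r\n]*){0,5}\Z')
# Reversed order: non-line-break characters first, then the line break.
reversed_order = re.compile(r'\A(?:[^\r\n]*(?:\r\n?|\n)?){0,5}\Z')

# Five lines of text plus a trailing line break (an empty sixth line).
trailing_break = 'a\nb\nc\nd\ne\n'

print(original.search(trailing_break) is not None)        # False
print(reversed_order.search(trailing_break) is not None)  # True
```

With the reversed order, each of the five repetitions consumes one line and its following line break, so the trailing break slips through.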

The subgroup allows any of three line break sequences:

  • A carriage return followed by a line feed (\r\n, the conventional MS-DOS/Windows line break sequence)

  • A standalone carriage return (\r, the legacy Mac OS line break character)

  • A standalone line feed (\n, the conventional Unix/Linux/OS X line break character)
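
The three alternatives can be checked in isolation with a short Python sketch that pulls the subgroup's pattern out of the full regex. The sample string is invented for illustration.

```python
import re

# The line break subgroup on its own: CRLF, lone CR, or lone LF.
line_break = re.compile(r'\r\n?|\n')

sample = 'dos\r\nmac\runix\nend'
print(line_break.findall(sample))  # ['\r\n', '\r', '\n']
```

Because the optional \n after \r is matched greedily, a CRLF pair is consumed as one line break rather than being counted as two.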

Now let’s move on to the cross-flavor differences.

The first version of the regex (used by all flavors except Python and JavaScript) uses atomic groups rather than simple noncapturing groups. Although in some cases the use of atomic groups can have a much more profound impact, in this case they simply let the regex engine avoid a bit of unnecessary backtracking that can occur if the match attempt fails (see Recipe 2.15 for more information about atomic groups).

The other cross-flavor differences are the tokens used to assert position at the beginning and end of the string. The breakdown shown earlier used ^ and $ for these purposes. Although these anchors are supported by all of the regex flavors discussed here, the alternative regexes in this section used \A, \Z, and \z instead. The short explanation for this is that the meaning of these metacharacters differs slightly between regular expression flavors. The long explanation leads us to a bit of regex history....

When using Perl to read a line from a file, the resulting string ends with a line break. Hence, Perl introduced an “enhancement” to the traditional meaning of $ that has since been copied by most regex flavors. In addition to matching the absolute end of a string, Perl’s $ matches just before a string-terminating line break. Perl also introduced two more assertions that match the end of a string: \Z and \z. Perl’s \Z anchor has the same quirky meaning as $, except that it doesn’t change when the option to let ^ and $ match at line breaks is enabled. \z always matches only the absolute end of a string, no exceptions. Since this recipe explicitly deals with line breaks in order to count the lines in a string, it uses the \z assertion for the regex flavors that support it, to ensure that an empty, sixth line is not allowed.

Most of the other regex flavors copied Perl’s end-of-line/string anchors. .NET, Java, PCRE, and Ruby all support both \Z and \z with the same meanings as Perl. Python includes only \Z (uppercase), but confusingly changes its meaning to match only the absolute end of the string, just like Perl’s lowercase \z. JavaScript doesn’t include any “z” anchors, but unlike all of the other flavors discussed here, its $ anchor matches only at the absolute end of the string (when the option to let ^ and $ match at line breaks is not enabled).
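
Python's side of this story can be observed with the following sketch. It covers only Python's re module; the other flavors would need to be checked in their own engines.

```python
import re

text = 'last line\n'

# In Python, $ also matches just before a string-terminating line feed.
dollar_matches = re.search(r'line$', text) is not None
# Python's \Z matches only at the absolute end of the string.
upper_z_matches = re.search(r'line\Z', text) is not None

print(dollar_matches)   # True
print(upper_z_matches)  # False
```

This is why the Python version of the regex in this recipe ends with \Z rather than $.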

As for \A, the situation is slightly better. It always matches only at the start of a string, and it means exactly the same thing in all flavors discussed here, except JavaScript (which doesn’t support it).

Although it’s unfortunate that these kinds of confusing cross-flavor inconsistencies exist, one of the benefits of using the regular expressions in this book is that you generally won’t need to worry about them. Gory details like the ones we’ve just described are included in case you care to dig deeper.

Variations

Working with esoteric line separators

The previously shown regexes limit support to the conventional MS-DOS/Windows, Unix/Linux/OS X, and legacy Mac OS line break character sequences. However, there are several rarer vertical whitespace characters that you might encounter occasionally. The following regexes take these additional characters into account while limiting matches to five or fewer lines of text.

\A(?>\R?\V*){0,5}\z
Regex options: None
Regex flavors: PCRE 7 (with the PCRE_BSR_UNICODE option), Perl 5.10
\A(?>(?>\r\n?|[\n-\f\x85\x{2028}\x{2029}])?↵
[^\n-\r\x85\x{2028}\x{2029}]*){0,5}\z
Regex options: None
Regex flavors: PCRE, Perl
\A(?>(?>\r\n?|[\n-\f\x85\u2028\u2029])?[^\n-\r\x85\u2028\u2029]*){0,5}\z
Regex options: None
Regex flavors: .NET, Java, Ruby
\A(?:(?:\r\n?|[\n-\f\x85\u2028\u2029])?[^\n-\r\x85\u2028\u2029]*){0,5}\Z
Regex options: None
Regex flavor: Python
^(?:(?:\r\n?|[\n-\f\x85\u2028\u2029])?[^\n-\r\x85\u2028\u2029]*){0,5}$
Regex options: None
Regex flavor: JavaScript

All of these regexes handle the line separators in Table 4-1, listed with their Unicode positions and names.

Table 4-1. Line separators

Unicode sequence  Regex equivalent    Name                                       When used
U+000D U+000A     \r\n                Carriage return and line feed (CRLF)       Windows and MS-DOS text files
U+000A            \n                  Line feed (LF)                             Unix, Linux, and OS X text files
U+000B            \v                  Line tabulation (aka vertical tab, or VT)  (Rare)
U+000C            \f                  Form feed (FF)                             (Rare)
U+000D            \r                  Carriage return (CR)                       Mac OS text files
U+0085            \x85                Next line (NEL)                            IBM mainframe text files (Rare)
U+2028            \u2028 or \x{2028}  Line separator                             (Rare)
U+2029            \u2029 or \x{2029}  Paragraph separator                        (Rare)
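
As a rough check of the extended Python-flavor regex, the sketch below compares it with the basic regex from the start of this recipe on strings that use only the rarer separators. The constant and variable names are hypothetical.

```python
import re

# Basic Python-flavor regex: recognizes only CRLF, CR, and LF.
BASIC = re.compile(r'\A(?:(?:\r\n?|\n)?[^\r\n]*){0,5}\Z')
# Extended Python-flavor regex: also recognizes VT, FF, NEL, U+2028, and U+2029.
EXTENDED = re.compile(
    r'\A(?:(?:\r\n?|[\n-\f\x85\u2028\u2029])?[^\n-\r\x85\u2028\u2029]*){0,5}\Z'
)

# Five lines separated by U+2028, NEL, VT, and FF.
five = 'a\u2028b\x85c\x0bd\x0ce'
# A sixth line added with a paragraph separator (U+2029).
six = five + '\u2029f'

print(EXTENDED.search(five) is not None)  # True
print(EXTENDED.search(six) is not None)   # False
print(BASIC.search(six) is not None)      # True: the basic regex sees a single line
```

The basic regex accepts the six-line string because it contains no CR or LF characters, which shows why the extended character classes are needed when these separators can occur.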

See Also

Recipe 4.9
