Errata for Regular Expressions Cookbook

The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.

Version Location Description Submitted by Date submitted
ePub Page contents
whole contents

I'm using the ePub version with the iBooks app on an iPad Air. The contents page reacts extremely slowly (clicking on an entry takes 15-20 seconds until the linked page shows up). Opening the contents page from the book may also have massive delays. Scrolling works fine in the contents, though. Turning pages in the book itself works fine as well.

This behaviour does not always happen right from the start, sometimes I need to click two or three links in the contents before it becomes slow. Once slow, it remains slow.
I have not experienced this with other ePub ebooks, therefore I assume this is a problem with the book itself.

Nicole Rauch  Aug 31, 2014 
ePub Page Section 2.9, Group with mode modifiers
hard to tell with epub

I'm not exactly sure if this is really an error, so take with a grain of salt:

While the modifiers
(?ism:)
(?-ism:)
both have a colon at the end (making the statement a noncapturing group with mode modifiers),
the third example
(?i-sm)
doesn't have a colon, which would make it a simple mode modifier. While this is not invalid syntax, I would assume that for consistency, the third example would also need a colon:
(?i-sm:)
right?
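
The difference between the two forms can be checked in any flavor that supports both. The sketch below uses Python's re module (3.6 or later) purely as an illustration; it is not taken from the book, and the sample strings are made up.

```python
import re

# Scoped form: the colon makes a noncapturing group, so case
# insensitivity applies only to the "a" inside the group.
scoped = re.compile(r"(?i:a)b")

# Unscoped form: no colon, so this is a plain mode modifier.
# Python expects it at the very start, where it covers the whole regex.
unscoped = re.compile(r"(?i)ab")

samples = ("ab", "Ab", "AB")
scoped_results = [bool(scoped.fullmatch(s)) for s in samples]
unscoped_results = [bool(unscoped.fullmatch(s)) for s in samples]
print(scoped_results, unscoped_results)
```

Both forms are valid syntax, which fits the report's reading that the question is one of consistency rather than correctness.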

Patrick Haldi  Feb 12, 2023 
ePub Page section 4.7, section Dates
hard to tell with epub

After the text "Python also uses a different syntax for named backreferences", an example with Python syntax is given (obviously), but the "Regex flavors:" lists ".NET, Java 7, XRegExp, PCRE 7, Perl 5.10, Ruby 1.9".

Copy-Paste error from the paragraph above?

Patrick Haldi  Feb 12, 2023 
ePub Page Section 7, Group strings, See Also
footnote 10

The footnote says "...the regex engine will try 2^(1/n) permutations" (^ stands for power, which I don't know how to enter here)

This seems odd. Those are some example values:

n == 1 --> 2 permutations
n == 2 --> 1.414 permutations
n == 3 --> 1.260 permutations

I would expect the number of permutations to be an integer, and to rise with rising n.
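
The values above are easy to reproduce. The short Python check below is only an illustration: it tabulates the exponent as printed next to 2^(n-1), the form proposed in another report on this page, which is shown for contrast and not as the author's confirmed wording.

```python
ns = [1, 2, 3, 4, 5]

# Exponent as it appears in the ePub footnote: shrinks towards 1.
printed = [round(2 ** (1 / n), 3) for n in ns]

# Exponent proposed in a separate report: integer values that grow with n.
growing = [2 ** (n - 1) for n in ns]

for n, p, g in zip(ns, printed, growing):
    print(f"n={n}: 2^(1/n)={p}  2^(n-1)={g}")
```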

Patrick Haldi  Feb 12, 2023 
ePub Page Section 9, Allow > in attribute values
hard to tell with epub

sentence "It's advantage over the previous regex..."

-> It's should be Its

Patrick Haldi  Feb 12, 2023 
Other Digital Version 1
Sections 2.4-2.5 of the Code Samples in Regular_Expressions_Cookbook_2_Code_Samples.html

In the Code Samples in Regular_Expressions_Cookbook_2_Code_Samples.html

The line

Regex options: None (the <<&#65533;&#483;>>dot matches line breaks<<&#65533;&#485;>> option must not be set)

appears three times in sections 2.4-2.5. You probably don't see them, but the characters between the <<.>> are non-standard html characters (probably some Windows-specific character set) and show up as question marks in a black diamond on Mac OS X. They need to be changed to proper HTML entities.

And in the Mobi version, these characters appear as curly quotes, so they should be &ldquo; and &rdquo;, respectively. There must be more places I missed them, because when I fixed this in my copy of the html file, it made 47 changes for each entity.

Do this:
s/&#65533;&#483;/\&ldquo;/g
s/&#65533;&#485;/\&rdquo;/g

Mark Nobles  Oct 17, 2013 
Other Digital Version 2.12

I believe the definition of a googol as a 'decimal number with 100 digits' in recipe 2.12 is incorrect. A googol is a 1 followed by 100 zeroes (10^100), which has 101 digits, so the regex should be:

\b10{100}\b (correct)

but not

\b\d{100}\b (incorrect)

The regex in the book also matches one hundred 9s, which is a googol minus one. More generally, it matches any run of exactly one hundred digits, whatever those digits are.
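
A quick way to see the difference is to run both patterns against a real googol and against one hundred 9s. This is a minimal Python sketch using the two patterns quoted in this report; the variable names are made up for the illustration.

```python
import re

googol = str(10 ** 100)   # "1" followed by 100 zeroes: 101 digits
nines = "9" * 100         # 100 digits, equal to a googol minus one

book_regex = re.compile(r"\b\d{100}\b")       # pattern quoted from the book
proposed_regex = re.compile(r"\b10{100}\b")   # correction proposed in this report

results = {
    "book matches googol": bool(book_regex.search(googol)),
    "book matches nines": bool(book_regex.search(nines)),
    "proposed matches googol": bool(proposed_regex.search(googol)),
    "proposed matches nines": bool(proposed_regex.search(nines)),
}
print(results)
```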

Minor point. Enjoying the book nonetheless.

Howard Maher  Aug 26, 2014 
Other Digital Version 2.22
perl solution and discussion

The Perl solution is listed as
$`$_$'

It kind of works, but not in the way it is explained.
It will do the substitution correctly like so:

$_ = 'BeforeMatchAfter';
s/Match/$`$_$'/;

However, the explanation is wrong about $_. It is not set to anything after a match (whether with m// or s///). In the code above, it is assigned the value of the subject string. It has the same value in the replacement part, so the code has the intended effect.
-----------------------------------------------------------------------------------------------
However, if you tried this, it would not work as intended:

$subject = 'BeforeMatchAfter';
$subject =~ s/Match/$`$_$'/;

This would put the value of $_ - whatever it was BEFORE those 2 statements - into the middle of the modified subject string.

----------------------------------------------------------------------------
It is actually $& that is set to the value of the match, so the general solution is:
$subject = 'BeforeMatchAfter';
$subject =~ s/Match/$`$`$&$'$'/;

$& gets "Match" ONLY, so another $` and $' must be put in the replacement string to get the string changed to "BeforeBeforeBeforeMatchAfterAfterAfter"
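
For comparison, the same replacement can be sketched in Python, which has no $` or $' variables. The helper name with_context is hypothetical; the slices of m.string stand in for Perl's $` and $', and the whole subject is available as m.string regardless of any other variable.

```python
import re

subject = "BeforeMatchAfter"

def with_context(m):
    """Build the replacement: text before the match, whole subject, text after."""
    before = m.string[:m.start()]   # counterpart of Perl's $`
    after = m.string[m.end():]      # counterpart of Perl's $'
    return before + m.string + after

result = re.sub(r"Match", with_context, subject)
print(result)
```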




Gregory Sherman  Feb 26, 2018 
4.21
Section 4.21

I've noticed a slight issue with one of the regular expressions given for
Irish VAT numbers in chapter 4.21 of your Regular Expressions Cookbook 2nd Edition.

It lists the regular expression for Irish VAT numbers as:

(IE)?[0-9]S[0-9]{5}L

Irish VAT numbers can take the forms 1234567X, 1X23456X, or 1234567XX. That is 8 or 9 characters, with one or two letters: the last character, the second and the last, or the last two. The regexp above captures only one of those forms.

I would suggest you add an additional regexp along the lines of:

(IE)?[0-9]{7}[A-Z]{1,2}
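
As a rough check of what the suggested pattern covers, the Python sketch below runs it against the three sample forms listed in this report. The sample values are the placeholders from the report, not real VAT numbers.

```python
import re

# Additional pattern suggested in this report.
suggested = re.compile(r"(IE)?[0-9]{7}[A-Z]{1,2}")

samples = ["1234567X", "1X23456X", "1234567XX"]
coverage = {s: bool(suggested.fullmatch(s)) for s in samples}
print(coverage)
```

The form with a letter in second position is not matched by this pattern, so it would still need a pattern of its own.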

Anonymous  Jun 03, 2015 
PDF Page 34
7th paragraph of Table 3-1

The printed version correctly shows BRE, while the PDF shows ERE (even though it is repeated in the 10th paragraph of the Table.)

Jonathan Knight  Mar 04, 2013 
ePub Page 66-68
several places in 2.5

"Discussion:

End of the subject
The difference between ‹\Z› and ‹\z› comes into play when the last
character in your subject text is a line break. In that case,
‹\Z› can match at the very end of the subject text, after the final
line break, as well as immediately before that line break.
.
.
.
The anchor ‹$› is equivalent to ‹\Z›, as long as you do not turn on
the “^ and $ match at line breaks” option.

End of a line
By default, ‹$› matches only at the end of the subject text
or before the final line break, just like ‹\Z›."


These statements are not correct with regard to Python's \Z anchor, which acts
like the \z anchor in other languages. The following program demonstrates this:


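import re

# Raw strings pass \Z to the regex engine unchanged.
if re.search(r'foo$', 'foo'): print('foo matched with $')
if re.search(r'foo\Z', 'foo'): print('foo matched with \\Z')
if re.search(r'foo$', 'foo\n'): print('foo\\n matched with $')
# The next line prints nothing: Python's \Z does not match before a final line break.
if re.search(r'foo\Z', 'foo\n'): print('foo\\n matched with \\Z')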
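
If the behavior the book describes for ‹\Z› is wanted in Python, one possible workaround is a lookahead that tolerates a single trailing line feed. This is a sketch of that idea, not something taken from the book; the pattern name emulated is made up.

```python
import re

# (?=\n?\Z) succeeds at the very end of the subject,
# or just before one final line feed.
emulated = re.compile(r"foo(?=\n?\Z)")

checks = {
    "foo": bool(emulated.search("foo")),
    "foo\\n": bool(emulated.search("foo\n")),
    "foo\\n\\n": bool(emulated.search("foo\n\n")),
    "foo\\nbar": bool(emulated.search("foo\nbar")),
}
print(checks)
```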



Gregory Sherman  Sep 26, 2018 
ePub Page 116
section 2.15

Solution
"JavaScript and Python do not support atomic grouping. There is no
way to eliminate needless backtracking with these two regex flavors."

Discussion
"Some regex implementations are clever and will abort
runaway match attempts early, but even then the regex will still kill
your application’s performance"

What are the "some"?
A long string without the closing </html> tag fails quickly against
the given "naive" regular expression in both Perl and Python.
It is also matched quickly when </html> is present, so no significant effect
on performance can be seen.

Gregory Sherman  Sep 19, 2018 
PDF Page 151
Bottom of page; second VB.Net example

The second example creates a RegEx object but fails to use it in the code. The code is nearly identical to the first VB.NET example. The error is repeated in the downloaded sample code.

Anonymous  Feb 25, 2014 
ePub Page 196
Perl

Perl stores the position where the match of each capturing group starts in the array @- and the position where each group ends in @-.
-------------------------------------------------------------------------------------------------------
should be "... ends in @+."

Gregory Sherman  Mar 30, 2018 
ePub Page 204
4.9 Named Capture - Perl

if ($subject =~ '!httX://(?<domain>[a-z0-9.-]+)%!) {
should be
if ($subject =~ m!httX://(?<domain>[a-z0-9.-]+)!) {

NOTE: There is no "X", but I had to put it in place of "p" so the system would accept this description.

Gregory Sherman  Oct 20, 2018 
Printed, Other Digital Version Page 211
1st paragraph

The last sentence in the 'Perl and Ruby' paragraph states:
"we can easily retrieve the text between the match and the previous one with $`"

This is incorrect, I believe, at least according to my tests and what Jeffrey Friedl says in 'Mastering Regular Expressions, 3rd edition', page 300: "you might wish $` to be the text from start of the match attempt, but it's the text from the start of the whole string, each time."

The code in your Perl example still works because the substitution across the entire string from the beginning to the previous match has already been done and so the code would simply ignore those quotes already changed.

Nonetheless, I believe it is a misunderstanding to think that $` is that part of the string since the last match, which could cause serious issues in cases where someone really doesn't want their code to re-iterate over the beginning of the string again... perhaps changing items numerous times...

Please let me know if I have misunderstood your point above or the issue. Thanks!

Howard Maher  Sep 30, 2014 
Printed, Other Digital Version Page 218
4th paragraph, last line

The book states (under the Perl discussion): "If a match occurs at the end of the subject string, the last element in the array will be an empty string."

This would not be the case with the example for Perl a couple of pages back, nor is it always true.

It is true only when the third parameter to the split operator is large enough to cover 'empty' elements at the end of the string, or if a '-1' is used as the 3rd parameter.

If, as in the example in the book, there is no optional 3rd parameter to split, then empty fields at the end of the string will be dropped.

So, to make the above statement more complete and accurate, one might append:
"if the 3rd parameter to split is used and is large enough to cover empty fields at the end, or if the 3rd parameter is set to '-1'; otherwise empty fields at the end will be dropped from the array."

I have seen the above behaviour bite experienced Perl programmers quite unexpectedly... :-)

Howard Maher  Sep 30, 2014 
Printed, Other Digital Version Page 236
middle

Two typos: 'table' should be '$table' on the line
"if (!defined(table)){"
and on the line
"return table;"
Either will give the error message:
"Bareword 'table' not allowed while 'strict subs' in use at..."

Howard Maher  Oct 01, 2014 
ePub Page 248
3.18 Perl solution

The Perl solution does not work as intended. The mistake was to assume that using "/g" on a match affects where $` begins. It doesn't; $` captures all of the string before the match, regardless of what pos() would have returned before.

Here is a Perl program that has been confirmed to solve the problem:

$result = "";
$textafter = $subject = '"text" <span class="middle">"text"</span> "text"';

while ($textafter =~ m/<[^<>]*>/) {
    $match = $&;
    $textafter = $';
    ($textbetween = $`) =~ s/"([^"]*)"/\x{201C}$1\x{201D}/;
    $result .= $textbetween . $match;
}
$textafter =~ s/"([^"]*)"/\x{201C}$1\x{201D}/;
$result .= $textafter;
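
The same idea, replacing quotes only in the text between tags, can also be sketched in Python by splitting on the tags instead of relying on match variables. The function name curl_quotes_outside_tags is hypothetical, and this is one possible approach rather than the book's solution.

```python
import re

def curl_quotes_outside_tags(subject):
    """Replace straight double quotes with curly ones, leaving tags untouched."""
    # With a capturing group, re.split keeps the tags in the result:
    # even indexes hold text between tags, odd indexes hold the tags.
    parts = re.split(r"(<[^<>]*>)", subject)
    for i in range(0, len(parts), 2):
        parts[i] = re.sub(r'"([^"]*)"', "\u201c\\1\u201d", parts[i])
    return "".join(parts)

subject = '"text" <span class="middle">"text"</span> "text"'
result = curl_quotes_outside_tags(subject)
print(result)
```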

Gregory Sherman  Mar 07, 2018 
ePub Page 276
5th "Regular expression syntax" paragraph

Recipe 2.3 tells you all about character classes, including combining them with shorthands, as in
‹[A-Z0-9_!#$%&'*+/=?`{|}~^.-]›. This class
matches a word character, as well as any of the 19 listed punctuation
characters.
----------------------------------------------------------------------------------------------
The regex is actually missing the "shorthand". It should be:
[\w!#$%&'*+/=?`{|}~^.-]

Gregory Sherman  Mar 30, 2018 
ePub Page 281
Regular expression

The following regular expression is the first solution in 4.2. It is supposed to match valid North American phone numbers (while ignoring some restrictions later discussed):

^\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$

It matches several bad forms, including
(2135551212
(714)-555-1212
310)555 1212


A Perl regex and substitution that matches the valid forms listed in the text, while excluding those that don't handle the optional parentheses correctly, is:

s#^(?<p>\()?(?<a>[0-9]{3})(?(<p>)\) ?|[-. ]?)(?<e>[0-9]{3})[-. ]?(?<n>[0-9]{4})$#($+{a}) $+{e}-$+{n}#
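
The same behavior can be checked in Python, which also supports conditionals on a named group. The sketch below is an adaptation of the Perl regex above, not the book's solution; the list of good forms is an assumption based on common North American formats.

```python
import re

# First solution of recipe 4.2, as quoted in this report.
book = re.compile(r"^\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$")

# Conditional version: a closing parenthesis is required only when the
# opening one matched (group "p"), mirroring the Perl regex above.
strict = re.compile(
    r"^(?P<p>\()?(?P<a>[0-9]{3})(?(p)\) ?|[-. ]?)"
    r"(?P<e>[0-9]{3})[-. ]?(?P<n>[0-9]{4})$"
)

bad_forms = ["(2135551212", "(714)-555-1212", "310)555 1212"]
good_forms = ["(213) 555-1212", "213-555-1212", "2135551212"]  # assumed valid

book_accepts_bad = [bool(book.search(s)) for s in bad_forms]
strict_accepts_bad = [bool(strict.search(s)) for s in bad_forms]
strict_accepts_good = [bool(strict.search(s)) for s in good_forms]
print(book_accepts_bad, strict_accepts_bad, strict_accepts_good)
```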




Gregory Sherman  Apr 02, 2018 
Printed Page 415-418
Recipes 7.5, 7.6, and 7.7

All three of these recipes for matching comments -- single-line comments, multiline comments, and all comments -- have the same flaw. They produce a false positive match when the comment symbols appear within a string, such as:

echo "// hello";

so these simple regex solutions can't be used for processing real source code.

This shortcoming should at least be mentioned explicitly in the recipes, even if you decide not to change the solutions.

The interaction between string and comment syntax is quite interesting and would make a good recipe on its own. Not only can strings include comment symbols, but comments can include single and double quotes.
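
The false positive is easy to reproduce. The Python sketch below assumes a simple single-line comment pattern, //.* , as a stand-in for the recipe's regex; the exact pattern in the book may differ.

```python
import re

# Assumed stand-in for a simple single-line comment regex.
single_line_comment = re.compile(r"//.*")

code = 'echo "// hello";'
m = single_line_comment.search(code)

# The "comment" found here is really part of a string literal.
false_positive = m.group(0) if m else None
print(false_positive)
```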

Daniel Barrett  Apr 07, 2016 
PDF, ePub Page 420 PDF - (2019 epub)
footnote

Footnote 2:

If there are n characters between the double quote and the end of the string, the regex engine will
try 2^(1/n) permutations of ‹(?:[^"\r\n]+|"")*›

----

it appears: 2^(1/n)
(this tends to 1 as n grows)

however, it must say 2^(n-1)
when n=4 the number of permutations is 8.
(thus, it has an exponential behavior).
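
One way to see where 2^(n-1) comes from: inside the repeated group, ‹[^"\r\n]+› can divide a run of n ordinary characters among its iterations in any way, and each way is a different path for the engine. The Python sketch below counts those divisions directly. It is a model of the counting argument, under the assumption that this is what the footnote refers to, and not a trace of any real regex engine.

```python
def splits(text):
    """Yield every way to cut text into consecutive non-empty chunks."""
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):
        for rest in splits(text[i:]):
            yield [text[:i]] + rest

# Number of divisions for runs of 1 to 6 characters.
counts = {n: sum(1 for _ in splits("x" * n)) for n in range(1, 7)}
print(counts)
```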

German Gonzalez-Morris  Dec 16, 2013 
ePub Page 543
paragraph beginning with "Each repetition"

update to previous report:

I have found that the presence of \b makes a difference
(at least in Python and Perl regular expressions) only if the minimum
in the quantifier is greater than the number of words in the string.

------------------------------------------------------------------------
import re
if re.search(r"^\W*(?:\w+\W*){5,}$", 'fee fi fo fum') : print ("match")
if re.search(r"^\W*(?:\w+\b\W*){5,}$", 'fee fi fo fum') : print ("match with \\b")

------------------------------------------------------------------------

print "match\n" if 'fee fi fo fum' =~ /^\W*(?:\w+\W*){5,}$/;
print "match with \\b\n" if 'fee fi fo fum' =~ /^\W*(?:\w+\b\W*){5,}$/;



Gregory Sherman  Mar 21, 2021 
ePub Page 545
paragraph beginning with "Each repetition"

"\b is needed between \w and \W ... to ensure that each repetition of the group really matches an entire word"

This is false, and not only in this specific case, as can be seen by experimenting with code like the Python and Perl statements below. The word boundary \b is never needed between any forms of the pair of character classes \w and \W. It is entirely redundant.
------------------------------------------------
print "match\n" if 'fee fi fo fum ' =~ /^\W*(?:\w+\b\W*){1,4}$/;

-------------------------------------------------
import re
if re.search(r"^\W*(?:\w+\W*){1,4}$", 'fee fi fo fum ') : print ("match")

Gregory Sherman  Mar 21, 2021 
ePub Page 760, 761
solutions

Both the "Basic ..." and "Match separators only ..." regular expressions can be simplified by replacing the negative lookahead with a word boundary:
(?![0-9]) becomes \b
Just like the versions in the text, they will work on isolated integers and floating point numbers with up to 3 places after the decimal point. The simplified versions will not add commas that the original ones would in a few instances, like the two strings:

x6789y

1234z
==================

Perl (5.22+) example

$_ = '1234567890.1234 8765.432 x9876y $6789 1234z @12345! 9.01 555-1212 7666999.222 678 .123456789';
print s/\d(?=(\d{3})+\b)/$&,/gnr, "\n";
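
The difference between the two variants can also be shown in Python. The lookahead pattern below is an assumption about the general shape of the book's regex, reconstructed from this report, and add_commas is a hypothetical helper.

```python
import re

# Assumed shape of the original: digit groups must not be followed by a digit.
lookahead = re.compile(r"[0-9](?=(?:[0-9]{3})+(?![0-9]))")

# Simplified variant from this report: digit groups must end at a word boundary.
boundary = re.compile(r"[0-9](?=(?:[0-9]{3})+\b)")

def add_commas(regex, text):
    """Insert a comma after every digit matched by regex."""
    return regex.sub(r"\g<0>,", text)

samples = ["1234567", "x6789y", "1234z"]
with_lookahead = [add_commas(lookahead, s) for s in samples]
with_boundary = [add_commas(boundary, s) for s in samples]
print(with_lookahead, with_boundary)
```

The two agree on isolated numbers and differ only where a letter or underscore directly follows the digits.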

Gregory Sherman  Mar 25, 2021 
ePub Page 839
Strings with Escapes - Solution

The regular expression is missing the asterisk just before the final double quotes.
It should be:
"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"

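The effect of the missing asterisk can be demonstrated with a short Python check. The pattern without_star is an assumption about what the ePub prints, built by dropping the final asterisk from the corrected regex; the sample strings are made up.

```python
import re

# Corrected regex from this report: zero or more escape sequences allowed.
corrected = re.compile(r'"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"')

# Assumed faulty version: the group must match exactly once.
without_star = re.compile(r'"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)"')

samples = [r'"plain"', r'"one \" escape"', r'"two \" and \\ escapes"']
corrected_ok = [bool(corrected.fullmatch(s)) for s in samples]
without_star_ok = [bool(without_star.fullmatch(s)) for s in samples]
print(corrected_ok, without_star_ok)
```

Without the asterisk, only strings containing exactly one escape sequence match in full.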

Gregory Sherman  Mar 26, 2021