'The the···', as well as allow differing amounts of whitespace (spaces, tabs, newlines, and the like) to lie between the words.

'···it is <B>very</B> very important···'.

'SetSize'
exactly as often (or as rarely) as it contained 'ResetSize'. To complicate matters, I
needed to disregard capitalization (such that, for example, 'setSIZE' would be
counted just the same as 'SetSize'). Inspecting the 32,000 lines of text by hand
certainly wasn't practical.
^(From|Subject):
from the last example, but there's nothing magic about it. For that matter, there is nothing magic about magic.
The magician merely understands something simple which doesn't appear to be
simple or natural to the untrained audience. Once you learn how to hold a card
while making your hand look empty, you only need practice before you, too, can
"do magic." Like a foreign language — once you learn it, it stops sounding like
gibberish.

"*.txt" can be used
to select multiple files. With filename patterns like this (called file globs or wildcards), a few characters have special meaning. The star means "match anything,"
and a question mark means "match any one character." So, with the file glob "*.txt", we start with a match-anything * and end with the literal .txt, so we end up with a pattern that means "select the files whose names start with anything and end with .txt".

regex.info, for links on how to obtain a copy of
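The file-glob rules described above (star for "match anything," question mark for "match any one character") can be tried out with a short sketch. It is written in Python rather than the shell, and it assumes the standard fnmatch module as a stand-in for the shell's own glob matching; the filenames are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical filenames, used only to exercise the two glob metacharacters.
names = ["notes.txt", "a.txt", "notes.txt.bak", "readme"]

def select(pattern, names):
    """Return the names that the file glob selects (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]

star_matches = select("*.txt", names)      # anything, then a literal .txt
question_matches = select("?.txt", names)  # exactly one character, then .txt

print(star_matches)
print(question_matches)
```

Note that the glob must cover the whole filename, which is why a name that merely contains .txt in the middle is not selected.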
^ (caret) and $ (dollar), which represent the start and end, respectively, of the line of text as it is being checked. As we've seen, the regular expression cat finds c·a·t anywhere on the line, but ^cat matches only if the c·a·t is at the beginning of the line: the ^ is used to effectively anchor the match (of the rest of the regular expression) to the start of the line. Similarly, cat$ finds c·a·t only at the end of the line, such as a line ending with scat.

^cat matches a line with cat at the beginning
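To see the anchors at work, here is a minimal sketch in Python (not egrep) that checks a few made-up lines against cat, ^cat, and cat$. The helper name and the sample lines are assumptions for illustration only.

```python
import re

# Made-up sample lines; each is checked independently, as egrep would do.
lines = ["cat", "catalog", "scat", "the cat sat"]

def lines_matching(pattern, lines):
    """Return the lines in which the regex finds a match somewhere."""
    return [line for line in lines if re.search(pattern, line)]

anywhere = lines_matching(r"cat", lines)   # c·a·t anywhere on the line
at_start = lines_matching(r"^cat", lines)  # anchored to the start of the line
at_end = lines_matching(r"cat$", lines)    # anchored to the end of the line

print(anywhere, at_start, at_end, sep="\n")
```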
...zip is 44272. If you write, send $4.95 to cover postage and...
[0-9]+, you don't care which numbers are matched. However, if your intent is to do something with the number (such as save to a file, add, replace, and such; we will see examples of this kind of processing in the next chapter), you'll care very much exactly which numbers are matched.
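A quick way to see exactly which numbers are matched is to ask a regex engine for them. The sketch below uses Python's re module (an assumption; the text itself is discussing egrep) on the sample line shown above.

```python
import re

line = "...zip is 44272. If you write, send $4.95 to cover postage and..."

# The first match is the earliest run of digits on the line.
first = re.search(r"[0-9]+", line).group()

# All non-overlapping matches, left to right. The decimal point is not a
# digit, so the price is seen as two separate numbers.
every = re.findall(r"[0-9]+", line)

print(first)
print(every)
```

If all you want to know is whether the line contains a number, the first match is enough; once you want to use the number, the difference between 44272, 4, and 95 matters.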
Answers to the questions in Section 1.4.

^cat$
    Literally means: matches if the line has a beginning-of-line (which, of course, all lines have), followed immediately by c·a·t, and then followed immediately by the end of the line.
    Effectively means: a line that consists of only cat: no extra words, spaces, punctuation... just 'cat'.

^$
    Literally means: matches if the line has a beginning-of-line, followed immediately by the end of the line.
    Effectively means: an empty line (with nothing in it, not even spaces).

^
    Literally means: matches if the line has a beginning-of-line.
    Effectively meaningless! Since every line has a beginning, every line will match, even lines that are empty!
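These three readings can be checked mechanically. The following is a small Python sketch (the sample lines and the helper name are made up) that applies each of the three expressions to the same set of lines.

```python
import re

# Made-up lines, including an empty one and one holding only a space.
lines = ["cat", "cat food", "", " "]

def lines_matching(pattern, lines):
    """Return the lines in which the regex finds a match."""
    return [line for line in lines if re.search(pattern, line)]

only_cat = lines_matching(r"^cat$", lines)  # the line consists of just 'cat'
empty = lines_matching(r"^$", lines)        # nothing on the line at all
everything = lines_matching(r"^", lines)    # every line has a beginning

print(only_cat, empty, everything, sep="\n")
```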
Answer to the question in Section 1.4.3.
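$/ = ".\n";    # Read the input in chunks that end with a period and newline.
while (<>) {
    # Highlight doubled words; skip any chunk that has none.
    next if !s/\b([a-z]+)((?:\s|<[^>]+>)+)(\1\b)/\e[7m$1\e[m$2\e[7m$3\e[m/ig;
    s/^(?:[^\e]*\n)+//mg;    # Remove any unmarked lines.
    s/^/$ARGV: /mg;          # Ensure lines begin with filename.
    print;
}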
\b([a-z]+)((?:\s|<[^>]+>)+)(\1\b)
^(?:[^\e]*\n)+
^

^ is certainly recognizable, but the other expressions have items unfamiliar to our egrep-only experience. This is because Perl's regex flavor is not the same as egrep's. Some of the notations are different, and Perl (as well as most modern tools) tends to provide a much richer set of metacharacters than egrep. We'll see many examples throughout this chapter.

'ResetSize' exactly as many times as 'SetSize'. The utility I used was Perl, and the command was:

    % perl -0ne 'print "$ARGV\n" if s/ResetSize//ig != s/SetSize//ig' *
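For readers who want to experiment with the same check outside Perl, here is a rough Python sketch of what that one-liner does for a single file's text. The function name is hypothetical, and re.subn is assumed as the counterpart of Perl's substitution count; it is an illustration, not a replacement for the command above.

```python
import re

def counts_differ(text):
    """Report whether 'ResetSize' and 'SetSize' occur a different number of times."""
    # Remove every 'ResetSize' first, ignoring capitalization, and count the
    # removals. The order matters: 'ResetSize' itself contains 'setSize'.
    text, resets = re.subn(r"ResetSize", "", text, flags=re.IGNORECASE)
    # What remains can only contain standalone 'SetSize' occurrences.
    _, sets = re.subn(r"SetSize", "", text, flags=re.IGNORECASE)
    return resets != sets

balanced = counts_differ("SetSize(10); ResetSize();")
unbalanced = counts_differ("setSIZE(1); SetSize(2); resetsize();")
print(balanced, unbalanced)
```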
$reply and reports whether it contains only digits:
if ($reply =~ m/^[0-9]+$/) {
    print "only digits\n";
} else {
    print "not only digits\n";
}
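For comparison, the same only-digits test can be sketched in Python, where the regex is passed to a function rather than linked to the string with an operator. The function name is made up for this illustration.

```python
import re

def only_digits(reply):
    """True if the reply consists of one or more digits and nothing else."""
    # As in Perl, $ here also matches just before a final newline.
    return re.search(r"^[0-9]+$", reply) is not None

for reply in ["12345", "12a45", ""]:
    print(repr(reply), "only digits" if only_digits(reply) else "not only digits")
```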
^[0-9]+$, while the surrounding m/···/ tells Perl what to do with it. The m means to attempt a regular expression match, while the slashes delimit the regex itself. The preceding =~ links m/···/ with the string to be searched, in this case the contents of the variable $reply.

=~
with = or ==. The operator == tests whether two numbers are the same. (The operator eq, as we will soon see, is used to test whether two strings are the same.) The = operator is used to assign a value to a variable, as with $celsius = 20. Finally, =~ links a regex search with the target string to be searched. In the example, the search is m/^[0-9]+$/ and the target is $reply. Other languages approach this differently, and we'll see examples in the next chapter.

=~
as "matches," such that
    if ($reply =~ m/^[0-9]+$/)
$reply matches the regex ^[0-9]+$, then ...

$reply =~ m/^[0-9]+$/
is a true value if the

$var =~ m/regex/ attempts to match the given regular expression to the text in the given variable, and returns true or false appropriately. The similar construct $var =~ s/regex/replacement/ takes it a step further: if the regex is able to match somewhere in the string held by $var, the text actually matched is replaced by replacement. The regex is the same as with m/···/, but the replacement (between the middle and final slash) is treated as a double-quoted string. This means that you can include references to variables, including $1, $2, and so on to refer to parts of what was just matched.

$var =~ s/···/···/
the value of the variable is actually changed. (If there is no match to begin with, no replacement is made and the variable is left unchanged.) For example, if $var contained Jeff•Friedl and we ran

    $var =~ s/Jeff/Jeffrey/;

$var would end up with Jeffrey•Friedl. And if we did that again, it would end up with Jeffreyrey•Friedl. To avoid that, perhaps we should use a word-boundary metacharacter. As mentioned in the first chapter, some versions of egrep support \< and \> for their start-of-word and end-of-word metacharacters. Perl, however, provides the catch-all \b, which matches either:

    $var =~ s/\bJeff\b/Jeffrey/;
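The difference the word boundaries make is easy to check by running the substitution twice. The sketch below does so in Python rather than Perl; re.sub with count=1 is assumed as the closest match to a Perl substitution without the /g modifier, and the helper name is made up.

```python
import re

def substitute_twice(pattern, replacement, text):
    """Apply a first-match-only substitution two times in a row."""
    once = re.sub(pattern, replacement, text, count=1)
    twice = re.sub(pattern, replacement, once, count=1)
    return once, twice

# Without word boundaries, the second pass matches the Jeff inside Jeffrey.
plain = substitute_twice(r"Jeff", "Jeffrey", "Jeff Friedl")

# With \b on both sides, Jeffrey no longer contains Jeff as a whole word.
bounded = substitute_twice(r"\bJeff\b", "Jeffrey", "Jeff Friedl")

print(plain)
print(bounded)
```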
Answer to the question in Section 2.2.3.

How do [•⇥]* and •*|⇥* compare? (Here • marks a space and ⇥ marks a tab.)

(•*|⇥*) allows either •* or ⇥* to match, which allows either some spaces (or nothing) or some tabs (or nothing). It doesn't, however, allow a combination of spaces and tabs.

[•⇥]* matches [•⇥] any number of times. With a string such as '⇥••' it matches three times, a tab the first time and spaces the rest.

[•⇥]* is logically equivalent to (•|⇥)*, although for reasons shown in Chapter 4, a character class is often much more efficient.
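The comparison can be confirmed by asking whether each expression is able to account for an entire string of whitespace. This is a Python sketch, assuming re.fullmatch as the way to require that the whole string be matched; the constant and function names are made up.

```python
import re

CLASS = r"[ \t]*"          # a space or a tab, any number of times
ALTERNATION = r"( *|\t*)"  # some spaces, or else some tabs
GROUPED = r"( |\t)*"       # logically equivalent to the class

def covers(pattern, text):
    """True if the pattern can match the entire text."""
    return re.fullmatch(pattern, text) is not None

mixed = "\t  "  # a tab followed by two spaces
print(covers(CLASS, mixed), covers(ALTERNATION, mixed), covers(GROUPED, mixed))
```

Only the alternation of two starred items fails on the mixed string, since each of its alternatives allows one kind of whitespace but not both.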
Answer to the question in Section 2.3.

What does $var =~ s/\bJeff\b/Jeff/i do?

\bJEFF\b or
g/Regular Expression/p", was read "Global Regular Expression Print." This particular function was so useful that it was made into its own utility, grep (after which egrep, extended grep, was later modeled).

    if ($line =~ m/^Subject: (.*)/i) { $subject = $1; }
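In a language where regexes are handled through functions and objects rather than operators, the same extraction looks a little different. Here is a minimal Python sketch of the Subject example; the function name is an assumption made for illustration.

```python
import re

def subject_of(line):
    """Return the text after 'Subject: ' on a header line, or None."""
    match = re.search(r"^Subject: (.*)", line, re.IGNORECASE)
    # group(1) plays the role of Perl's $1: the text matched within the
    # first set of parentheses.
    return match.group(1) if match else None

print(subject_of("Subject: the little things"))
print(subject_of("From: someone"))
```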
^From:(.*)". What confuses many, especially early on, is the need to deal with the language's own string-literal metacharacters when composing a string to be used as a regular expression.

\t, \\, and \x2A, which are interpreted while the string's value is being composed. The most common regex-related aspect of this is that each backslash in a regex requires two backslashes in the corresponding string literal. For example, "\\n" is required to get the regex \n.

"\n", with many languages you'd then get a literal newline character, which just happens to do exactly the same thing as \n. Well, actually, if the regex is in an /x type of free-spacing mode,
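The interplay between string-literal escapes and regex escapes can be observed directly. The sketch below uses Python, whose ordinary string literals process \n in the way described, and assumes re.VERBOSE as Python's counterpart of a free-spacing mode; the variable names are made up.

```python
import re

text = "one\ntwo"

two_chars = "\\n"  # the string holds a backslash and an n: the regex \n
one_char = "\n"    # the string holds a single literal newline character

# Outside free-spacing mode, both forms match the newline in the text.
both_match = (re.search(two_chars, text) is not None
              and re.search(one_char, text) is not None)

# In free-spacing mode the literal newline is treated as ignorable
# whitespace, while the regex \n still matches a newline.
literal_ignored = re.fullmatch("one\ntwo", "onetwo", re.VERBOSE) is not None
escape_kept = re.fullmatch("one\\ntwo", text, re.VERBOSE) is not None

print(len(two_chars), len(one_char), both_match, literal_ignored, escape_kept)
```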
Character Representations
    Character Shorthands: \n, \t, \e (see Section 3.4.1.1)
    Octal Escapes: \num (see Section 3.4.1.3)
    Hex/Unicode Escapes: \xnum, \x{num}, \unum, \Unum (see Section 3.4.1.4)
    Control Characters: \cchar (see Section 3.4.1.5)

Character Classes and class-like constructs (see Sections 3.4.2.1, 3.4.2.2, 3.4.2.4, 3.4.2.5, 3.4.2.6, 3.4.2.7, 3.4.2.8, 3.4.2.9, 3.4.2.10, and 3.4.2.11)
    Normal classes: [a-z] and [^a-z]
*, +, ?, and {m,n}) are greedy.
to(nite|knight|night) against the text '···tonight···'. Starting with the t, the regular expression is examined one component at a time, and the "current text" is checked to see whether it is matched by the current component of the regex. If it does, the next component is checked, and so on, until all components have matched, indicating that an overall match has been achieved.
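The component-by-component process just described can be imitated by hand for this one expression. What follows is only a toy sketch in Python, hard-wired to the shape of to(nite|knight|night); it is not how any real engine is implemented, and the function name is made up.

```python
import re

def first_match(text, alternatives=("nite", "knight", "night")):
    """Find the first match of to(nite|knight|night), one component at a time."""
    for start in range(len(text)):
        pos = start
        # Component 1: the literal t must match the current character.
        if text[pos:pos + 1] != "t":
            continue
        pos += 1
        # Component 2: the literal o must match the next character.
        if text[pos:pos + 1] != "o":
            continue
        pos += 1
        # Component 3: try each alternative in turn at the current position.
        for alternative in alternatives:
            if text.startswith(alternative, pos):
                return start, text[start:pos + len(alternative)]
    return None

result = first_match("···tonight···")
print(result)
```

Checking the result against Python's own engine for the same expression is a useful sanity test: both report the match at the same position with the same text.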
to(nite|knight|night) example, the first component is t, which repeatedly fails until a 't' is reached in the target string. Once that happens, the o is checked against the next character, and if it matches, control moves to the next component. In this case, the "next component" is (nite|knight|night), which really means "nite or knight or night." Faced with three possibilities, the engine just tries each in turn. We (humans with advanced neural nets between our ears) can see that if we're matching

···x?···, the engine must decide whether it should attempt x. Upon reaching ···x+···, however, there is no question about trying to match