We’re halfway there: we’ve modified all our filenames to be consistently lowercase and to end in .html. Now we just need to edit the HREF attributes of the links inside those HTML files to reflect those changes. To do that, we will need to write a new script that can open up each member of a list of files that is passed to it, make changes to that file, and save the changes back to disk.
Even more than the renaming-files example we just finished, this one exposes us to a real risk of accidentally doing bad things to our data. Again, please make sure you have a good backup before proceeding. Also, see Parsing HTML with Regexes Considered Harmful for a discussion of some of the limitations of the approach presented here.
Here is a script that represents a first step in altering this example site’s HTML documents to match the filename changes made in the first half of the chapter:
#!/usr/bin/perl -w

# fix_links.plx

# this script processes all the *.html files whose names are supplied
# to it on the command line, replacing all HREF attributes
# that point to local resources in the current directory
# with rewritten versions that have:
#
# 1) '.htm' extensions changed to '.html', and
# 2) VaRiAnT capitalization uniformly downcased.

foreach $file (@ARGV) {

    unless (-f $file) {
        warn "$file is not a plain file. Skipping...\n";
        next;
    }
    unless ($file =~ /\.html$/) {
        warn "$file doesn't end in .html. Skipping...\n";
        next;
    }
    if ($file =~ m{/}) {
        warn "$file contains a slash. Skipping...\n";
        next;
    }

    open IN, $file or die "can't open $file for reading: $!";
    while ($line = <IN>) {
        print $line;
    }
    close IN;
}
This script actually looks a lot like rename.plx, the last example we worked on. After an initial comment explaining what it does, it has a foreach loop that processes each filename passed to it in the command-line arguments. Inside that loop come some sanity checks that should look pretty familiar because they’re based on those in rename.plx.
You’ll notice two slight differences in the second sanity check’s regular expression pattern compared with the corresponding one in rename.plx: a literal l (the letter “L”) has been added (so it looks for files ending in .html rather than .htm), and the /i modifier has been removed, such that only lowercase .html extensions will qualify.
Next comes the following line, which opens the $file currently being processed through the foreach loop, so the script can read it:
open IN, $file or die "can't open $file for reading: $!";
This is the standard Perl idiom for opening up a file in order to read its contents into your script. In the form used here, Perl’s open function takes two arguments: a filehandle name (by convention, it should be ALL CAPS), and a string specifying both the name of the file you want to open and, optionally, a symbol specifying how you want to open it. You saw in the last chapter how putting a pipe symbol (|) at the beginning of the filename string opened a pipe to an external program, such that printing to the filehandle sent the printed output to that program’s standard input. Because the default behavior of the open function is to open the file for reading, though, and because opening the file for reading is exactly what you want to do in this case, you can dispense with the symbol and just give the filename, which is what this line does.
After the open statement is the all-important or die clause, which makes the script die with an error message if the file can’t be opened.
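As a small sketch of what the or die clause guards against, the following snippet tries to open a file that is assumed not to exist (the filename is made up for this illustration) and prints the reason Perl stores in $! instead of dying, so the example keeps running:

```perl
use strict;
use warnings;

# Hypothetical filename, assumed not to exist in the current directory.
my $file = 'no_such_file_for_this_example.html';

# open returns a true value on success and a false value on failure;
# on failure, the special variable $! holds the reason.
my $opened = open(IN, $file);
my $reason = $opened ? '' : "$!";

if ($opened) {
    close IN;
} else {
    print "can't open $file for reading: $reason\n";
}
```

In a real script you would normally let die stop the program at this point, as fix_links.plx does, rather than carry on with a filehandle that was never opened.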
So, we have a filehandle opened for reading. In order to actually read from it, we put the filehandle inside a pair of angle brackets, which causes Perl to return a line of data from the file. The way you typically do that in your Perl script is with a while loop, like the one that comes next in this script:
while ($line = <IN>) {
    print $line;
}
A while loop is sort of a cross between an if block and a foreach loop. Its general form is:
while (something) {
    do something;
    do some other something;
    do still some other something;
}
Like an if block, the part inside the parentheses is tested, and the block fires off only if the thing being tested returns a true value. Like a foreach loop, though, the script can execute the block multiple times. What happens is, after the conditional test (that is, the part inside the parentheses) returns a true value and the script makes its first trip through the block, the conditional test is evaluated again, and if it’s still true, the block is executed again. And so on, for as long as the test keeps returning a true value.
Obviously, it could be a problem for your script if you put something in your while loop’s conditional test that never became false. The number 1, for example, is always “true,” so the following loop would in effect be a trap from which your script could never escape:
while (1) {
    print "hello, world!\n";
}
If you put this code in your script, it would simply print hello, world! on a line by itself, over and over again, forever (or until you remembered that you can kill your script in mid-execution by typing Ctrl-C in the shell).
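By way of contrast, here is a minimal sketch of a while loop whose test does eventually become false. The variable names are invented for this illustration; the point is only that something inside the block has to change what the test sees.

```perl
use strict;
use warnings;

my $count = 0;
my @trips;

# The test ($count < 3) is re-evaluated before every trip through the block.
while ($count < 3) {
    push @trips, $count;
    $count++;    # without this line the test would never become false
}

print "made ", scalar(@trips), " trips: @trips\n";
```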
Let’s take another look now at the line that begins this while loop, looking particularly at the logical test:
while ($line = <IN>) {
Perl looks at the return value of whatever is inside the parentheses in order to determine truth or falseness for the purpose of controlling the while loop. In this case, what is inside the parentheses is an assignment to the scalar variable $line. As Perl sees things, the return value of an assignment operation (that is, the thing that will be tested for truth) is whatever is assigned. So, what’s being assigned? The output of <IN>, which, as I mentioned a few moments ago, is simply a line from the file previously opened for reading and associated with the filehandle IN. Specifically, the line that is assigned is the “next” line because that is how the <IN> input operator works. The first time it is evaluated it returns the first line of the file, the next time it is evaluated it returns the next line, and so on, until the end of the file is reached, at which point it returns the undefined value, which is false, thereby terminating the loop.
In simpler language, the particular while loop shown here will run once for each line in the file, with the current line being stored in $line for each trip through the loop. In that sense it is somewhat analogous to running a foreach loop on an array consisting of all the lines in the file.
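To watch the loop make one trip per line without needing a real file on disk, the following sketch reads from a string instead. The sample data is an assumption made for this illustration, and opening a reference to a scalar is simply a convenient way to get a filehandle that reads from that string.

```perl
use strict;
use warnings;

# Sample data standing in for an HTML file (an assumption for this sketch).
my $sample = "<HTML>\n<HEAD>\n<TITLE>This is the title</TITLE>\n</HEAD>\n";

# Opening a reference to a scalar gives a filehandle that reads from the string.
open(IN, '<', \$sample) or die "can't open in-memory file: $!";

my @lines;
while (my $line = <IN>) {
    push @lines, $line;    # one trip through the loop per line
}
close IN;

print "read ", scalar(@lines), " lines\n";
```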
But wait. Some extremely clever and attentive reader out there is wondering about a special case. What if the file being read from contains a blank line, or a line consisting solely of the number 0? When a line like that is read and assigned to the $line variable, the while loop’s test will be evaluating the empty string, or the number 0. Either of those will evaluate to false (because the empty string and the number 0 are both false, according to Perl), thereby terminating the loop before the entire file has been read. Won’t that be a problem?
Well, no. That doesn’t happen, and it doesn’t happen because of a subtle fact about the input operator that you should try hard to remember because if you are anything like me you will fail to remember it many times during your Perl education, leading to some subtle bugs in your code. The reason an empty line, or a line containing only the number 0, doesn’t cause the while test to fail is that the input operator doesn’t just read and return the line itself. It also returns the newline that marks the end of the line. Or, more precisely, it considers that newline to be part of the line, and so returns it along with whatever came before it.
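So you see, an empty line in the file isn’t really empty. It consists of a newline, which is not the empty string, and isn’t 0, and so is true, according to Perl. The same is true of a line containing just the number 0: it isn’t just the number 0 (or the string "0") that gets returned by the input operator; it’s the 0 followed by a newline, and hence (again) is true. (For the one remaining edge case, a final line of 0 with no newline after it, Perl quietly tests a while condition of this form for definedness rather than truth, so the loop still reads the whole file.)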
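If you would like to check that claim for yourself, here is a minimal sketch that tests four strings for truth: the empty string, "0", a lone newline, and "0" followed by a newline. The hash and variable names are invented for this illustration.

```perl
use strict;
use warnings;

# Record whether each string counts as true or false in a boolean test.
my %truth;
foreach my $value ("", "0", "\n", "0\n") {
    $truth{$value} = $value ? 'true' : 'false';
}

# Print with the newline made visible so the output stays readable.
foreach my $value (sort keys %truth) {
    (my $shown = $value) =~ s/\n/\\n/g;
    print "\"$shown\" is $truth{$value}\n";
}
```

Only the two strings without a newline come out false, which is why lines read with the input operator do not end the loop early.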
If you run this new fix_links.plx script in the directory containing your HTML files, it will print out to your screen the contents of all the files specified in its arguments (or at least, all those that make it past the sanity checks). Pipe the command’s output to the more command, and you can page through that output a screenful at a time until you get tired of doing so, when you can just type q to quit from the pager and return to your shell prompt:
[jbc@andros testsite]$ fix_links.plx *.html | more
<HTML>
<HEAD>
<TITLE>This is the title</TITLE>
</HEAD>
<BODY>
Now that our script is successfully reading in the contents of the HTML files, we just need to assign those contents to a variable, rewrite any links in them that point to the newly renamed files, and print the rewritten contents back out to the file. To do that, we begin by adding the following line just before the line where we open the file for reading:
$content = '';
This sets a $content variable to contain the empty string, such that it will be emptied for each trip through the enclosing foreach loop.
Then, we delete the print statement from inside the while loop, and modify the loop so that it looks like this:
while ($line = <IN>) {

    # for HREF attributes pointing to the current directory,
    # downcase attribute, and rename '.htm' to '.html'

    $line =~ s/HREF="([^"\/]+\.htm)"/HREF="\L$1\El"/gi;
    $content .= $line;
}
This new version of the while loop adds a search-and-replace operation, courtesy of Perl’s substitution operator. The substitution operator, as you saw briefly in the previous chapter, has a search pattern just like a “regular” regular expression, except that it adds a replacement string, which is used to replace the part of the string that matched the search pattern. Figure 4-2 breaks this particular substitution operator down into its component parts.
Notice how a delimiter is in the middle, between the search pattern and the replacement string. Notice also how any optional modifiers go after the final delimiter (that is, after the replacement string). When this substitution operator is run against the $line variable, which holds the current line being read from the current file, whichever part of the string in $line matches the search pattern will be replaced by the replacement string.
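Before tackling the real pattern, it may help to see the same three-part layout in a much simpler substitution. This is only a sketch with made-up text; the search pattern and replacement string are plain literals.

```perl
use strict;
use warnings;

my $line = "Welcome to my home page.";

# s/search pattern/replacement string/ changes $line in place
# and returns the number of substitutions it made.
my $changes = ($line =~ s/home page/web site/);

print "$line ($changes change)\n";
```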
This substitution operator is designed to find all the strings of the form HREF="Something.HTM" in the file, and replace them with strings of the form HREF="something.html" (that is, downcasing the attribute values and sticking an “l” on the end of the filename extension). So, let’s see how it achieves that. After the initial s (which flags this as the substitution variety of regex) and the opening delimiter, we have the following search pattern:
HREF="([^"\/]+\.htm)"
This pattern looks a bit mind-boggling at first, but don’t let that throw you. Like any regular expression pattern, it will eventually yield its secrets to a determined analysis. You just have to be patient and go through it carefully, one character at a time.
This pattern begins by matching the literal string HREF=". You’ll notice that the non-alphanumeric characters = and " don’t need to be backslashed; they don’t mean anything special by themselves in a regex pattern. You could backslash them if you wanted to, just to be safe, and nothing bad would happen, but I haven’t bothered.
Next comes a left parenthesis, which is a special character in a regex search pattern. It doesn’t match anything in the string being matched against. Instead, it, along with its paired right parenthesis that comes later, serves to mark off a part of the expression for a capturing operation. This means that part of the string being matched against (the part corresponding to the part of the pattern enclosed by the parentheses) is going to be remembered and will be available later (in this case, in the replacement string we’ll be looking at in just a moment).
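Here is a minimal sketch of capturing on its own, outside of a substitution. The sample tag is hypothetical, and the pattern inside the parentheses is deliberately cruder than the one in fix_links.plx: a dot followed by a plus sign, which matches a run of any characters.

```perl
use strict;
use warnings;

my $tag = '<A HREF="Contact.HTM">Contact us</A>';   # hypothetical link

# The parentheses capture whatever sits between the two double quotes;
# after a successful match, the captured text is available in $1.
my $captured = '';
if ($tag =~ /HREF="(.+)"/) {
    $captured = $1;
}

print "captured: $captured\n";
```

This crude version works only because the sample contains exactly two double quotes. The character class described next is what keeps the real pattern from running past the end of the attribute value.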
The next part of the pattern is the following interesting-looking construct: [^"\/]. Those square brackets create something called a character class in the pattern. (If you read More Fun with Shell Expansion earlier in this chapter, you might remember reading about character classes there. Perl’s regex character classes work the same way.)
In essence, a character class is a list of characters, any one of which can match at this point in the pattern. For example, the class [abc] would allow any of the characters a, b, or c to match. Except that, just to make things interesting, this particular character class begins with a caret symbol (^). When a character class begins with a caret symbol, it becomes a negated character class. That is, it becomes a list of characters that can’t match at that point in the pattern.
To put it another way, using a negated character class is the same
thing as using a normal character class containing all the characters
except the ones actually listed in the negated class. So, this
particular character class says, in effect, “match any
character except a double quote or a forward slash.” (The
forward slash, you will notice, is escaped by a leading backslash, so
it isn’t interpreted as the end of the regex pattern. The
double quote, though, doesn’t need to be backslashed because it
has no special meaning in a regex search pattern.)
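To see the negated class at work on one character at a time, the following sketch splits a short made-up string into individual characters and tests each against [^"\/]. The variable names are invented for this illustration.

```perl
use strict;
use warnings;

my $text = 'a/b"c';    # made-up sample containing a slash and a double quote
my (@matched, @rejected);

# Test each character by itself against the negated character class.
foreach my $char (split //, $text) {
    if ($char =~ /[^"\/]/) {
        push @matched, $char;
    } else {
        push @rejected, $char;
    }
}

print "matched: @matched\n";
print "rejected: @rejected\n";
```

The letters end up in @matched, while the slash and the double quote, the two characters listed in the class, end up in @rejected.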
After this negated character class comes a plus sign (+), which is a regular expression quantifier. A quantifier doesn’t match anything by itself. Instead, it tells the expression how many of the immediately preceding item to match. The plus sign quantifier says “match one or more of the preceding item.” Because the quantifier is “greedy,” it will match as many characters as it can. In this case, the immediately preceding item is the negated character class, which means that this plus sign says, “match any of the characters allowed by the preceding character class. You have to match at least one of them in order for the expression as a whole to successfully match, but you should keep matching until you have matched as many as you can.”
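A quick sketch of that greedy, one-or-more behavior follows. Both sample strings are made up, and the ^ at the start of each pattern (outside the square brackets) is an addition for this illustration that pins the match to the beginning of the string.

```perl
use strict;
use warnings;

# The quantified class grabs every allowed character it can,
# stopping only when it reaches the double quote.
my $text = 'index.htm">Home</A>';
my $run  = '';
if ($text =~ /^([^"\/]+)/) {
    $run = $1;
}

# With zero allowed characters at the start, "one or more" cannot be
# satisfied, so the match fails.
my $empty_ok = ('"quoted' =~ /^[^"\/]+/) ? 1 : 0;

print "greedy run: $run\n";
print "match on leading quote: ", ($empty_ok ? "yes" : "no"), "\n";
```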
The next part of the pattern says to match a literal period (escaped by a backslash, to turn off its special meaning), then the literal characters htm. Then comes a closing parenthesis, which again is not meant to be matched literally, but instead ends the capturing operation begun earlier. Finally, at the end of the pattern is a literal double quote (which doesn’t have to be escaped, though you could throw a backslash in front of it just to be safe if you weren’t sure).
So much for the search pattern of this substitution regex. The second part, the replacement string, looks like this:
HREF="\L$1\El"
This works pretty much like a regular double-quoted string. We’ve used a couple of interesting twists in this case, though. The first interesting thing is that we’ve used a special scalar variable, $1. This variable holds the part of the text string being matched against that was matched by the part of the regex pattern enclosed by parentheses. (If we had more than one pair of capturing parentheses in the search pattern, we could access the part captured in the leftmost-starting pair via $1, the next via $2, and so on.) Because the replacement string works like a double-quoted string, that variable will be interpolated, meaning it will be replaced by whatever is stored in the variable.
The second interesting thing we’ve done is to use the \L string escape sequence (which works in any double-quoted string, not just the replacement part of a substitution operator) to force all the text that comes after it to be lowercase. The \E escape sequence that comes later tells Perl to stop doing the lowercase thing; that is, it turns off the lowercasing previously turned on by \L. In this case that \E isn’t really necessary because the only remaining characters in the string at that point are the lowercase l (an “L”, not to be confused with the earlier numeral 1 in $1), and a double quote ("), neither of which would have been changed by \L’s lowercasing behavior. Still, I figured it was good form to explicitly end the lowercasing with \E, so you’d see how that was done.
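Because \L and \E work in any double-quoted string, you can try them outside a substitution. In this sketch the filename and the trailing text are made up, and qq{...} is used simply as a double-quoted string that can contain literal double quotes without backslashes.

```perl
use strict;
use warnings;

my $name = 'AboutUs.HTM';    # hypothetical filename

# Same shape as the replacement string: lowercase the name, then add an l.
my $rewritten = qq{HREF="\L$name\El"};

# With \E, the text after it keeps its case; without \E, \L runs to the end.
my $with_e    = "\L$name\E IS DONE";
my $without_e = "\L$name IS DONE";

print "$rewritten\n$with_e\n$without_e\n";
```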
After the replacement string comes the final delimiter and the trailing modifiers, of which there are two in this case: the /g modifier and the /i modifier, both shoved together (the order doesn’t matter). The /i modifier does the same thing it does in a normal, nonsubstituting regex: it makes the search pattern’s alphabetic characters case-insensitive. The /g modifier does something interesting: it makes the substitution operation global, in the sense that after the first substitution has taken place, the substitution operator will continue looking for more places that match, and making more substitutions, until it has gone through the entire string. Without the /g modifier the substitution operator would perform its search and replace operation only once, at the first point in the string where it found a match, leaving the rest of the string untouched.
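The difference the /g modifier makes shows up as soon as a line contains two links. The sample line below is hypothetical; the substitution is the one from the script, run once with only /i and once with both modifiers on separate copies of the line.

```perl
use strict;
use warnings;

# Hypothetical line containing two links to local .htm files.
my $line = '<A HREF="One.HTM">1</A> <a href="Two.htm">2</a>';

# Without /g: only the first match on the line is rewritten.
(my $once = $line) =~ s/HREF="([^"\/]+\.htm)"/HREF="\L$1\El"/i;

# With /g: every match on the line is rewritten.
(my $global = $line) =~ s/HREF="([^"\/]+\.htm)"/HREF="\L$1\El"/gi;

print "$once\n$global\n";
```

Notice in the second result that the lowercase href comes back as HREF: the /i modifier lets the pattern match either case, but the replacement string spells the attribute name in capitals.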
So, to review this substitution operation as a whole, here’s what it does: it searches through the current line of the current file, looking for HREF attributes of the form HREF="(something that includes no double quote or slash characters, and ends with .htm)". When it finds one, it captures everything between the double quote at the beginning of the attribute value and the one at the end of the attribute value, then replaces the entire attribute with a version of itself that has the attribute value lowercased and a lowercase l appended to the end of it.
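Putting the pieces together, the following sketch runs the modified while loop against a small in-memory document rather than a real file. The sample HTML is an assumption made for this illustration, and the rewritten text is only collected in $content and printed, not saved anywhere.

```perl
use strict;
use warnings;

# Sample document standing in for one of the site's files (an assumption).
my $sample = <<'END_HTML';
<BODY>
<A HREF="Products.HTM">Products</A>
<A HREF="Contact.htm">Contact</A>
</BODY>
END_HTML

# Read from the string as though it were a file.
open(IN, '<', \$sample) or die "can't open in-memory file: $!";

my $content = '';
while (my $line = <IN>) {
    # Rewrite local .htm links, then append the line to the running total.
    $line =~ s/HREF="([^"\/]+\.htm)"/HREF="\L$1\El"/gi;
    $content .= $line;
}
close IN;

print $content;
```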
The requirement that the captured sequence contain no double quotes is just a way of making sure to capture the entire HREF attribute value (which this script assumes will always be delimited by double quotes), and nothing more. The requirement that the captured sequence contain no slash characters is a way of restricting this replacement operation to working only on HREF attributes that point to HTML files in the current directory. That is, the substitution operator will only modify attributes that don’t contain a full URL (with a leading http://), and don’t have any sort of path component pointing to a different directory. This way, your fix_links.plx script will not try to rewrite HREF attributes that point to files in other directories, or on other web sites. (Note that this will break on non-Unix systems that use something other than a forward slash as a directory separator.)
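The effect of the no-slashes rule can be sketched with three hypothetical links: one to a file in the current directory, one to a file in a subdirectory, and one full URL. The filenames and the example.com address are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical links: local file, subdirectory, and full URL.
my @links = (
    '<A HREF="Staff.HTM">',
    '<A HREF="images/Staff.HTM">',
    '<A HREF="http://www.example.com/Staff.HTM">',
);

my @rewritten;
foreach my $link (@links) {
    # Work on a copy so the original stays available for comparison.
    (my $copy = $link) =~ s/HREF="([^"\/]+\.htm)"/HREF="\L$1\El"/gi;
    push @rewritten, $copy;
}

print "$_\n" foreach @rewritten;
```

Only the first link changes; the slash in each of the other two stops the negated character class before the pattern can reach the .htm extension.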
Returning to the disclaimers offered at the beginning of this chapter, this script is designed to work only in a particular set of circumstances. We’ve assumed that we only want to modify links pointing to files in the current directory. We’ve further assumed that all the links we are interested in rewriting are in HREF attributes delimited by double quotes, with no space, tab, or newline characters on either side of the = joining the HREF to the attribute value. Finally, we’ve assumed that there are no strings of this form in the files in question other than those actually in <A> tags. That’s a lot of assuming, granted, but in these particular (hypothetical) circumstances, the script is good enough to get the job done.
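To make those assumptions concrete, here is a sketch that runs the same substitution over three hypothetical snippets that violate them. The labels and sample markup are invented for this illustration; the first two cases are links the pattern misses, and the third is ordinary text that it rewrites even though it is not inside a tag.

```perl
use strict;
use warnings;

# Hypothetical snippets that fall outside the script's assumptions.
my @cases = (
    [ 'single quotes',      q{<A HREF='Staff.HTM'>Staff</A>}      ],
    [ 'spaces around =',    q{<A HREF = "Staff.HTM">Staff</A>}    ],
    [ 'text outside a tag', q{<P>Type HREF="Staff.HTM" here.</P>} ],
);

my %changed;
foreach my $case (@cases) {
    my ($label, $html) = @$case;
    (my $rewritten = $html) =~ s/HREF="([^"\/]+\.htm)"/HREF="\L$1\El"/gi;
    $changed{$label} = ($rewritten ne $html) ? 1 : 0;
    print "$label: ", ($changed{$label} ? "rewritten" : "left alone"), "\n";
}
```

If your own files contain markup like any of these, the pattern would need adjusting before you could trust the script's output.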