Modifying HREF Attributes

We’re halfway there: we’ve modified all our filenames to be consistently lowercase and to end in .html. Now we just need to edit the HREF attributes of the links inside those HTML files to reflect those changes. To do that, we will need to write a new script that can open up each member of a list of files that is passed to it, make changes to that file, and save the changes back to disk.

Even more than the renaming-files example we just finished, this one exposes us to a real risk of accidentally damaging our data. Again, please make sure you have a good backup before proceeding. Also, see “Parsing HTML with Regexes Considered Harmful” for a discussion of some of the limitations of the approach presented here.

Parsing HTML with Regexes Considered Harmful

The accompanying example shows how to use a simple regular expression to manipulate a collection of HTML files (specifically, to alter the HREF attributes in the files’ <A> tags). Although I think it’s a useful example (or I wouldn’t be presenting it here), you should be aware of the inherent limitations of this approach.

In short, regular expressions, for all their power, simply can’t do a good job of breaking down an HTML document into its component parts (a process called parsing). Among the things that are perfectly valid in an HTML document, but which tend to give simple regex-based parsers fits, are:

Tags that continue across multiple lines

Tag attributes that contain quoted angle brackets

Tags “hidden” inside HTML comments

The accompanying example avoids some of these issues by simply looking for strings of the form HREF="something", but this opens an even larger can of worms because it simplistically assumes that any such string is actually part of an <A> tag, which, obviously, need not be the case. Further, real-world HREF attributes often have spacing on either side of the =, which this script doesn’t account for.

As perlfaq9 (that is, the Perl documentation file you can read by entering man perlfaq9 in the shell) points out, in the section headed, “How do I remove HTML from a string?”, the most correct way to parse an HTML document in a Perl script is to install and use the HTML::Parser module from CPAN. Unfortunately, you won’t be learning how to do things like that until somewhat later in this book. For now, just be aware that the approach presented here, while it offers some distinct advantages compared to manually editing dozens or hundreds of HTML files, falls well short of being an ideal solution and could, in fact, result in accidental mangling of your data. Please be careful.
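To make those limitations concrete, here is a small sketch that runs the search pattern used later in this section against a few sample lines. The sample lines and the simple_match helper are hypothetical, invented for this illustration, and the use strict, use warnings, and my declarations are conveniences of the sketch rather than part of the script being discussed.

```perl
use strict;
use warnings;

# Returns 1 if the simple pattern finds an HREF="something.htm" string
# anywhere in the text, and 0 otherwise.
sub simple_match {
    my ($text) = @_;
    return $text =~ /HREF="[^"\/]+\.htm"/i ? 1 : 0;
}

# Hypothetical sample lines, invented for this illustration.
my @samples = (
    '<A HREF="Page.HTM">an ordinary link</A>',        # the expected case
    '<A HREF = "Page.HTM">spaces around the =</A>',   # valid HTML, but missed
    '<!-- <A HREF="Old.HTM">commented out</A> -->',   # matched inside a comment
    '<P>Type HREF="Foo.HTM" in your editor.</P>',     # matched outside any tag
);

foreach my $sample (@samples) {
    printf "%-8s %s\n", (simple_match($sample) ? 'match' : 'no match'), $sample;
}
```

Only the second sample fails to match, which means a link that should be rewritten would be skipped, while two strings that aren't live links at all would be rewritten.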

First Version of the fix_links.plx Script

Here is a script that represents a first step in altering this example site’s HTML documents to match the filename changes made in the first half of the chapter:

#!/usr/bin/perl -w

# fix_links.plx

# this script processes all the *.html files whose names are supplied
# to it on the command line, replacing all HREF attributes
# that point to local resources in the current directory
# with rewritten versions that have:
#
# 1) '.htm' extensions changed to '.html', and
# 2) VaRiAnT capitalization uniformly downcased.

foreach $file (@ARGV) {

    unless (-f $file) {
        warn "$file is not a plain file. Skipping...\n";
        next;
    }
    
    unless ($file =~ /\.html$/) {
        warn "$file doesn't end in .html. Skipping...\n";
        next;
    }

    if ($file =~ m{/}) {
        warn "$file contains a slash. Skipping...\n";
        next;
    }
    
    open IN, $file or die "can't open $file for reading: $!";
    while ($line = <IN>) {
        print $line;
    }
    close IN;
}

This script actually looks a lot like rename.plx, the last example we worked on. After an initial comment explaining what it does, it has a foreach loop that processes each filename passed to it in the command-line arguments. Inside that loop come some sanity checks that should look pretty familiar because they’re based on those in rename.plx. You’ll notice two slight differences in the second sanity check’s regular expression pattern compared with the corresponding one in rename.plx: a literal l (the letter “L”) has been added (so it looks for files ending in .html rather than .htm), and the /i modifier has been removed, such that only lowercase .html extensions will qualify.

Next comes the following line, which opens the $file currently being processed through the foreach loop, so the script can read it:

open IN, $file or die "can't open $file for reading: $!";

This is the standard Perl idiom for opening up a file in order to read its contents into your script. Perl’s open function takes two arguments: a filehandle name (by convention, it should be ALL CAPS), and a string specifying both the name of the file you want to open and, optionally, a symbol specifying how you want to open it.

You saw in the last chapter how putting a pipe symbol (|) at the beginning of the filename string opened a pipe to an external program, such that printing to the filehandle sent the printed output to that program’s standard input. Because the default behavior of the open function is to open the file for reading, though, and because opening the file for reading is exactly what you want to do in this case, you can dispense with the symbol and just give the filename, which is what this line does.

After the open statement is the all-important or die clause, to have the script die with an error message if the file can’t be opened.
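If you'd like to see the or die clause fire without actually killing a script, the following sketch is one way to do it. It wraps the same open statement in an eval block, which traps the die and stores its message in the special variable $@. The eval, the try_open helper, and the filename are all assumptions made for this illustration; none of them is part of fix_links.plx.

```perl
use strict;
use warnings;

# Tries to open a file for reading, using the same idiom as the script.
# Returns the die message if the open fails, or an empty string if it
# succeeds. (try_open is a hypothetical helper invented for this sketch.)
sub try_open {
    my ($file) = @_;
    my $ok = eval {
        open IN, $file or die "can't open $file for reading: $!";
        close IN;
        1;
    };
    return $ok ? '' : $@;
}

# Assumes no file by this name exists in the current directory.
my $message = try_open('no_such_file_for_this_example.html');
print "the script would have died with: $message" if $message;
```

Note that the low-precedence or matters when open is written without parentheses, as it is here. With || in its place, the die would attach to $file rather than to the open call, and a failed open would go unnoticed.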

Reading from a File with a while Loop

So, we have a filehandle opened for reading. In order to actually read from it, we put the filehandle inside a pair of angle brackets, which causes Perl to return a line of data from the file. The way you typically do that in your Perl script is with a while loop, like the one that comes next in this script:

while ($line = <IN>) {
    print $line;
}

A while loop is sort of a cross between an if block and a foreach loop. Its general form is:

while (something) {
    do something;
    do some other something;
    do still some other something;
}

As with an if block, the part inside the parentheses is tested, and the block runs only if the test returns a true value. As with a foreach loop, though, the script can execute the block multiple times. After the conditional test (that is, the part inside the parentheses) returns a true value and the script makes its first trip through the block, the conditional test is evaluated again, and if it’s still true, the block is executed again. And so on, for as long as the test keeps returning a true value.

Obviously, it could be a problem for your script if you put something in your while loop’s conditional test that never became false. The number 1, for example, is always “true,” so the following loop would in effect be a trap from which your script could never escape:

while (1) {
    print "hello, world!\n";
}

If you put this code in your script, it would simply print hello, world! (followed by a newline) over and over again, forever (or until you remembered that you can kill your script in mid-execution by typing Ctrl-C in the shell).
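By contrast, a while loop whose test eventually becomes false ends on its own. Here is a minimal sketch of one; the countdown helper and its $count variable are hypothetical names invented for this illustration.

```perl
use strict;
use warnings;

# Counts down from $start, returning one message per trip through the loop.
sub countdown {
    my ($start) = @_;
    my @messages;
    my $count = $start;
    while ($count > 0) {
        push @messages, "countdown: $count\n";
        $count = $count - 1;    # eventually makes the test false
    }
    return @messages;
}

print countdown(3);
```

Each trip through the block reduces $count by one, so after three trips the test $count > 0 returns false and the loop stops.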

Let’s take another look now at the line that begins this while loop, looking particularly at the logical test:

while ($line = <IN>) {

Perl looks at the return value of whatever is inside the parentheses in order to determine truth or falseness for the purpose of controlling the while loop. In this case, what is inside the parentheses is an assignment to the scalar variable $line. As Perl sees things, the return value of an assignment operation (that is, the thing that will be tested for truth) is whatever is assigned. So, what’s being assigned? The output of <IN>, which, as I mentioned a few moments ago, is simply a line from the file previously opened for reading and associated with the filehandle IN.

Specifically, the line that is assigned is the “next” line because that is how the <IN> input operator works. The first time it is evaluated it returns the first line of the file, the next time it is evaluated it returns the next line, and so on, until the end of the file is reached, at which point it returns the undefined value, which is false, thereby terminating the loop.

In simpler language, the particular while loop shown here will run once for each line in the file, with the current line being stored in $line for each trip through the loop. In that sense it is somewhat analogous to running a foreach loop on an array consisting of all the lines in the file.

But wait. Some extremely clever and attentive reader out there is wondering about a special case. What if the file being read from contains a blank line, or a line consisting solely of the number 0? When a line like that is read and assigned to the $line variable, the while loop’s test will be evaluating the empty string, or the number 0. Either of those will evaluate to false (because the empty string and the number 0 are both false, according to Perl), thereby terminating the loop before the entire file has been read. Won’t that be a problem?

Well, no. That doesn’t happen, because of a subtle fact about the input operator. Try hard to remember it: if you are anything like me, you will forget it many times during your Perl education, and some subtle bugs in your code will be the result. The reason an empty line, or a line containing only the number 0, doesn’t cause the while test to fail is that the input operator doesn’t just read and return the line itself. It also returns the newline that marks the end of the line. Or, more precisely, it considers that newline to be part of the line, and so returns it along with whatever came before it.

So you see, an empty line in the file isn’t really empty. It consists of a newline, which is not the empty string, and isn’t 0, and so is true, according to Perl. The same is true of a line containing just the number 0: it isn’t just the number 0 (or the string "0") that gets returned by the input operator; it’s the 0 followed by a newline, and hence (again) is true. (Perl also provides a safety net here: when a while loop’s test consists solely of an assignment from the input operator, Perl actually tests whether the assigned value is defined, rather than whether it is true. So even a final line consisting of a 0 with no trailing newline won’t end the loop early.)
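You can check this for yourself with a sketch like the following, which reads from a string instead of a real file. The read_lines helper and the sample data are hypothetical, invented for this illustration, and the trick of opening a reference to a string as an in-memory file assumes Perl 5.8 or later.

```perl
use strict;
use warnings;

# Reads every line from a string as though it were a file, using the same
# kind of while loop as the script. (read_lines is a hypothetical helper.)
sub read_lines {
    my ($data) = @_;
    my @lines;
    open IN, '<', \$data or die "can't open in-memory file: $!";
    while (my $line = <IN>) {
        push @lines, $line;
    }
    close IN;
    return @lines;
}

# Hypothetical file contents: the second line is blank, the third holds only 0.
my @lines = read_lines("first\n\n0\nlast\n");

print "lines read: ", scalar(@lines), "\n";
print "the blank line is true\n" if $lines[1];
print "the 0 line is true\n"     if $lines[2];
```

All four lines are read, and both of the suspect lines test as true, because what the input operator returned for them was "\n" and "0\n", respectively.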

If you run this new fix_links.plx script in the directory containing your HTML files, it will print out to your screen the contents of all the files specified in its arguments (or at least, all those that make it past the sanity checks). Pipe the command’s output to the more command, and you can page through that output a screenful at a time until you get tired of doing so, when you can just type q to quit from the pager and return to your shell prompt:

[jbc@andros testsite]$ fix_links.plx *.html | more
<HTML>
<HEAD>
<TITLE>This is the title</TITLE>
</HEAD>
<BODY>

And so on.

Modifying Data with a Substitution Operator

Now that our script is successfully reading in the contents of the HTML files, we just need to assign those contents to a variable, rewrite any links in them that point to the newly renamed files, and print the rewritten contents back out to the file. To do that, we begin by adding the following line just before the line where we open the file for reading:

$content = '';

This sets a $content variable to contain the empty string, such that it will be emptied for each trip through the enclosing foreach loop.

Then, we delete the print statement from inside the while loop, and modify the loop so that it looks like this:

    while ($line = <IN>) {

        # for HREF attributes pointing to the current directory,
        # downcase attribute, and rename '.htm' to '.html'

        $line =~ s/HREF="([^"\/]+\.htm)"/HREF="\L$1\El"/gi;

        $content .= $line;
    }

This new version of the while loop adds a search-and-replace operation, courtesy of Perl’s substitution operator. The substitution operator, as you saw briefly in the previous chapter, has a search pattern just like a “regular” regular expression, except that it adds a replacement string, which is used to replace the part of the string that matched the search pattern. Figure 4-2 breaks this particular substitution operator down into its component parts.

Figure 4-2. Components of a substitution operator

Notice how a delimiter sits in the middle, between the search pattern and the replacement string, giving the operator the overall shape s/search pattern/replacement string/modifiers. Notice also how any optional modifiers go after the final delimiter (that is, after the replacement string). When this substitution operator is run against the $line variable, which holds the current line being read from the current file, whichever part of the string in $line matches the search pattern will be replaced by the replacement string.

This substitution operator is designed to find all the strings of the form HREF="Something.HTM" in the file, and replace them with strings of the form HREF="something.html" (that is, downcasing the attributes and sticking an “l” on the end of the filename extension). So, let’s see how it achieves that. After the initial s (which flags this as the substitution variety of regex) and the opening delimiter, we have the following search pattern:

HREF="([^"\/]+\.htm)"

This pattern looks a bit mind-boggling at first, but don’t let that throw you. Like any regular expression pattern, it will eventually yield its secrets to a determined analysis. You just have to be patient and go through it carefully, one character at a time.

This pattern begins by matching the literal string HREF=". You’ll notice that the non-alphanumeric characters = and " don’t need to be backslashed; they don’t mean anything special by themselves in a regex pattern. You could backslash them if you wanted to, just to be safe, and nothing bad would happen, but I haven’t bothered.

Next comes a left parenthesis, which is a special character in a regex search pattern. It doesn’t match anything in the string being matched against. Instead, it, along with its paired right parenthesis that comes later, serves to mark off a part of the expression for a capturing operation. This means that part of the string being matched against (the part corresponding to the part of the pattern enclosed by the parentheses) is going to be remembered and will be available later (in this case, in the replacement string we’ll be looking at in just a moment).

The next part of the pattern is the following interesting-looking construct: [^"\/]. Those square brackets create something called a character class in the pattern. (If you read More Fun with Shell Expansion earlier in this chapter, you might remember reading about character classes there. Perl’s regex character classes work the same way.)

In essence, a character class is a list of characters, any one of which can match at this point in the pattern. For example, the class [abc] would allow any of the characters a, b, or c to match. Except that, just to make things interesting, this particular character class begins with a caret symbol (^). When a character class begins with a caret symbol, it becomes a negated character class. That is, it becomes a list of characters that can’t match at that point in the pattern. To put it another way, using a negated character class is the same thing as using a normal character class containing all the characters except the ones actually listed in the negated class. So, this particular character class says, in effect, “match any character except a double quote or a forward slash.” (The forward slash, you will notice, is escaped by a leading backslash, so it isn’t interpreted as the end of the regex pattern. The double quote, though, doesn’t need to be backslashed because it has no special meaning in a regex search pattern.)
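Here is a small sketch of that negated character class at work on its own. The has_only_allowed_chars helper and the sample values are hypothetical, invented for this illustration. The sketch also assumes the ^ and $ anchors (which tie the match to the start and end of the string) and a + quantifier, so that the class has to account for every character in the value.

```perl
use strict;
use warnings;

# Returns 1 if every character in the string is allowed by [^"\/], that is,
# if the string contains no double quote and no forward slash.
sub has_only_allowed_chars {
    my ($text) = @_;
    return $text =~ /^[^"\/]+$/ ? 1 : 0;
}

# Hypothetical attribute values, invented for this illustration.
foreach my $value ('Index.HTM', 'images/logo.gif', 'http://www.example.com/a.htm') {
    my $verdict = has_only_allowed_chars($value) ? 'allowed' : 'rejected';
    print "$value: $verdict\n";
}
```

Only Index.HTM is allowed. The other two values contain a forward slash, which the negated class refuses to match.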

After this negated character class comes a plus sign (+), which is a regular expression quantifier. A quantifier doesn’t match anything by itself. Instead, it tells the expression how many of the immediately preceding item to match. The plus sign quantifier says “match one or more of the preceding item.” Because the quantifier is “greedy,” it will match as many characters as it can. In this case, the immediately preceding item is the negated character class, which means that this plus sign says, “match any of the characters allowed by the preceding character class. You have to match at least one of them in order for the expression as a whole to successfully match, but you should keep matching until you have matched as many as you can.”
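The following sketch shows both halves of that behavior: the greediness, and the requirement that at least one character match. The first_quoted helper and its sample strings are hypothetical, invented for this illustration, and the pattern is a cut-down version of the one in the script.

```perl
use strict;
use warnings;

# Captures the run of characters after a double quote, stopping at the next
# double quote or slash. Returns the captured text, or undef if the pattern
# doesn't match. (first_quoted is a hypothetical helper.)
sub first_quoted {
    my ($text) = @_;
    return $text =~ /"([^"\/]+)/ ? $1 : undef;
}

# Greedy: the + keeps matching until it runs into the closing double quote.
print first_quoted('HREF="About_Us.HTM"'), "\n";

# One or more: with nothing between the quotes, the pattern can't match.
print defined(first_quoted('HREF=""')) ? "matched\n" : "no match\n";
```

The first call captures About_Us.HTM in its entirety rather than stopping after the A, and the second call reports no match.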

The next part of the pattern says to match a literal period (escaped by a backslash, to turn off its special meaning), then the literal characters htm. Then comes a closing parenthesis, which again is not meant to be matched literally, but instead ends the capturing operation begun earlier. Finally, at the end of the pattern is a literal double quote (which doesn’t have to be escaped, though you could throw a backslash in front of it just to be safe if you weren’t sure).

So much for the search pattern of this substitution regex. The second part, the replacement string, looks like this:

HREF="\L$1\El"

This works pretty much like a regular double-quoted string. We’ve used a couple of interesting twists in this case, though. The first interesting thing is that we’ve used a special scalar variable, $1. This variable holds the part of the text string being matched against that was matched by the part of the regex pattern enclosed by parentheses. (If we had more than one pair of capturing parentheses in the search pattern, we could access the part captured in the leftmost-starting pair via $1, the next via $2, and so on.) Because the replacement string works like a double-quoted string, that variable will be interpolated, meaning it will be replaced by whatever is stored in the variable.
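To see $1 and $2 side by side, consider this sketch, which uses a variation on the script's pattern with a second pair of capturing parentheses added. The variation, the split_href helper, and the sample tag are hypothetical, invented for this illustration.

```perl
use strict;
use warnings;

# Splits a local .htm attribute value into its base name and extension.
# Returns both captures, or an empty list if the pattern doesn't match.
# (split_href is a hypothetical helper.)
sub split_href {
    my ($tag) = @_;
    if ($tag =~ /HREF="([^"\/]+)\.(htm)"/i) {
        return ($1, $2);    # leftmost-starting pair first, then the next
    }
    return;
}

my ($name, $extension) = split_href('<A HREF="Contact.HTM">');
print "name is $name, extension is $extension\n";
```

Here $1 holds Contact and $2 holds HTM, each interpolated into the double-quoted string just as $1 is interpolated into the replacement string.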

The second interesting thing we’ve done is to use the \L string escape sequence (which works in any double-quoted string, not just the replacement part of a substitution operator) to force all the text that comes after it to be lowercase. The \E escape sequence that comes later tells Perl to stop doing the lowercase thing; that is, it turns off the lowercasing previously turned on by \L. In this case that \E isn’t really necessary because the only remaining characters in the string at that point are the lowercase l (an “L”, not to be confused with the earlier numeral 1 in $1), and a double quote ("), neither of which would have been changed by \L’s lowercasing behavior. Still, I figured it was good form to explicitly end the lowercasing with \E, so you’d see how that was done.
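A case where the \E does make a difference is easy to construct. In this sketch, the downcase_demo helper and the sample text are hypothetical, invented for this illustration.

```perl
use strict;
use warnings;

# Builds two strings from the same name: one that ends the lowercasing
# with \E, and one that lets \L run to the end of the string.
# (downcase_demo is a hypothetical helper.)
sub downcase_demo {
    my ($name) = @_;
    my $with_e    = "\L$name\E and MORE";
    my $without_e = "\L$name and MORE";
    return ($with_e, $without_e);
}

my ($with_e, $without_e) = downcase_demo('MiXeD.HTM');
print "$with_e\n";       # mixed.htm and MORE
print "$without_e\n";    # mixed.htm and more
```

With the \E in place, only the interpolated variable is downcased. Without it, the lowercasing carries on through the rest of the string.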

After the replacement string comes the final delimiter and the trailing modifiers, of which there are two in this case: the /g modifier and the /i modifier, both shoved together (the order doesn’t matter). The /i modifier does the same thing it does in a normal, nonsubstituting regex: it makes the search pattern’s alphabetic characters case-insensitive. The /g modifier does something interesting: it makes the substitution operation global, in the sense that after the first substitution has taken place, the substitution operator will continue looking for more places that match, and making more substitutions, until it has gone through the entire string. Without the /g modifier the substitution operator would perform its search and replace operation only once, at the first point in the string where it found a match, leaving the rest of the string untouched.
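The effect of the /g modifier is easiest to see on a line containing two links. In this sketch, the fix_once and fix_all helpers and the sample line are hypothetical, invented for this illustration; the substitution itself is the one from the script, with and without the /g.

```perl
use strict;
use warnings;

# The script's substitution without /g: stops after the first match.
sub fix_once {
    my ($line) = @_;
    $line =~ s/HREF="([^"\/]+\.htm)"/HREF="\L$1\El"/i;
    return $line;
}

# The script's substitution with /g: keeps going to the end of the string.
sub fix_all {
    my ($line) = @_;
    $line =~ s/HREF="([^"\/]+\.htm)"/HREF="\L$1\El"/gi;
    return $line;
}

# Hypothetical line containing two links, invented for this illustration.
my $line = '<A HREF="One.HTM">1</A> <A HREF="Two.HTM">2</A>';

print fix_once($line), "\n";
print fix_all($line), "\n";
```

The first line printed has only One.HTM rewritten (to one.html), while the second has both links rewritten.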

So, to review this substitution operation as a whole, here’s what it does: it searches through the current line of the current file, looking for HREF attributes of the form HREF="(something that includes no double quote or slash characters, and ends with .htm)". When it finds one, it captures everything between the double quote at the beginning of the attribute value and the one at the end, then replaces the entire attribute with a version of itself in which the attribute value has been downcased and has had a lowercase l appended to it.

The requirement that the captured sequence contain no double quotes is just a way of making sure to capture the entire HREF attribute value (which this script assumes will always be delimited by double quotes), and nothing more. The requirement that the captured sequence contain no slash characters is a way of restricting this replacement operation to working only on HREF attributes that point to HTML files in the current directory. That is, the substitution operator will only modify attributes that don’t contain a full URL (with a leading http://), and don’t have any sort of path component pointing to a different directory. This way, your fix_links.plx script will not try to rewrite HREF attributes that point to files in other directories, or on other web sites. (Note that this will break on non-Unix systems that use something other than a forward slash as a directory separator.)
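A quick way to convince yourself of all this is to run the substitution against a handful of different links and see which ones change. In the following sketch, the fix_line helper and the sample tags are hypothetical, invented for this illustration; only the substitution itself comes from the script.

```perl
use strict;
use warnings;

# Applies the script's substitution to one line and returns the result.
# (fix_line is a hypothetical helper.)
sub fix_line {
    my ($line) = @_;
    $line =~ s/HREF="([^"\/]+\.htm)"/HREF="\L$1\El"/gi;
    return $line;
}

# Hypothetical tags, invented for this illustration.
my @tags = (
    '<a href="News.HTM">',                          # local file: rewritten
    '<A HREF="../Other/Page.HTM">',                 # other directory: left alone
    '<A HREF="http://www.example.com/Index.htm">',  # full URL: left alone
    '<A HREF="already.html">',                      # already fixed: left alone
);

foreach my $tag (@tags) {
    print "$tag becomes ", fix_line($tag), "\n";
}
```

Only the first tag changes. Notice one side effect: because the replacement string spells out HREF in capital letters, a lowercase href in the original comes out as HREF. That makes no difference to a browser, since HTML attribute names are case-insensitive.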

Returning to the disclaimers offered at the beginning of this chapter, this script is designed to work only in a particular set of circumstances. We’ve assumed that we only want to modify links pointing to files in the current directory. We’ve further assumed that all the links we are interested in rewriting are in HREF attributes delimited by double quotes, with no space, tab, or newline characters on either side of the = joining the HREF to the attribute value. Finally, we’ve assumed that there are no strings of this form in the files in question other than those actually in <A> tags. That’s a lot of assuming, granted, but in these particular (hypothetical) circumstances, the script is good enough to get the job done.


Perl for Web Site Management by John Callender
