Renaming Files

In this chapter, we’re going to pretend we’re working on a web site that was developed by someone working in a Windows environment (where filenames are case-insensitive, and HTML files typically get .htm filename extensions). Now, however, that site has been moved to a Unix web server, and you wish to make the following changes:

  • Rename all the .htm files so that the filenames end with .html.

  • Change the filenames (some of which feature uppercase letters) so that they are uniformly lowercase.

  • Modify all the HREF attributes contained in those pages so that they match your changes to the filenames.

If you had only a few files to deal with, you could just do all this manually. Filenames could be changed one at a time using the Unix mv (for “move”) command, which has the effect of renaming the file whose name is given in its first argument to the name given in its second argument:

[jbc@andros testsite]$ mv Index.HTM index.html

You could then edit the HREF attributes of each file in a text editor, changing <A HREF="Index.HTM"> to <A HREF="index.html">. And so on.

But what if you have a lot of files that you want to manipulate? At a certain point, the effort of manually making all those changes (and policing the errors that will inevitably creep in as you grind your way through this boring task) is going to be greater than the effort of writing a tool to make the changes for you. At this early stage in your education that break-even point will come later (since creating the tool will be a slower process than it will be later on), but you can think of the effort involved as an educational investment that will pay you back many times over in future productivity gains. The plodding, manual approach offers no such promise of future rewards.

In any event, let’s get started.

Globbing

Here’s an ls listing of a directory containing some of those mixed-case filenames:

[jbc@andros testsite]$ ls
Clinton.JPG        Hello_CGI.htm      Sample_Form.htm
Form_to_Email.HTM  Hello_Command.HTM  guestbook_email.htm
Guestbook.HTM      NEXT.HTM           index.htm

The first thing to do is figure out how you’re going to feed all those filenames to your Perl script so that it can do its modifying. There are many ways to do that, but the method you’re going to use here is a feature of the Unix shell called pathname expansion, or, more colloquially, globbing.

If you come from a Windows background, you’re probably familiar with the use of an asterisk (*, pronounced “star”) as a wildcard character when specifying a filename. The asterisk stands for “any number of characters, including no characters” when specifying a filename. This legacy of the DOS command-line environment shows up in dialog boxes’ Filename fields, where you sometimes see things like *.doc to represent “all filenames ending with the characters .doc,” or *.* to represent “all filenames whatsoever.”

DOS/Windows filenames are divided into two parts: the filename itself and a three-character extension, with the period character (., pronounced “dot”) serving as the separator between the two parts. Unix filenames don’t feature the notion of a filename extension, at least not in the formal sense that DOS filenames do. You’re free to stick a .plx or .txt or .walnuts on the end of your Unix filenames, but the operating system doesn’t care that you’ve done so. You can also stick multiple periods in a filename, so you could have a file called this.filename.has.lots.of.dots.

Why am I carrying on about this? For only one reason: in DOS or Windows, the wildcard sequence that allows you to specify “every filename whatsoever” is *.*. In Unix, it’s just *. Let’s try it out now.

Tip

The statement that * stands for “every filename whatsoever” in Unix and Unix-like systems isn’t quite accurate. Files whose names begin with a single period are “hidden,” and won’t be matched by a * wildcard sequence. To match those names, you would need something like .* (“dot star”).

You may not have realized it before this, but you can supply a filename as an argument to the ls command, in which case ls will list information about that file only:

[jbc@andros testsite]$ ls index.htm
index.htm

In fact, you can give a whole bunch of filenames, and ls will dutifully list just those filenames (or, more precisely, just those of the names that correspond to files in the current directory):

[jbc@andros testsite]$ ls index.htm Hello_Command.HTM guestbook_email.htm
Hello_Command.HTM  guestbook_email.htm  index.htm

Now, the tricky and cool part is that you can use an asterisk as a wildcard character, and it will be interpreted as “any character, or any number of characters, including no characters, that will match a filename in the current directory.” So, as mentioned before, a * all by itself means “match any filename in the current directory at all” (except those starting with dots):

[jbc@andros testsite]$ ls *
Clinton.JPG        Hello_CGI.htm      Sample_Form.htm
Form_to_Email.HTM  Hello_Command.HTM  guestbook_email.htm
Guestbook.HTM      NEXT.HTM           index.htm

That’s the same output we got with ls all by itself because ls’s default behavior is to list all the files in the current directory, and that’s (almost) the same thing as saying ls *.

Tip

I said “almost” in the preceding sentence because of subdirectories. In this example no subdirectories were contained within the current directory. If there had been one or more subdirectories, and if those subdirectories’ names did not begin with a dot (.), the output of ls * would have included the contents of those directories, which invoking ls by itself would not have done. That happens because the * matches the names of those subdirectories, which means the ls command would have received those subdirectory names as explicit command-line arguments, and when ls gets a subdirectory name as an argument, it displays the contents of that directory. If that sounded confusing, just ignore it until later, when it will make more sense.

But let’s say we want to list only the filenames ending in .htm. We can just do this:

[jbc@andros testsite]$ ls *.htm
Hello_CGI.htm  Sample_Form.htm  guestbook_email.htm  index.htm

Hmm. There’s that Unix case-sensitivity thing again: we only got the files with lowercase .htm filename endings. But what about those uppercase .HTM files? How can we list those along with the .htm ones? Well, one easy way to do it is by just adding *.HTM to the command’s arguments, like this:

[jbc@andros testsite]$ ls *.htm *.HTM
Form_to_Email.HTM  Hello_CGI.htm      NEXT.HTM         guestbook_email.htm
Guestbook.HTM      Hello_Command.HTM  Sample_Form.htm  index.htm

If you think this wildcard-expansion thing is fun, see More Fun with Shell Expansion for even niftier tricks.

Now, a subtle but potentially very powerful point about all this is that it isn’t actually the ls command that is expanding that * into a list of matching filenames. In fact, it’s the shell that is doing so, and it is only after the shell has done the expansion that it hands off that list of filenames as the argument to ls. In other words, the ls command never sees the literal star (*) character. It only sees the list of filenames that are the result of the shell’s expansion of *. This is powerful because you are not limited to using filename expansion only in the arguments to the ls command. You can also use it in the arguments to any command, including your own custom Perl programs.

In still other words, you can use the shell’s wildcard expansion as a convenient, flexible way to hand off a list of specific filenames to your Perl program for processing.
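To see for yourself that the shell, not the program, does the expanding, a tiny script can report exactly what lands on its command line. This is only a minimal sketch: the name show_args.plx and the describe_args subroutine are made up for illustration, and it uses use strict with my variables, which the scripts in this chapter do not.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# show_args.plx (hypothetical name) - report the arguments the shell handed us

# Build one line of output per argument, plus a count on top.
sub describe_args {
    my @args  = @_;
    my @lines = ("received " . scalar(@args) . " argument(s)");
    push @lines, "  [$_]" for @args;
    return @lines;
}

# @ARGV holds whatever words followed the script's name on the command line.
print "$_\n" for describe_args(@ARGV);
```

Running it as show_args.plx *.htm lists the matching filenames, one per argument. Quoting the pattern (show_args.plx '*.htm') stops the shell from expanding it, so the script receives the literal string *.htm as its single argument.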

A Simple Renaming Script

Let’s see a Perl program that renames all the files for you. We’ll build the script from scratch, modifying things as we go along; see Example 4-1 for the final script:

#!/usr/bin/perl -w

# rename.plx - rename files so they end in '.html'

foreach $file (@ARGV) {
    print "got $file\n";
}

Let’s look at this line by line. The first line is just the usual shebang line, with warnings turned on via the -w switch. As mentioned before, if you are using a Perl version equal to or later than 5.6.0, you can do the same thing with a use warnings statement at the beginning of your script. Next is a comment giving the name of the script and a brief description of what it does (or will do, once we’re done creating it).

The next three lines contain a foreach loop. A foreach loop, you will recall, processes each element of an array variable or list, sticking the current item into the scalar variable whose name is given between the foreach keyword and the list, so that item can be accessed during the current trip through the loop.

In this case, the array being processed by the foreach loop is the special array @ARGV. What is the @ARGV array variable, you ask? Well, it turns out to be something special: every script gets it automatically every time it runs, and it contains a list of whatever words came after the script’s name on the command line. We call these additional words arguments. So this foreach loop will run once for each of the script’s arguments, storing the current argument in the variable $file and printing out each element in @ARGV via the print "got $file\n" statement. (Later, we’ll stick the code for renaming the files inside this loop. But for right now, we’ll print a message just to inform us that the foreach loop works.)

Now, remember that the argument list in @ARGV will reflect the result of wildcard expansion by the shell. So, running rename.plx in the directory from our earlier example and giving it an argument of *.htm results in the following output:

[jbc@andros testsite]$ rename.plx *.htm
got Hello_CGI.htm
got Sample_Form.htm
got guestbook_email.htm
got index.htm

Likewise, running it with an argument of *.htm *.HTM gives this:

[jbc@andros testsite]$ rename.plx *.htm *.HTM
got Hello_CGI.htm
got Sample_Form.htm
got guestbook_email.htm
got index.htm
got Form_to_Email.HTM
got Guestbook.HTM
got Hello_Command.HTM
got NEXT.HTM

Now that we’ve seen how to feed a list of filenames to a script and run a foreach loop that processes each filename, here’s a modified version of rename.plx that actually renames files. (If you are creating this script yourself, though, please don’t run it yet; we still need to add some safety features to it.)

#!/usr/bin/perl -w

# rename.plx - rename files so they end in '.html'

foreach $file (@ARGV) {
    $new = lc $file;
    $new = $new . 'l';
    rename $file, $new or die "couldn't rename $file to $new: $!";
}

The script is the same, except for the part inside the foreach loop’s block. Let’s look at that line by line.

First comes this:

$new  = lc $file;

This takes the name of the current file being processed through the foreach loop, makes a lowercase version of it using Perl’s lc function, and assigns that lowercase filename to a new scalar variable called $new.

$new = $new . 'l';

The next line takes that new, lowercase version of the filename and adds a lowercase letter l to the end of it, using the . (“dot”) operator, which is also called the string concatenation operator because it joins, or concatenates, the string on its left with the string on its right, returning the concatenated string.

Because all we’re doing with that concatenated string is storing it back into the $new variable, we can actually write this line a little more concisely using the special operator .= (which I guess you could pronounce “dot equals”). The .= operator has the effect of appending a string to the string currently stored in a variable, and then sticking the concatenated string back into that variable:

$new .= 'l';

Shortening $new = $new . 'l' into $new .= 'l' is a bit like using a contraction when speaking (e.g., saying “won’t” instead of “will not”), and is one of those natural-language-inspired shortcuts in Perl.
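The two steps covered so far (lowercasing, then appending) can be tried on their own before any real files are involved. In this minimal sketch, html_name is a hypothetical helper that is not part of rename.plx; it only computes the new name and touches nothing on disk.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# html_name (hypothetical helper) - compute the new name for a file
sub html_name {
    my ($file) = @_;
    my $new = lc $file;   # 'Index.HTM' becomes 'index.htm'
    $new .= 'l';          # 'index.htm' becomes 'index.html'
    return $new;
}

print html_name('Index.HTM'), "\n";     # prints index.html
print html_name('Clinton.JPG'), "\n";   # prints clinton.jpgl
```

The second call shows why blindly appending an l is risky: a name that doesn't end in .htm still gets the extra letter.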

Next comes the line that does the actual work:

rename $file, $new or die "couldn't rename $file to $new: $!";

Here, Perl’s rename function is used to take the file named by $file and rename it to $new. If that rename operation fails, the or die part of the line kicks in, terminating the script and printing an error message that includes $!, the special variable containing the error message returned by the system when the operation failed. If the rename function succeeds, everything after the or gets skipped, so the script continues happily to the next pass through the foreach block.
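It is worth noticing that the line uses the word or rather than the || operator. The following sketch illustrates the difference; try_rename is a hypothetical helper, invented here, that returns a message instead of terminating the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# try_rename (hypothetical helper) - rename a file, reporting failure
# as a returned message instead of dying
sub try_rename {
    my ($old, $new) = @_;

    # "or" binds more loosely than the comma, so Perl reads this as
    #     rename($old, $new) or return ...
    rename $old, $new or return "couldn't rename $old to $new: $!";
    return "renamed $old to $new";
}

# With ||, which binds more tightly than the comma, a line such as
#     rename $old, $new || die "...";
# would be read as rename($old, ($new || die "...")), so the die would
# depend on $new being false rather than on the rename failing.

print try_rename(@ARGV), "\n" if @ARGV == 2;
```

If you prefer ||, wrapping the arguments in parentheses, as in rename($old, $new) || die, gives the intended grouping.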

Sanity Checking

Your assembly line appears to be ready to go. If you’re impatient, you’re probably anxious to run your script right now. Resist that impulse. This is the time to look things over carefully with a pessimistic eye, asking yourself what could possibly go wrong and trying to prevent any nasty accidents. Measure twice, cut once, and all that. Swiss Army chainsaw. Hole Hawg.

One potential concern with this script is that the assembly line doesn’t have a quality control inspector. Every item mentioned in @ARGV gets an l appended to it, and then the script tries to rename a file from the old name to the new one. Now, we already discussed how you were planning to run this script with a carefully crafted argument of *.htm *.HTM, which would be expanded by the shell into a list of just the files you wanted, but consider what would happen if you accidentally invoked the script like this:

[jbc@andros testsite]$ rename.plx *

You could accidentally glob up files other than the ones you wanted, appending l’s to their names, too. Bad idea. The answer (one answer, at least) is to put some new code in the foreach block that skips to the next file without doing the renaming if something doesn’t look right. This is a simple example of a common programming practice called a sanity check.

For example, you might want to add a sanity check that excludes files from being renamed if an existing file already has the new name. You could do that by adding the following code just before the line where you rename the file:

if (-e $new) {
    warn "$new already exists. Skipping...\n";
    next;
}

This uses Perl’s -e file test operator, which returns true if the filename given after it corresponds to an existing file. In this case, that means it returns true if $new (which contains the version of the filename with the 'l' added to the end) already exists. If that happens, this if block will execute, causing a warning to be printed via the warn function. The warn function is similar to the die function, in that it causes your script to complain to standard error (printing a message to your screen in the case of a script run manually, or to the web server’s error log for a CGI script). Unlike the die function, though, warn lets your script continue running after that point.

After issuing the warning, the if block uses the next function to make your script jump immediately to the next item in the foreach loop, without executing the rest of the statements in the loop. In other words, it causes the script to skip the rename operation for this file.

Another sanity check would skip files that were anything other than “plain” files. This would prevent the script from renaming a file that actually was a directory, for example. (In Unix, a symbolic link is a special file that actually just points to some other file. Be aware that -f follows symbolic links, so a link that points to a plain file still passes this test; Perl’s -l operator is the one that detects the link itself.) Here’s how you could implement that sanity check:

unless (-f $file) {
    warn "$file is not a plain file. Skipping...\n";
    next;
}

This check uses Perl’s -f file test operator, which returns true if the filename given in its argument corresponds to a plain file. We used unless here instead of if because we wanted to reverse the sense of the logical test. In other words, we wanted to execute the statements in the block only if the conditional test returned false rather than true. This is precisely what you get with unless.
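The file test operators are easy to experiment with. The sketch below classifies whatever paths you give it; kind_of is a hypothetical helper, and the -d (directory) and -l (symbolic link) operators it uses go beyond the two tests that rename.plx relies on.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# kind_of (hypothetical helper) - say what sort of thing a path names
sub kind_of {
    my ($path) = @_;
    # -l is checked first: the other tests follow a link to its target,
    # and a link whose target is gone would otherwise look "missing".
    return 'symbolic link' if -l $path;
    return 'missing'       unless -e $path;
    return 'directory'     if -d $path;
    return 'plain file'    if -f $path;
    return 'something else';
}

print "$_: ", kind_of($_), "\n" for @ARGV;
```

Running it with * as the argument in a test directory gives a quick picture of which entries a -f check would let through.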

We’re on quite a roll with these sanity checks, but let’s add two more before we stop. First, let’s prevent files from being renamed if the original name doesn’t end in .htm (or .HTM, or any other upper- or lowercase variation of that three-letter filename extension). Also, let’s prevent files from being renamed if their names contain forward slashes. That way, at least on a Unix system (where forward slashes are used to separate the directory names in a path), the renaming will be confined to the current directory. To add these features, insert the following before the rename:

unless ($file =~ /\.htm$/i) {
    warn "$file doesn't end in .htm or .HTM. Skipping...\n";
    next;
}

if ($file =~ /\//) {
    warn "$file contains a slash. Skipping...\n";
    next;
}

These logical tests are very interesting. Both of them use the =~ operator to tie the $file variable to a pattern matching operator based on Perl’s regular expressions feature. In each case, the pattern matching operator checks the name stored in the $file variable to see if it matches a particular search pattern, and returns a true value if it does or a false value if it doesn’t.

Before continuing, let’s talk about regular expressions and the associated pattern matching operators a bit more.

Regular Expressions

Regular expressions (which you’ll sometimes hear me refer to as regexes) are extremely powerful. For a beginning programmer, though, they’re almost too powerful; they can seem weird and scary and needlessly complicated. Still, you need to stick with them because they’re important.

As I’ve said, regular expressions are a tool for matching (and, potentially, replacing) specific patterns in a string of text. If you’ve used the “Find” or “Search and Replace” function in a word processor, you have some idea of what regular expressions do, but Perl’s regular expressions are much more powerful than that. Their rich (that is to say, confusingly complex) syntax allows you to specify with astonishing precision exactly what patterns you are looking for and what you want done to them.

In later chapters I’ll be explaining more about regular expressions. For now, let’s just look at a few examples to get an idea of how they work.

We’ll start with the one that looks for filename extensions like .htm (or .HTM, etc.). The whole expression looks like this: /\.htm$/i. The first thing you need to be able to do is break the expression down into its component parts. As Figure 4-1 shows, there are four different parts to this expression: the opening delimiter (/), the search pattern (\.htm$), the closing delimiter (/), and an optional modifier (i).

Figure 4-1. Parts of a simple regular expression

The delimiters are pretty straightforward: a slash to mark the beginning of the expression, and a slash to mark the end. The trailing modifier is easy to understand, too: this particular modifier (which you’ll typically see referred to as the /i modifier) simply makes the expression match case-insensitively.

It’s the regular expression pattern itself, the part between the delimiters, where the powerful magic hangs out. Regular expression patterns use their own specialized language, with lots of special rules and symbols. This pattern is actually fairly simple: \.htm$. Let’s go through it piece by piece, from left to right.

First, the leading backslash-plus-a-period (\.) matches a literal period character. That should give you a hint: a period without a leading backslash does something special in a regular expression. I’m not going to tell you what that something special is until later, because I’d rather you used that part of your brain to remember the following helpful rule about regular expression patterns. An alphanumeric character (the characters A through Z, a through z, and 0 through 9) always just stands for itself. A nonalphanumeric character, though, can sometimes mean something special. About a dozen of these nonalphanumeric characters have special meanings inside a regex; I’ll be introducing them as we go along.

Stick a backslash (\) in front of a nonalphanumeric character in your regex, though, and that special character will always revert to having its ordinary, literal meaning for matching purposes. That’s what we’ve done in this pattern: we wanted to match a literal period, so we put a backslash in front of it.

The next three characters (htm) just match themselves. That is, they will match the literal characters h, and t, and m, one right after the other, in that order. Also, because of that trailing /i modifier, each will also match the uppercase version of itself, such that HTM (and hTM, and Htm, etc.) would all match, too.

Alphanumeric characters work in the opposite way from nonalphanumeric characters. What I mean is, an alphanumeric character always stands for itself, unless you put a backslash in front of it, in which case it gets some special meaning (like \n, which gives you a newline in a regex pattern, just like it does in a double-quoted string).

All of which brings us to the last thing in this pattern: the trailing $. It’s not an alphanumeric character, and it doesn’t have a leading backslash, so that should give you a hint that it might be doing something special. And in fact it is: when a dollar sign ($) is used at the very end of a regular expression pattern, it means that the pattern that precedes it can match only if it occurs at the end of the string. In other words, the $ doesn’t match anything itself, but it makes it so that the rest of the pattern can match only if it comes at the very end of the string being matched against.

So, in this particular example, our pattern will match a string only if that string ends with the literal sequence .htm (or .HTM, .HtM, or whatever). A string like this: “this string has an .htm, but not at the end” would not produce a match with this particular pattern (but take out the $ at the end of the pattern, and it would).
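A quick way to build confidence in a pattern is to run it against a handful of sample strings. In this minimal sketch, ends_in_htm is a hypothetical helper and the sample names are made up for the test.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# ends_in_htm (hypothetical helper) - true if the name ends in .htm,
# in any mix of upper- and lowercase
sub ends_in_htm {
    my ($name) = @_;
    return $name =~ /\.htm$/i ? 1 : 0;
}

# The first two names should match; the rest should not.
foreach my $name ('index.htm', 'NEXT.HTM', 'page.html', 'an.htm.txt', 'xhtm') {
    printf "%-12s %s\n", $name, ends_in_htm($name) ? 'matches' : 'no match';
}
```

The last sample, xhtm, fails because the backslashed period insists on a literal dot in front of the htm.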

Now let’s look at the regex in the don’t-allow-any-slashes sanity check: /\//. This expression is actually a good deal simpler than the first one. There’s just an opening delimiter (/), the pattern itself (\/), and the closing delimiter (/). The pattern itself just matches a literal slash character, courtesy of the backslash in front of it. Without that backslash, Perl would think the slash in the pattern was actually the closing delimiter.

But for a simple pattern it sure looks confusing. The slash character doesn’t have any special meaning in the regex pattern itself; it only has to be backslashed because of its role as the pattern’s delimiter. It would be really nice if there was a way to use some other character to delimit the search pattern in this case, so we didn’t have to backslash the slash. And, as it turns out, there is a way to do that: put an m (for “matching operator”) in front of the expression, and then choose whatever we want for the delimiter. So, for example, that same regex could have been written as m#/#, or m|/|, either of which is arguably more readable than the original version. We also could choose a paired delimiter, like parentheses or braces, in which case the closing delimiter would be the closing member of the pair: m{/}. That one’s my personal favorite, so let’s update the code in rename.plx to use that version.

Summing up, the first of our regex-using sanity checks, which begins with this line:

unless ($file =~ /\.htm$/i) {

will fire off only if the filename in $file fails to end in the literal string .htm (or .HTM, etc.). The second of our regex-using sanity checks, which now begins with this line:

if ($file =~ m{/}) {

will fire off only if the filename in $file contains a slash character.
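If you want reassurance that the different delimiter styles really are interchangeable, the following sketch runs the same test three ways. The slash_results subroutine and the sample names are hypothetical, made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Three spellings of the same "contains a slash" test.
my @forms = (
    sub { $_[0] =~ /\// },    # default delimiters, slash backslashed
    sub { $_[0] =~ m#/# },    # pound signs as delimiters
    sub { $_[0] =~ m{/} },    # paired braces as delimiters
);

# slash_results (hypothetical helper) - return one 1 or 0 per spelling
sub slash_results {
    my ($name) = @_;
    return map { $_->($name) ? 1 : 0 } @forms;
}

foreach my $name ('index.htm', 'docs/index.htm') {
    print "$name: ", join(' ', slash_results($name)), "\n";
}
```

All three results agree for any given name, since only the delimiters differ and the pattern between them is the same.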

Running the Renaming Script

We could go on adding sanity checks all day, but I think we’ve been sufficiently paranoid for now. Now that the rename.plx script is finished, it should look like Example 4-1 (which you can download from this book’s script repository, at http://www.elanus.net/book/, if you want to play around with it).

Example 4-1. A script for renaming listed files to have lowercase filenames ending in .html

#!/usr/bin/perl -w

# rename.plx - rename files so they end in '.html'

foreach $file (@ARGV) {
    $new  = lc $file;
    $new .= 'l';
    
    if (-e $new) {
        warn "$new already exists. Skipping...\n";
        next;
    }

    unless (-f $file) {
        warn "$file is not a plain file. Skipping...\n";
        next;
    }
    
    unless ($file =~ /\.htm$/i) {
        warn "$file doesn't end in .htm or .HTM. Skipping...\n";
        next;
    }

    if ($file =~ m{/}) {
        warn "$file contains a slash. Skipping...\n";
        next;
    }
    rename $file, $new or die "couldn't rename $file to $new: $!";
}

Running it in our directory full of wackily named files, and using ls to look at the filenames before and after, results in the following:

[jbc@andros testsite]$ ls
Clinton.JPG        Hello_CGI.htm      Sample_Form.htm      rename.plx
Form_to_Email.HTM  Hello_Command.HTM  guestbook_email.htm
Guestbook.HTM      NEXT.HTM           index.htm
[jbc@andros testsite]$ rename.plx *.htm *.HTM
[jbc@andros testsite]$ ls
Clinton.JPG         guestbook_email.html  index.html  sample_form.html
form_to_email.html  hello_cgi.html        next.html
guestbook.html      hello_command.html    rename.plx

All the files whose names ended in .htm or .HTM have been renamed so that their filenames are uniformly lowercase and have .html extensions.
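Before letting a script like this loose on files you care about, you might prefer a rehearsal. The following is a minimal sketch of a dry-run variant, not part of the book's script repository: the name dry_run.plx and the plan_rename subroutine are invented here, and it applies the same four checks as Example 4-1 but only prints what it would do.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# dry_run.plx (hypothetical name) - show what rename.plx would do,
# without renaming anything

# Returns (new_name, undef) if $file passes every check,
# or (undef, reason) if it would be skipped.
sub plan_rename {
    my ($file) = @_;
    my $new = lc($file) . 'l';

    return (undef, "$new already exists")       if -e $new;
    return (undef, "$file is not a plain file") unless -f $file;
    return (undef, "$file doesn't end in .htm") unless $file =~ /\.htm$/i;
    return (undef, "$file contains a slash")    if $file =~ m{/};
    return ($new, undef);
}

foreach my $file (@ARGV) {
    my ($new, $why) = plan_rename($file);
    if (defined $new) {
        print "would rename $file to $new\n";
    }
    else {
        warn "$why. Skipping...\n";
    }
}
```

Invoking it with the same arguments you plan to give rename.plx (for example, dry_run.plx *.htm *.HTM) prints one line per file and leaves the directory untouched.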
