Directory Tools

One of the more common tasks in the shell utilities domain is applying an operation to a set of files in a directory—a “folder” in Windows-speak. By running a script on a batch of files, we can automate (that is, script) tasks we might have to otherwise run repeatedly by hand.

For instance, suppose you need to search all of your Python files in a development directory for a global variable name (perhaps you’ve forgotten where it is used). There are many platform-specific ways to do this (e.g., the grep command in Unix), but Python scripts that accomplish such tasks will work on every platform where Python works—Windows, Unix, Linux, Macintosh, and just about any other platform commonly used today. If you simply copy your script to any machine you wish to use it on, it will work regardless of which other tools are available there.

Walking One Directory

The most common way to go about writing such tools is to first grab a list of the names of the files you wish to process, and then step through that list with a Python for loop, processing each file in turn. The trick we need to learn here, then, is how to get such a directory list within our scripts. There are at least three options: running shell listing commands with os.popen, matching filename patterns with glob.glob, and getting directory listings with os.listdir. They vary in interface, result format, and portability.

Running shell listing commands with os.popen

Quick: how did you go about getting directory file listings before you heard of Python? If you’re new to shell tools programming, the answer may be “Well, I started a Windows file explorer and clicked on stuff,” but I’m thinking here in terms of less GUI-oriented command-line mechanisms (and answers submitted in Perl and Tcl get only partial credit).

On Unix, directory listings are usually obtained by typing ls in a shell; on Windows, they can be generated with a dir command typed in an MS-DOS console box. Because Python scripts may use os.popen to run any command line that we can type in a shell, this is the most general way to grab a directory listing inside a Python program. We met os.popen in the prior chapter; it runs a shell command string and gives us a file object from which we can read the command’s output. To illustrate, let’s first assume the following directory structure (yes, I have both dir and ls commands on my Windows laptop; old habits die hard):

C:\temp>dir /B
about-pp.html
python1.5.tar.gz
about-pp2e.html
about-ppr2e.html
newdir

C:\temp>ls
about-pp.html     about-ppr2e.html  python1.5.tar.gz
about-pp2e.html   newdir

C:\temp>ls newdir
more   temp1  temp2  temp3

The newdir name is a nested subdirectory in C:\temp here. Now, scripts can grab a listing of file and directory names at this level by simply spawning the appropriate platform-specific command line and reading its output (the text normally thrown up on the console window):

C:\temp>python
>>> import os
>>> os.popen('dir /B').readlines( )
['about-pp.html\n', 'python1.5.tar.gz\n', 'about-pp2e.html\n',
'about-ppr2e.html\n', 'newdir\n']

Lines read from a shell command come back with a trailing end-of-line character, but it’s easy enough to slice off with a for loop or list comprehension expression as in the following code:

>>> for line in os.popen('dir /B').readlines( ):
...     print line[:-1]
...
about-pp.html
python1.5.tar.gz
about-pp2e.html
about-ppr2e.html
newdir

>>> lines = [line[:-1] for line in os.popen('dir /B')]
>>> lines
['about-pp.html', 'python1.5.tar.gz', 'about-pp2e.html',
'about-ppr2e.html', 'newdir']

One subtle thing: notice that the object returned by os.popen has an iterator that reads one line per request (i.e., per next( ) method call), just like normal files, so calling the readlines method is optional here unless you really need to extract the result list all at once (see the discussion of file iterators earlier in this chapter). For pipe objects, the effect of iterators is even more useful than simply avoiding loading the entire result into memory all at once: readlines will block the caller until the spawned program is completely finished, whereas the iterator might not.
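
For instance, here is a minimal sketch (not from the book’s listings) that handles output as it arrives by iterating over the pipe object directly; the dir /B command again assumes a Windows shell:

import os

for line in os.popen('dir /B'):            # one line read per iteration
    print line.rstrip('\n')                # handle output as it arrives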

The dir and ls commands let us be specific about filename patterns to be matched and directory names to be listed; again, we’re just running shell commands here, so anything you can type at a shell prompt goes:

>>> os.popen('dir *.html /B').readlines( )
['about-pp.html\n', 'about-pp2e.html\n', 'about-ppr2e.html\n']

>>> os.popen('ls *.html').readlines( )
['about-pp.html\n', 'about-pp2e.html\n', 'about-ppr2e.html\n']

>>> os.popen('dir newdir /B').readlines( )
['temp1\n', 'temp2\n', 'temp3\n', 'more\n']

>>> os.popen('ls newdir').readlines( )
['more\n', 'temp1\n', 'temp2\n', 'temp3\n']

These calls use general tools and work as advertised. As I noted earlier, though, the downsides of os.popen are that it requires using a platform-specific shell command and it incurs a performance hit to start up an independent program. The following two alternative techniques do better on both counts.

The glob module

The term globbing comes from the * wildcard character in filename patterns; per computing folklore, a * matches a “glob” of characters. In less poetic terms, globbing simply means collecting the names of all entries in a directory—files and subdirectories—whose names match a given filename pattern. In Unix shells, globbing expands filename patterns within a command line into all matching filenames before the command is ever run. In Python, we can do something similar by calling the glob.glob built-in with a pattern to expand:

>>> import glob
>>> glob.glob('*')
['about-pp.html', 'python1.5.tar.gz', 'about-pp2e.html', 'about-ppr2e.html',
'newdir']

>>> glob.glob('*.html')
['about-pp.html', 'about-pp2e.html', 'about-ppr2e.html']

>>> glob.glob('newdir/*')
['newdir\\temp1', 'newdir\\temp2', 'newdir\\temp3', 'newdir\\more']

The glob call accepts the usual filename pattern syntax used in shells (e.g., ? means any one character, * means any number of characters, and [] is a character selection set).[*] The pattern should include a directory path if you wish to glob in something other than the current working directory, and the module accepts either Unix or DOS-style directory separators (/ or \). Also, this call is implemented without spawning a shell command and so is likely to be faster and more portable across all Python platforms than the os.popen schemes shown earlier.
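
As a minimal sketch of the other pattern operators against the same C:\temp files, the first call here would match only about-pp2e.html (? matches exactly one character), and the second would match both about-pp2e.html and about-ppr2e.html ([] selects from a set of characters):

import glob

print glob.glob('about-pp?e.html')         # ? matches exactly one character
print glob.glob('about-pp[2r]*.html')      # [] matches one character in the set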

Technically speaking, glob is a bit more powerful than described so far. In fact, using it to list files in one directory is just one use of its pattern-matching skills. For instance, it can also be used to collect matching names across multiple directories, simply because each level in a passed-in directory path can be a pattern too:

C:\temp>python
>>> import glob
>>> for name in glob.glob('*examples/L*.py'): print name
...
cpexamples\Launcher.py
cpexamples\Launch_PyGadgets.py
cpexamples\LaunchBrowser.py
cpexamples\launchmodes.py
examples\Launcher.py
examples\Launch_PyGadgets.py
examples\LaunchBrowser.py
examples\launchmodes.py

>>> for name in glob.glob(r'*\*\visitor_find*.py'): print name
...
cpexamples\PyTools\visitor_find.py
cpexamples\PyTools\visitor_find_quiet2.py
cpexamples\PyTools\visitor_find_quiet1.py
examples\PyTools\visitor_find.py
examples\PyTools\visitor_find_quiet2.py
examples\PyTools\visitor_find_quiet1.py

In the first call here, we get back filenames from two different directories that match the *examples pattern; in the second, both of the first directory levels are wildcards, so Python collects all possible ways to reach the base filenames. Using os.popen to spawn shell commands achieves the same effect only if the underlying shell or listing command does too.

The os.listdir call

The os module’s listdir call provides yet another way to collect filenames in a Python list. It takes a simple directory name string, not a filename pattern, and returns a list containing the names of all entries in that directory—both simple files and nested directories—for use in the calling script:

>>> os.listdir('.')
['about-pp.html', 'python1.5.tar.gz', 'about-pp2e.html', 'about-ppr2e.html',
'newdir']

>>> os.listdir(os.curdir)
['about-pp.html', 'python1.5.tar.gz', 'about-pp2e.html', 'about-ppr2e.html',
'newdir']

>>> os.listdir('newdir')
['temp1', 'temp2', 'temp3', 'more']

This too is done without resorting to shell commands and so is portable to all major Python platforms. The result comes back in no particular order (but can be sorted with the list sort method), contains base filenames without their directory path prefixes, and includes the names of both files and directories at the listed level.
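
If you need predictable output, the usual fix is to sort the result list in place before stepping through it; for instance, for the newdir subdirectory used earlier:

>>> names = os.listdir('newdir')
>>> names.sort()                           # order the names in place
>>> names
['more', 'temp1', 'temp2', 'temp3']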

To compare all three listing techniques, let’s run them here side by side on an explicit directory. They differ in some ways but are mostly just variations on a theme—os.popen sorts names and returns end-of-lines, glob.glob accepts a pattern and returns filenames with directory prefixes, and os.listdir takes a simple directory name and returns names without directory prefixes:

>>> os.popen('ls C:\PP3rdEd').readlines( )
['README.txt\n', 'cdrom\n', 'chapters\n', 'etc\n', 'examples\n',
'examples.tar.gz\n', 'figures\n', 'shots\n']

>>> glob.glob('C:\PP3rdEd\*')
['C:\\PP3rdEd\\examples.tar.gz', 'C:\\PP3rdEd\\README.txt',
'C:\\PP3rdEd\\shots', 'C:\\PP3rdEd\\figures', 'C:\\PP3rdEd\\examples',
'C:\\PP3rdEd\\etc', 'C:\\PP3rdEd\\chapters', 'C:\\PP3rdEd\\cdrom']

>>> os.listdir('C:\PP3rdEd')
['examples.tar.gz', 'README.txt', 'shots', 'figures', 'examples', 'etc',
'chapters', 'cdrom']

Of these three, glob and listdir are generally better options if you care about script portability, and listdir seems fastest in recent Python releases (but gauge its performance yourself—implementations may change over time).
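
If you want to gauge the difference on your own machine, a rough timing sketch along the following lines will do; absolute results vary by platform, shell, and Python release, and the dir command again assumes Windows:

import os, glob, time

def timer(func, reps=100):                 # crude wall-clock timer
    start = time.time()
    for i in range(reps):
        func()
    return time.time() - start

print 'popen:  ', timer(lambda: os.popen('dir /B').readlines())
print 'glob:   ', timer(lambda: glob.glob('*'))
print 'listdir:', timer(lambda: os.listdir('.'))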

Splitting and joining listing results

In the last example, I pointed out that glob returns names with directory paths, whereas listdir gives raw base filenames. For convenient processing, scripts often need to split glob results into base files or expand listdir results into full paths. Such translations are easy if we let the os.path module do all the work for us. For example, a script that intends to copy all files elsewhere will typically need to first split off the base filenames from glob results so that it can add different directory names on the front:

>>> dirname = r'C:\PP3rdEd'
>>> for file in glob.glob(dirname + '/*'):
...     head, tail = os.path.split(file)
...     print head, tail, '=>', ('C:\\Other\\' + tail)
...
C:\PP3rdEd examples.tar.gz => C:\Other\examples.tar.gz
C:\PP3rdEd README.txt => C:\Other\README.txt
C:\PP3rdEd shots => C:\Other\shots
C:\PP3rdEd figures => C:\Other\figures
C:\PP3rdEd examples => C:\Other\examples
C:\PP3rdEd etc => C:\Other\etc
C:\PP3rdEd chapters => C:\Other\chapters
C:\PP3rdEd cdrom => C:\Other\cdrom

Here, the names after the => represent names that files might be moved to. Conversely, a script that means to process all files in a different directory than the one it runs in will probably need to prepend listdir results with the target directory name before passing filenames on to other tools:

>>> for file in os.listdir(dirname):
...     print os.path.join(dirname, file)
...
C:\PP3rdEd\examples.tar.gz
C:\PP3rdEd\README.txt
C:\PP3rdEd\shots
C:\PP3rdEd\figures
C:\PP3rdEd\examples
C:\PP3rdEd\etc
C:\PP3rdEd\chapters
C:\PP3rdEd\cdrom

Walking Directory Trees

As you read the prior section, you may have noticed that all of the preceding techniques return the names of files in only a single directory. What if you want to apply an operation to every file in every directory and subdirectory in an entire directory tree?

For instance, suppose again that we need to find every occurrence of a global name in our Python scripts. This time, though, our scripts are arranged into a module package: a directory with nested subdirectories, which may have subdirectories of their own. We could rerun our hypothetical single-directory searcher manually in every directory in the tree, but that’s tedious, error prone, and just plain not fun.

Luckily, in Python it’s almost as easy to process a directory tree as it is to inspect a single directory. We can either write a recursive routine to traverse the tree, or use one of two tree-walker utilities built into the os module. Such tools can be used to search, copy, compare, and otherwise process arbitrary directory trees on any platform that Python runs on (and that’s just about everywhere).

The os.path.walk visitor

To make it easy to apply an operation to all files in a tree hierarchy, Python comes with a utility that scans trees for us and runs a provided function at every directory along the way. The os.path.walk function is called with a directory root, function object, and optional data item, and walks the tree at the directory root and below. At each directory, the function object passed in is called with the optional data item, the name of the current directory, and a list of filenames in that directory (obtained from os.listdir). Typically, the function we provide (often referred to as a callback function) scans the filenames list to process files at each directory level in the tree.

That description might sound horribly complex the first time you hear it, but os.path.walk is fairly straightforward once you get the hang of it. In the following code, for example, the lister function is called from os.path.walk at each directory in the tree rooted at "." (the current directory). Along the way, lister simply prints the directory name and all the files at the current level (after prepending the directory name). It’s simpler in Python than in English:

>>> import os
>>> def lister(dummy, dirname, filesindir):
...     print '[' + dirname + ']'
...     for fname in filesindir:
...         print os.path.join(dirname, fname)         # handle one file
...
>>> os.path.walk('.', lister, None)
[.]
.\about-pp.html
.\python1.5.tar.gz
.\about-pp2e.html
.\about-ppr2e.html
.\newdir
[.\newdir]
.\newdir\temp1
.\newdir\temp2
.\newdir\temp3
.\newdir\more
[.\newdir\more]
.\newdir\more\xxx.txt
.\newdir\more\yyy.txt

In other words, we’ve coded our own custom (and easily changed) recursive directory listing tool in Python. Because this may be something we would like to tweak and reuse elsewhere, let’s make it permanently available in a module file, as shown in Example 4-4, now that we’ve worked out the details interactively.

Example 4-4. PP3E\System\Filetools\lister_walk.py

# list file tree with os.path.walk
import sys, os

def lister(dummy, dirName, filesInDir):              # called at each dir
    print '[' + dirName + ']'
    for fname in filesInDir:                         # includes subdir names
        path = os.path.join(dirName, fname)          # add dir name prefix
        if not os.path.isdir(path):                  # print simple files only
            print path

if __name__ == '__main__':
    os.path.walk(sys.argv[1], lister, None)          # dir name in cmdline

This is the same code except that directory names are filtered out of the filenames list by consulting the os.path.isdir test in order to avoid listing them twice (see, it’s been tweaked already). When packaged this way, the code can also be run from a shell command line. Here it is being launched from a different directory, with the directory to be listed passed in as a command-line argument:

C:\...\PP3E\System\Filetools>python lister_walk.py C:\Temp
[C:\Temp]
C:\Temp\about-pp.html
C:\Temp\python1.5.tar.gz
C:\Temp\about-pp2e.html
C:\Temp\about-ppr2e.html
[C:\Temp\newdir]
C:\Temp\newdir\temp1
C:\Temp\newdir\temp2
C:\Temp\newdir\temp3
[C:\Temp\newdir\more]
C:\Temp\newdir\more\xxx.txt
C:\Temp\newdir\more\yyy.txt

The walk paradigm also allows functions to tailor the set of directories visited by changing the file list argument in place. The library manual documents this further, but it’s probably more instructive to simply know what walk truly looks like. Here is its actual Python-coded implementation for Windows platforms (at the time of this writing), with comments added to help demystify its operation:

def walk(top, func, arg):                  # top is the current dirname
    try:
        names = os.listdir(top)            # get all file/dir names here
    except os.error:                       # they have no path prefix
        return
    func(arg, top, names)                  # run func with names list here
    exceptions = ('.', '..')
    for name in names:                     # step over the very same list
        if name not in exceptions:         # but skip self/parent names
            name = join(top, name)         # add path prefix to name
            if isdir(name):
                walk(name, func, arg)      # descend into subdirs here

Notice that walk generates filename lists at each level with os.listdir, a call that collects both file and directory names in no particular order and returns them without their directory paths. Also note that walk uses the very same list returned by os.listdir and passed to the function you provide in order to later descend into subdirectories (variable names). Because lists are mutable objects that can be changed in place, if your function modifies the passed-in filenames list, it will impact what walk does next. For example, deleting directory names will prune traversal branches, and sorting the list will order the walk.
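
For instance, here is a minimal sketch of a callback that both orders and prunes the walk; the CVS directory name skipped here is just an arbitrary example:

import os

def lister(dummy, dirname, names):
    names.sort()                           # order this level's visit and descent
    if 'CVS' in names:
        names.remove('CVS')                # prune: walk won't descend into it
    print '[' + dirname + ']'
    for name in names:
        print os.path.join(dirname, name)

os.path.walk('.', lister, None)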

The os.walk generator

In recent Python releases, a new directory tree walker has been added which does not require a callback function to be coded. This new call, os.walk, is instead a generator function; when used within a for loop, each time through it yields a tuple containing the current directory name, a list of subdirectories in that directory, and a list of nondirectory files in that directory.

Recall that generators have a .next( ) method implicitly invoked by for loops and other iteration contexts; each call forces the walker to the next directory in the tree. Essentially, os.walk replaces the os.path.walk callback function with a loop body, and so it may be easier to use (though you’ll have to judge that for yourself).

For example, suppose you have a directory tree of files and you want to find all Python source files within it that reference the Tkinter GUI module. The traditional way to accomplish this with os.path.walk requires a callback function run at each level of the tree:

>>> import os
>>> def atEachDir(matchlist, dirname, fileshere):
        for filename in fileshere:
            if filename.endswith('.py'):
                pathname = os.path.join(dirname, filename)
                if 'Tkinter' in open(pathname).read( ):
                    matchlist.append(pathname)

>>> matches = []
>>> os.path.walk(r'D:\PP3E', atEachDir, matches)
>>> matches
['D:\\PP3E\\dev\\examples\\PP3E\\Preview\\peoplegui.py', 'D:\\PP3E\\dev\\
examples\\PP3E\\Preview\\tkinter101.py', 'D:\\PP3E\\dev\\examples\\PP3E\\
Preview\\tkinter001.py', 'D:\\PP3E\\dev\\examples\\PP3E\\Preview\\
peoplegui_class.py', 'D:\\PP3E\\dev\\examples\\PP3E\\Preview\\
tkinter102.py', 'D:\\PP3E\\NewExamples\\clock.py', 'D:\\PP3E\\NewExamples
\\calculator.py']

This code loops through all the files at each level, looking for files whose names end in .py and whose contents include the search string. When a match is found, its full name is appended to the results list object, which is passed in as an argument (we could also just build a list of .py files and search each in a for loop after the walk). The equivalent os.walk code is similar, but the callback function’s code becomes the body of a for loop, and directory names are filtered out for us:

>>> import os
>>> matches = []
>>> for (dirname, dirshere, fileshere) in os.walk(r'D:\PP3E'):
        for filename in fileshere:
            if filename.endswith('.py'):
                pathname = os.path.join(dirname, filename)
                if 'Tkinter' in open(pathname).read( ):
                    matches.append(pathname)

>>> matches
['D:\\PP3E\\dev\\examples\\PP3E\\Preview\\peoplegui.py', 'D:\\PP3E\\dev\\examples\\
PP3E\\Preview\\tkinter101.py', 'D:\\PP3E\\dev\\examples\\PP3E\\Preview\\
tkinter001.py', 'D:\\PP3E\\dev\\examples\\PP3E\\Preview\\peoplegui_class.py', 'D:\\
PP3E\\dev\\examples\\PP3E\\Preview\\tkinter102.py', 'D:\\PP3E\\NewExamples\\
clock.py', 'D:\\PP3E\\NewExamples\\calculator.py']

If you want to see what’s really going on in the os.walk generator, call its next( ) method manually a few times as the for loop does automatically; each time, you advance to the next subdirectory in the tree:

>>> gen = os.walk('D:\PP3E')
>>> gen.next( )
('D:\\PP3E', ['proposal', 'dev', 'NewExamples', 'bkp'], ['prg-python-2.zip'])
>>> gen.next( )
('D:\\PP3E\\proposal', [], ['proposal-programming-python-3e.doc'])
>>> gen.next( )
('D:\\PP3E\\dev', ['examples'], ['ch05.doc', 'ch06.doc', 'ch07.doc', 'ch08.doc',
 'ch09.doc', 'ch10.doc', 'ch11.doc', 'ch12.doc', 'ch13.doc', 'ch14.doc', ...more...

The os.walk generator has more features than I will demonstrate here. For instance, additional arguments allow you to specify a top-down or bottom-up traversal of the directory tree, and the list of subdirectories in the yielded tuple can be modified in-place to change the traversal in top-down mode, much as for os.path.walk. See the Python library manual for more details.
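
As a brief sketch of those two features, the following first prunes a subdirectory in-place during a top-down walk and then repeats the traversal bottom-up; the .svn directory name pruned here is just an arbitrary example:

import os

for (dirname, subshere, fileshere) in os.walk('.'):
    if '.svn' in subshere:
        subshere.remove('.svn')            # prune: skip this branch (top-down only)
    print dirname

for (dirname, subshere, fileshere) in os.walk('.', topdown=False):
    print dirname                          # subdirectories shown before their parents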

So why the new call? Is the new os.walk easier to use than the traditional os.path.walk? Perhaps, if you need to distinguish between subdirectories and files in each directory (os.walk gives us two lists rather than one) or can make use of a bottom-up traversal or other features. Otherwise, it’s mostly just the trade of a function for a for loop header. You’ll have to judge for yourself whether this is more natural or not; we’ll use both forms in this book.

Recursive os.listdir traversals

The os.path.walk and os.walk tools do tree traversals for us, but it’s sometimes more flexible and hardly any more work to do it ourselves. The following script recodes the directory listing script with a manual recursive traversal function (a function that calls itself to repeat its actions). The mylister function in Example 4-5 is almost the same as lister in Example 4-4 but calls os.listdir to generate file paths manually and calls itself recursively to descend into subdirectories.

Example 4-5. PP3E\System\Filetools\lister_recur.py

# list files in dir tree by recursion
import sys, os

def mylister(currdir):
    print '[' + currdir + ']'
    for file in os.listdir(currdir):              # list files here
        path = os.path.join(currdir, file)        # add dir path back
        if not os.path.isdir(path):
            print path
        else:
            mylister(path)                        # recur into subdirs

if __name__ == '__main__':
    mylister(sys.argv[1])                         # dir name in cmdline

This version is packaged as a script too (this is definitely too much code to type at the interactive prompt); its output is identical when run as a script:

C:\...\PP3E\System\Filetools>python lister_recur.py C:\Temp
[C:\Temp]
C:\Temp\about-pp.html
C:\Temp\python1.5.tar.gz
C:\Temp\about-pp2e.html
C:\Temp\about-ppr2e.html
[C:\Temp\newdir]
C:\Temp\newdir\temp1
C:\Temp\newdir\temp2
C:\Temp\newdir\temp3
[C:\Temp\newdir\more]
C:\Temp\newdir\more\xxx.txt
C:\Temp\newdir\more\yyy.txt

But this file is just as useful when imported and called elsewhere:

C:\temp>python
>>> from PP3E.System.Filetools.lister_recur import mylister
>>> mylister('.')
[.]
.\about-pp.html
.\python1.5.tar.gz
.\about-pp2e.html
.\about-ppr2e.html
[.\newdir]
.\newdir\temp1
.\newdir\temp2
.\newdir\temp3
[.\newdir\more]
.\newdir\more\xxx.txt
.\newdir\more\yyy.txt

We will make better use of most of this section’s techniques in later examples in Chapter 7 and in this book at large. For example, scripts for copying and comparing directory trees use the tree-walker techniques listed previously. Watch for these tools in action along the way. If you are interested in directory processing, also see the discussion of Python’s old grep module in Chapter 7; it searches files and can be applied to all files in a directory when combined with the glob module, but it simply prints results and does not traverse directory trees by itself.

Rolling Your Own find Module

Another way to go hierarchical is to collect files into a flat list all at once. In the second edition of this book, I included a section on the now-defunct find standard library module, which was used to collect a list of matching filenames in an entire directory tree (much like a Unix find command). Unlike the single-directory tools described earlier, find returned a flat list of pathnames for matching files nested in subdirectories all the way to the bottom of a tree.

This module is now gone; the os.walk and os.path.walk tools described earlier are recommended as easier-to-use alternatives. On the other hand, it’s not completely clear why the standard find module fell into deprecation; it’s a useful tool. In fact, I used it often—it is nice to be able to grab a simple linear list of matching files in a single function call and step through it in a for loop. The alternatives still seem a bit more code-y and tougher for beginners to digest.

Not to worry though, because instead of lamenting the loss of a module, I decided to spend 10 minutes whipping up a custom equivalent. In fact, one of the nice things about Python is that it is usually easy to do by hand what a built-in tool does for you; many built-ins are just conveniences. The module in Example 4-6 uses the standard os.path.walk call described earlier to reimplement a find operation for use in Python scripts.

Example 4-6. PP3E\PyTools\find.py

#!/usr/bin/python
##############################################################################
# custom version of the now deprecated find module in the standard library:
# import as "PyTools.find"; equivalent to the original, but uses os.path.walk,
# has no support for pruning subdirs in the tree, and is instrumented to be
# runnable as a top-level script; uses tuple unpacking in function arguments;
##############################################################################

import fnmatch, os

def find(pattern, startdir=os.curdir):
    matches = []
    os.path.walk(startdir, findvisitor, (matches, pattern))
    matches.sort( )
    return matches

def findvisitor((matches, pattern), thisdir, nameshere):
    for name in nameshere:
        if fnmatch.fnmatch(name, pattern):
            fullpath = os.path.join(thisdir, name)
            matches.append(fullpath)

if __name__ == '__main__':
    import sys
    namepattern, startdir = sys.argv[1], sys.argv[2]
    for name in find(namepattern, startdir): print name

There’s not much to this file; but calling its find function provides the same utility as the deprecated find standard module and is noticeably easier than rewriting all of this file’s code every time you need to perform a find-type search. Because this file is instrumented to be both a script and a library, it can be run or called.

For instance, to process every Python file in the directory tree rooted in the current working directory, I simply run the following command line from a system console window. I’m piping the script’s standard output into the more command to page it here, but it can be piped into any processing program that reads its input from the standard input stream:

python find.py *.py . | more

For more control, run the following sort of Python code from a script or interactive prompt (you can also pass in an explicit start directory if you prefer). In this mode, you can apply any operation to the found files that the Python language provides:

from PP3E.PyTools import find
for name in find.find('*.py'):
    ...do something with name...

Notice how this avoids the nested loop structure you wind up coding with os.walk and the callback functions you implement for os.path.walk (see the earlier examples), making it seem conceptually simpler. Its only obvious downside is that your script must wait until all matching files have been found and collected; os.walk yields results as it goes, and os.path.walk calls your function along the way.
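
If that delay matters, a hypothetical generator-based variant (not part of Example 4-6) built on os.walk yields each match as the walk reaches it, at the cost of giving up the simple flat-list result:

import fnmatch, os

def findgen(pattern, startdir=os.curdir):
    for (thisdir, subshere, fileshere) in os.walk(startdir):
        for name in fileshere:
            if fnmatch.fnmatch(name, pattern):
                yield os.path.join(thisdir, name)     # produce matches as found

for name in findgen('*.py'):
    print name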

Here’s a more concrete example of our find module at work: the following system command line lists all Python files in directory D:\PP3E whose names begin with the letter c or t (it’s being run in the same directory as the find.py file). Notice that find returns full directory paths that begin with the start directory specification.

C:\Python24>python find.py [ct]*.py D:\PP3E
D:\PP3E\NewExamples\calculator.py
D:\PP3E\NewExamples\clock.py
D:\PP3E\NewExamples\commas.py
D:\PP3E\dev\examples\PP3E\Preview\tkinter001.py
D:\PP3E\dev\examples\PP3E\Preview\tkinter101.py
D:\PP3E\dev\examples\PP3E\Preview\tkinter102.py

And here’s some Python code that does the same find but also extracts base names and file sizes for each file found:

>>> import os
>>> from find import find
>>> for name in find('[ct]*.py', r'D:\PP3E'):
...     print os.path.basename(name), '=>', os.path.getsize(name)
...
calculator.py => 14101
clock.py => 11000
commas.py => 2508
tkinter001.py => 62
tkinter101.py => 235
tkinter102.py => 421

As a more useful example, I use the following simple script to clean out any old output text files located anywhere in the book examples tree. I usually run this script from the example’s root directory. I don’t really need the full path to the find module in the import here because it is in the same directory as this script itself; if I ever move this script, though, the full path will be required:

C:\...\PP3E>type PyTools\cleanoutput.py
import os                                  # delete old output files in tree
from PP3E.PyTools.find import find         # only need full path if I'm moved
for filename in find('*.out.txt'):
    print filename
    if raw_input('View?') == 'y':
        print open(filename).read( )
    if raw_input('Delete?') == 'y':
        os.remove(filename)

C:\temp\examples>python %X%\PyTools\cleanoutput.py
.\Internet\Cgi-Web\Basics\languages.out.txt
View?
Delete?
.\Internet\Cgi-Web\PyErrata\AdminTools\dbaseindexed.out.txt
View?
Delete?y

To achieve such code economy, the custom find module calls os.path.walk to register a function to be called per directory in the tree and simply adds matching filenames to the result list along the way.

New here, though, is the fnmatch module—yet another Python standard library module that performs Unix-like pattern matching against filenames. This module supports common operators in name pattern strings: * (to match any number of characters), ? (to match any single character), and [...] and [!...] (to match any character inside the bracket pairs, or not); other characters match themselves.[*] If you haven’t already noticed, the standard library is a fairly amazing collection of tools.
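
In fact, fnmatch can be used on its own, apart from any directory scanning; it simply matches a single name string against a pattern and returns a true or false result:

>>> import fnmatch
>>> fnmatch.fnmatch('lister_walk.py', '*.py')
True
>>> fnmatch.fnmatch('lister_walk.py', 'lister_?alk.py')
True
>>> fnmatch.fnmatch('about-pp.html', '[ct]*.py')
False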

Incidentally, find.find is also roughly equivalent to platform-specific shell commands such as find -print on Unix and Linux, and dir /B /S on DOS and Windows. Since we can usually run such shell commands in a Python script with os.popen, the following does the same work as find.find but is inherently nonportable and must start up a separate program along the way:

>>> import os
>>> for line in os.popen('dir /B /S').readlines( ): print line,
...
C:\temp\about-pp.html
C:\temp\about-pp2e.html
C:\temp\about-ppr2e.html
C:\temp\newdir
C:\temp\newdir\temp1
C:\temp\newdir\temp2
C:\temp\newdir\more
C:\temp\newdir\more\xxx.txt

The equivalent Python metaphors, however, work unchanged across platforms—one of the implicit benefits of writing system utilities in Python:

C:\...>python find.py * .

>>> from find import find
>>> for name in find(pattern='*', startdir='.'): print name

Finally, if you come across older Python code that fails because there is no standard library find to be found, simply change find-module imports in the source code to, say:

from PP3E.PyTools import find

rather than:

import find

The former form will find the custom find module in the book’s example package directory tree. And if you are willing to add the PP3E\PyTools directory to your PYTHONPATH setting, all original import find statements will continue to work unchanged.

Better still, do nothing at all—most find-based examples in this book automatically pick the alternative by catching import exceptions just in case they are run on a more modern Python and their top-level files aren’t located in the PyTools directory:

try:
    import find
except ImportError:
    from PP3E.PyTools import find

The find module may be gone, but it need not be forgotten.



[*] In fact, glob just uses the standard fnmatch module to match name patterns; see the fnmatch description later in this chapter for more details.

[*] Unlike the re module, fnmatch supports only common Unix shell matching operators, not full-blown regular expression patterns; to understand why this matters, see Chapter 18 for more details.
