One of the more common tasks in the shell utilities domain is applying an operation to a set of files in a directory—a “folder” in Windows-speak. By running a script on a batch of files, we can automate (that is, script) tasks we might have to otherwise run repeatedly by hand.
For instance, suppose you need to search all of your Python
files in a development directory for a global variable name (perhaps
you’ve forgotten where it is used). There are many platform-specific
ways to do this (e.g., the grep
command in Unix), but Python scripts that accomplish such tasks will
work on every platform where Python works—Windows, Unix, Linux,
Macintosh, and just about any other platform commonly used today. If
you simply copy your script to any machine you wish to use it on, it
will work regardless of which other tools are available there.
The most common way to go about writing such tools is to first
grab a list of the names of the files you wish to process, and then
step through that list with a Python for
loop, processing each file in turn.
The trick we need to learn here, then, is how to get such a
directory list within our scripts. There are at least three options:
running shell listing commands with os.popen, matching filename
patterns with glob.glob, and getting directory listings with
os.listdir. They vary in interface, result format, and portability.
Quick: how did you go about getting directory file listings before you heard of Python? If you’re new to shell tools programming, the answer may be “Well, I started a Windows file explorer and clicked on stuff,” but I’m thinking here in terms of less GUI-oriented command-line mechanisms (and answers submitted in Perl and Tcl get only partial credit).
On Unix, directory listings are usually obtained by typing
ls
in a shell; on Windows, they
can be generated with a dir
command typed in an MS-DOS console box. Because Python scripts may
use os.popen
to run any command
line that we can type in a shell, they are the most general way to
grab a directory listing inside a Python program. We met os.popen
in the prior chapter; it runs a
shell command string and gives us a file object from which we can
read the command’s output. To illustrate, let’s first assume the
following directory structures (yes, I have both dir
and ls
commands on my Windows laptop; old
habits die hard):
C:\temp>dir /B
about-pp.html
python1.5.tar.gz
about-pp2e.html
about-ppr2e.html
newdir

C:\temp>ls
about-pp.html about-ppr2e.html python1.5.tar.gz about-pp2e.html newdir

C:\temp>ls newdir
more temp1 temp2 temp3
The newdir name is a nested subdirectory in C:\temp here. Now, scripts can grab a listing of file and directory names at this level by simply spawning the appropriate platform-specific command line and reading its output (the text normally thrown up on the console window):
C:\temp>python
>>> import os
>>> os.popen('dir /B').readlines()
['about-pp.html\n', 'python1.5.tar.gz\n', 'about-pp2e.html\n', 'about-ppr2e.html\n', 'newdir\n']
Lines read from a shell command come back with a trailing
end-of-line character, but it’s easy enough to slice off with a
for
loop or list comprehension
expression as in the following code:
>>> for line in os.popen('dir /B').readlines():
...     print line[:-1]
...
about-pp.html
python1.5.tar.gz
about-pp2e.html
about-ppr2e.html
newdir
>>> lines = [line[:-1] for line in os.popen('dir /B')]
>>> lines
['about-pp.html', 'python1.5.tar.gz', 'about-pp2e.html', 'about-ppr2e.html', 'newdir']
One subtle thing: notice that the object returned by
os.popen
has an iterator that
reads one line per request (i.e., per next( )
method call), just like normal
files, so calling the readlines
method is optional here unless you really need to extract the
result list all at once (see the discussion of file iterators
earlier in this chapter). For pipe objects, the effect of
iterators is even more useful than simply avoiding loading the
entire result into memory all at once: readlines
will block the caller until
the spawned program is completely finished, whereas the iterator
might not.
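As a minimal sketch of the iterator form described above, the following reads a command's output one line at a time. It is written in Python 3 syntax (unlike the Python 2 listings in this section), and it assumes that an echo command is available in the platform's shell, which holds for both Unix shells and the Windows console. It strips the end-of-line with rstrip rather than slicing, which is a swapped-in variation that also copes with a final line that has no newline:

```python
import os

# Assumption: 'echo' exists in the platform shell (Unix shells and cmd.exe).
pipe = os.popen('echo spam')

# Iterate the pipe object directly: one line per step, no readlines() call.
lines = [line.rstrip('\n') for line in pipe]
pipe.close()

print(lines)
```

The same loop works for dir or ls command lines; only the command string changes.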
The dir
and ls
commands let us be specific about
filename patterns to be matched and directory names to be listed;
again, we’re just running shell commands here, so anything you can
type at a shell prompt goes:
>>> os.popen('dir *.html /B').readlines()
['about-pp.html\n', 'about-pp2e.html\n', 'about-ppr2e.html\n']
>>> os.popen('ls *.html').readlines()
['about-pp.html\n', 'about-pp2e.html\n', 'about-ppr2e.html\n']
>>> os.popen('dir newdir /B').readlines()
['temp1\n', 'temp2\n', 'temp3\n', 'more\n']
>>> os.popen('ls newdir').readlines()
['more\n', 'temp1\n', 'temp2\n', 'temp3\n']
These calls use general tools and work as advertised. As I
noted earlier, though, the downsides of os.popen
are that it requires using a
platform-specific shell command and it incurs a performance hit to
start up an independent program. The following two alternative
techniques do better on both counts.
The term globbing comes from
the *
wildcard character in
filename patterns; per computing folklore, a *
matches a “glob” of characters. In
less poetic terms, globbing simply means collecting the names of
all entries in a directory—files and subdirectories—whose names
match a given filename pattern. In Unix shells, globbing expands
filename patterns within a command line into all matching
filenames before the command is ever run. In Python, we can do
something similar by calling the glob.glob function in the standard
library's glob module with a pattern to expand:
>>> import glob
>>> glob.glob('*')
['about-pp.html', 'python1.5.tar.gz', 'about-pp2e.html', 'about-ppr2e.html', 'newdir']
>>> glob.glob('*.html')
['about-pp.html', 'about-pp2e.html', 'about-ppr2e.html']
>>> glob.glob('newdir/*')
['newdir\\temp1', 'newdir\\temp2', 'newdir\\temp3', 'newdir\\more']
The glob call accepts the usual filename pattern syntax used in shells
(e.g., ? means any one character, * means any number of characters,
and [] is a character selection set).[*] The pattern should include a
directory path if you wish to glob in something other than the current
working directory, and the module accepts either Unix or DOS-style
directory separators (/ or \). Also, this call is implemented without
spawning a shell command and so is likely to be faster and more
portable across all Python platforms than the os.popen schemes shown
earlier.
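To make the three pattern operators concrete, here is a small sketch in Python 3 syntax. The filenames and the names helper are hypothetical, invented only for this illustration; the files are created in a scratch directory so the example does not depend on what is in the current directory:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical sample files, created empty just for the demonstration.
    for name in ('a1.html', 'a2.html', 'b1.html', 'notes.txt'):
        open(os.path.join(root, name), 'w').close()

    def names(pattern):
        # glob results carry the directory prefix; strip it and sort,
        # because glob makes no promise about result order.
        matches = glob.glob(os.path.join(root, pattern))
        return sorted(os.path.basename(path) for path in matches)

    star = names('*.html')         # * matches any number of characters
    question = names('a?.html')    # ? matches exactly one character
    charset = names('[ab]1.html')  # [] matches one character from a set

print(star)
print(question)
print(charset)
```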
Technically speaking, glob
is a bit more powerful than
described so far. In fact, using it to list files in one directory
is just one use of its pattern-matching skills. For instance, it
can also be used to collect matching names across multiple
directories, simply because each level in a passed-in directory
path can be a pattern too:
C:\temp>python
>>> import glob
>>> for name in glob.glob('*examples/L*.py'): print name
...
cpexamples\Launcher.py
cpexamples\Launch_PyGadgets.py
cpexamples\LaunchBrowser.py
cpexamples\launchmodes.py
examples\Launcher.py
examples\Launch_PyGadgets.py
examples\LaunchBrowser.py
examples\launchmodes.py
>>> for name in glob.glob(r'*\*\visitor_find*.py'): print name
...
cpexamples\PyTools\visitor_find.py
cpexamples\PyTools\visitor_find_quiet2.py
cpexamples\PyTools\visitor_find_quiet1.py
examples\PyTools\visitor_find.py
examples\PyTools\visitor_find_quiet2.py
examples\PyTools\visitor_find_quiet1.py
In the first call here, we get back filenames from two
different directories that match the *examples
pattern; in the second, both
of the first directory levels are wildcards, so Python collects
all possible ways to reach the base filenames. Using os.popen
to spawn shell commands
achieves the same effect only if the underlying shell or listing
command does too.
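The multiple-directory behavior can be tried without the book's examples tree. The following Python 3 sketch builds a throwaway tree whose directory names (examples, cpexamples, other) are assumptions chosen to mirror the session above, and then applies a pattern at the directory level as well as the file level:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Three sibling directories, each holding an empty Launcher.py.
    for sub in ('examples', 'cpexamples', 'other'):
        os.mkdir(os.path.join(root, sub))
        open(os.path.join(root, sub, 'Launcher.py'), 'w').close()

    # The directory level is a pattern too: only *examples dirs qualify.
    pattern = os.path.join(root, '*examples', 'L*.py')
    found = sorted(os.path.relpath(path, root) for path in glob.glob(pattern))

print(found)
```

The file in the directory named other is skipped because that directory's name does not match the *examples part of the pattern.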
The os
module’s
listdir
call provides yet
another way to collect filenames in a Python list. It takes a
simple directory name string, not a filename pattern, and returns
a list containing the names of all entries in that directory—both
simple files and nested directories—for use in the calling
script:
>>> os.listdir('.')
['about-pp.html', 'python1.5.tar.gz', 'about-pp2e.html', 'about-ppr2e.html', 'newdir']
>>> os.listdir(os.curdir)
['about-pp.html', 'python1.5.tar.gz', 'about-pp2e.html', 'about-ppr2e.html', 'newdir']
>>> os.listdir('newdir')
['temp1', 'temp2', 'temp3', 'more']
This too is done without resorting to shell commands and so
is portable to all major Python platforms. The result list is not in
any particular order (but can be sorted with the list sort
method), contains base filenames without their directory path
prefixes, and includes the names of both files and directories at
the listed level.
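Because the result mixes files and directories in arbitrary order, scripts often sort it and classify each entry. A minimal sketch in Python 3 syntax follows; the entry names are hypothetical and are created in a scratch directory:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical contents: one subdirectory and two empty files.
    os.mkdir(os.path.join(root, 'newdir'))
    for name in ('b.txt', 'a.txt'):
        open(os.path.join(root, name), 'w').close()

    # listdir order is arbitrary, so sort explicitly.
    entries = sorted(os.listdir(root))

    # Names come back without a path prefix; add it back before testing.
    subdirs = [n for n in entries if os.path.isdir(os.path.join(root, n))]
    files = [n for n in entries if n not in subdirs]

print(entries)
print(subdirs)
print(files)
```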
To compare all three listing techniques, let’s run them here
side by side on an explicit directory. They differ in some ways
but are mostly just variations on a theme: os.popen returns whatever
the spawned command prints (here, ls sorts the names) with
end-of-lines still attached, glob.glob accepts a pattern and returns
filenames with directory prefixes, and os.listdir takes a simple
directory name and returns names without directory prefixes:
>>> os.popen(r'ls C:\PP3rdEd').readlines()
['README.txt\n', 'cdrom\n', 'chapters\n', 'etc\n', 'examples\n', 'examples.tar.gz\n', 'figures\n', 'shots\n']
>>> glob.glob(r'C:\PP3rdEd\*')
['C:\\PP3rdEd\\examples.tar.gz', 'C:\\PP3rdEd\\README.txt', 'C:\\PP3rdEd\\shots', 'C:\\PP3rdEd\\figures', 'C:\\PP3rdEd\\examples', 'C:\\PP3rdEd\\etc', 'C:\\PP3rdEd\\chapters', 'C:\\PP3rdEd\\cdrom']
>>> os.listdir(r'C:\PP3rdEd')
['examples.tar.gz', 'README.txt', 'shots', 'figures', 'examples', 'etc', 'chapters', 'cdrom']
Of these three, glob
and
listdir
are generally better
options if you care about script portability, and listdir
seems fastest in recent Python
releases (but gauge its performance yourself—implementations may
change over time).
In the last example, I pointed out that glob
returns names with directory paths,
whereas listdir
gives raw base
filenames. For convenient processing, scripts often need to split
glob
results into base files or
expand listdir
results into
full paths. Such translations are easy if we let the os.path
module do all the work for us.
For example, a script that intends to copy all files elsewhere
will typically need to first split off the base filenames from
glob
results so that it can add
different directory names on the front:
>>> dirname = r'C:\PP3rdEd'
>>> for file in glob.glob(dirname + '/*'):
...     head, tail = os.path.split(file)
...     print head, tail, '=>', ('C:\\Other\\' + tail)
...
C:\PP3rdEd examples.tar.gz => C:\Other\examples.tar.gz
C:\PP3rdEd README.txt => C:\Other\README.txt
C:\PP3rdEd shots => C:\Other\shots
C:\PP3rdEd figures => C:\Other\figures
C:\PP3rdEd examples => C:\Other\examples
C:\PP3rdEd etc => C:\Other\etc
C:\PP3rdEd chapters => C:\Other\chapters
C:\PP3rdEd cdrom => C:\Other\cdrom
Here, the names after the =>
represent names that files might
be moved to. Conversely, a script that means to process all files
in a different directory than the one it runs in will probably
need to prepend listdir
results
with the target directory name before passing filenames on to
other tools:
>>> for file in os.listdir(dirname):
...     print os.path.join(dirname, file)
...
C:\PP3rdEd\examples.tar.gz
C:\PP3rdEd\README.txt
C:\PP3rdEd\shots
C:\PP3rdEd\figures
C:\PP3rdEd\examples
C:\PP3rdEd\etc
C:\PP3rdEd\chapters
C:\PP3rdEd\cdrom
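The split-then-join step can be wrapped in a small helper. The sketch below uses Python 3 syntax; the retarget function name and the relative directory names are hypothetical, chosen only to echo the sessions above. Using os.path.join for the new name, rather than string concatenation with a hardcoded backslash, keeps the code portable across platforms:

```python
import os

def retarget(path, newdir):
    """Return path's base filename joined onto a different directory."""
    head, tail = os.path.split(path)   # head is the directory, tail the file
    return os.path.join(newdir, tail)  # join supplies the right separator

# Hypothetical relative paths, built portably for the demonstration.
source = os.path.join('PP3rdEd', 'README.txt')
moved = retarget(source, 'Other')

print(source, '=>', moved)
```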
As you read the prior section, you may have noticed that all of the preceding techniques return the names of files in only a single directory. What if you want to apply an operation to every file in every directory and subdirectory in an entire directory tree?
For instance, suppose again that we need to find every occurrence of a global name in our Python scripts. This time, though, our scripts are arranged into a module package: a directory with nested subdirectories, which may have subdirectories of their own. We could rerun our hypothetical single-directory searcher manually in every directory in the tree, but that’s tedious, error prone, and just plain not fun.
Luckily, in Python it’s almost as easy to process a directory
tree as it is to inspect a single directory. We can either write a
recursive routine to traverse the tree, or use one of two
tree-walker utilities built into the os
module. Such tools can be used to
search, copy, compare, and otherwise process arbitrary directory
trees on any platform that Python runs on (and that’s just about
everywhere).
To make it easy to apply an operation to all files
in a tree hierarchy, Python comes with a utility that scans trees
for us and runs a provided function at every directory along the
way. The os.path.walk function is called with a directory root,
function object, and optional data item, and walks the tree at the
directory root and below. At each directory, the function object
passed in is called with the optional data item, the name of the
current directory, and a list of filenames in that directory
(obtained from os.listdir). Typically, the function we provide (often
referred to as a callback function) scans the filenames list to
process files at each directory level in the tree.
That description might sound horribly complex the first time
you hear it, but os.path.walk
is fairly straightforward once you get the hang of it. In the
following code, for example, the lister function is called from
os.path.walk at each directory in the tree rooted at the current
directory ('.'). Along the way, lister simply prints the directory
name and all the files at the current level (after prepending the
directory name). It’s simpler in Python than in English:
>>> import os
>>> def lister(dummy, dirname, filesindir):
...     print '[' + dirname + ']'
...     for fname in filesindir:
...         print os.path.join(dirname, fname)         # handle one file
...
>>> os.path.walk('.', lister, None)
[.]
.\about-pp.html
.\python1.5.tar.gz
.\about-pp2e.html
.\about-ppr2e.html
.\newdir
[.\newdir]
.\newdir\temp1
.\newdir\temp2
.\newdir\temp3
.\newdir\more
[.\newdir\more]
.\newdir\more\xxx.txt
.\newdir\more\yyy.txt
In other words, we’ve coded our own custom (and easily changed) recursive directory listing tool in Python. Because this may be something we would like to tweak and reuse elsewhere, let’s make it permanently available in a module file, as shown in Example 4-4, now that we’ve worked out the details interactively.
Example 4-4. PP3E\System\Filetools\lister_walk.py
# list file tree with os.path.walk
import sys, os

def lister(dummy, dirName, filesInDir):              # called at each dir
    print '[' + dirName + ']'
    for fname in filesInDir:                         # includes subdir names
        path = os.path.join(dirName, fname)          # add dir name prefix
        if not os.path.isdir(path):                  # print simple files only
            print path

if __name__ == '__main__':
    os.path.walk(sys.argv[1], lister, None)          # dir name in cmdline
This is the same code except that directory names are
filtered out of the filenames list by consulting the os.path.isdir
test in order to avoid
listing them twice (see, it’s been tweaked already). When packaged
this way, the code can also be run from a shell command line. Here
it is being launched from a different directory, with the
directory to be listed passed in as a command-line
argument:
C:\...\PP3E\System\Filetools>python lister_walk.py C:\Temp
[C:\Temp]
C:\Temp\about-pp.html
C:\Temp\python1.5.tar.gz
C:\Temp\about-pp2e.html
C:\Temp\about-ppr2e.html
[C:\Temp\newdir]
C:\Temp\newdir\temp1
C:\Temp\newdir\temp2
C:\Temp\newdir\temp3
[C:\Temp\newdir\more]
C:\Temp\newdir\more\xxx.txt
C:\Temp\newdir\more\yyy.txt
The walk
paradigm also
allows functions to tailor the set of directories visited by
changing the file list argument in place. The library manual
documents this further, but it’s probably more instructive to
simply know what walk
truly
looks like. Here is its actual Python-coded implementation for
Windows platforms (at the time of this writing), with comments
added to help demystify its operation:
def walk(top, func, arg):                  # top is the current dirname
    try:
        names = os.listdir(top)            # get all file/dir names here
    except os.error:                       # they have no path prefix
        return
    func(arg, top, names)                  # run func with names list here
    exceptions = ('.', '..')
    for name in names:                     # step over the very same list
        if name not in exceptions:         # but skip self/parent names
            name = join(top, name)         # add path prefix to name
            if isdir(name):
                walk(name, func, arg)      # descend into subdirs here
Notice that walk
generates filename lists at each level with os.listdir
, a call that collects both
file and directory names in no particular order and returns them
without their directory paths. Also note that walk uses the very same
list returned by os.listdir and passed to the function you provide in
order to later descend into subdirectories (the variable called
names). Because lists
are mutable objects that can be changed in place, if your function
modifies the passed-in filenames list, it will impact what
walk
does next. For example,
deleting directory names will prune traversal branches, and
sorting the list will order the walk.
In recent Python releases, a new directory tree walker has been added
which does not require a callback function to be coded. This new
call, os.walk, is instead a generator function; when used within a
for loop, each time through it yields a tuple containing the current
directory name, a list of subdirectories in that directory, and a
list of nondirectory files in that directory.
Recall that generators have a .next( )
method implicitly invoked by
for
loops and other iteration
contexts; each call forces the walker to the next directory in the
tree. Essentially, os.walk
replaces the os.path.walk
callback function with a loop body, and so it may be easier to use
(though you’ll have to judge that for yourself).
For example, suppose you have a directory tree of files and
you want to find all Python source files within it that reference
the Tkinter
GUI module. The
traditional way to accomplish this with os.path.walk
requires a callback
function run at each level of the tree:
>>> import os
>>> def atEachDir(matchlist, dirname, fileshere):
...     for filename in fileshere:
...         if filename.endswith('.py'):
...             pathname = os.path.join(dirname, filename)
...             if 'Tkinter' in open(pathname).read():
...                 matchlist.append(pathname)
...
>>> matches = []
>>> os.path.walk(r'D:\PP3E', atEachDir, matches)
>>> matches
['D:\\PP3E\\dev\\examples\\PP3E\\Preview\\peoplegui.py',
 'D:\\PP3E\\dev\\examples\\PP3E\\Preview\\tkinter101.py',
 'D:\\PP3E\\dev\\examples\\PP3E\\Preview\\tkinter001.py',
 'D:\\PP3E\\dev\\examples\\PP3E\\Preview\\peoplegui_class.py',
 'D:\\PP3E\\dev\\examples\\PP3E\\Preview\\tkinter102.py',
 'D:\\PP3E\\NewExamples\\clock.py',
 'D:\\PP3E\\NewExamples\\calculator.py']
This code loops through all the files at each level, looking
for files with .py at the end of their names
and which contain the search string. When a match is found, its
full name is appended to the results list object, which is passed
in as an argument (we could also just build a list of
.py files and search each in a for
loop after the walk). The equivalent
os.walk
code is similar, but
the callback function’s code becomes the body of a for
loop, and directory names are
filtered out for us:
>>> import os
>>> matches = []
>>> for (dirname, dirshere, fileshere) in os.walk(r'D:\PP3E'):
...     for filename in fileshere:
...         if filename.endswith('.py'):
...             pathname = os.path.join(dirname, filename)
...             if 'Tkinter' in open(pathname).read():
...                 matches.append(pathname)
...
>>> matches
['D:\\PP3E\\dev\\examples\\PP3E\\Preview\\peoplegui.py',
 'D:\\PP3E\\dev\\examples\\PP3E\\Preview\\tkinter101.py',
 'D:\\PP3E\\dev\\examples\\PP3E\\Preview\\tkinter001.py',
 'D:\\PP3E\\dev\\examples\\PP3E\\Preview\\peoplegui_class.py',
 'D:\\PP3E\\dev\\examples\\PP3E\\Preview\\tkinter102.py',
 'D:\\PP3E\\NewExamples\\clock.py',
 'D:\\PP3E\\NewExamples\\calculator.py']
If you want to see what’s really going on in the os.walk
generator, call its next( )
method manually a few times as
the for
loop does
automatically; each time, you advance to the next subdirectory in
the tree:
>>> gen = os.walk(r'D:\PP3E')
>>> gen.next()
('D:\\PP3E', ['proposal', 'dev', 'NewExamples', 'bkp'], ['prg-python-2.zip'])
>>> gen.next()
('D:\\PP3E\\proposal', [], ['proposal-programming-python-3e.doc'])
>>> gen.next()
('D:\\PP3E\\dev', ['examples'], ['ch05.doc', 'ch06.doc', 'ch07.doc', 'ch08.doc', 'ch09.doc', 'ch10.doc', 'ch11.doc', 'ch12.doc', 'ch13.doc', 'ch14.doc', ...more...
The os.walk generator has more features than I will demonstrate here.
For instance, additional arguments allow you to specify a top-down or
bottom-up traversal of the directory tree, and the list of
subdirectories in the yielded tuple can be modified in-place to change
the traversal in top-down mode, much as for os.path.walk. See the
Python library manual for more details.
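As a sketch of the in-place modification just mentioned, the following Python 3 code prunes one branch of a walk and orders the rest. The directory names keep, skip, and deep are hypothetical and are created in a scratch tree. The slice assignment matters: rebinding the name to a new list would leave the walker's own list untouched.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical tree: root/keep/deep and root/skip.
    for sub in ('keep', 'skip', os.path.join('keep', 'deep')):
        os.mkdir(os.path.join(root, sub))

    visited = []
    for dirname, dirshere, fileshere in os.walk(root):   # top-down by default
        # Change the walker's own list in place: drop 'skip', sort the rest.
        dirshere[:] = sorted(d for d in dirshere if d != 'skip')
        visited.append(os.path.relpath(dirname, root))

print(visited)
```

The skip directory never appears in the results because it was removed from the subdirectory list before the generator descended into it.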
So why the new call? Is the new os.walk
easier to use than the
traditional os.path.walk
?
Perhaps, if you need to distinguish between subdirectories and
files in each directory (os.walk
gives us two lists rather than
one) or can make use of a bottom-up traversal or other features.
Otherwise, it’s mostly just the trade of a function for a for
loop header. You’ll have to judge
for yourself whether this is more natural or not; we’ll use both
forms in this book.
The os.path.walk
and os.walk
tools do tree traversals for us, but it’s sometimes
more flexible and hardly any more work to do it ourselves. The
following script recodes the directory listing script with a
manual recursive traversal function (a
function that calls itself to repeat its actions). The mylister
function in Example 4-5 is almost the same
as lister
in Example 4-4 but calls os.listdir
to generate file paths
manually and calls itself recursively to descend into
subdirectories.
Example 4-5. PP3E\System\Filetools\lister_recur.py
# list files in dir tree by recursion
import sys, os

def mylister(currdir):
    print '[' + currdir + ']'
    for file in os.listdir(currdir):                 # list files here
        path = os.path.join(currdir, file)           # add dir path back
        if not os.path.isdir(path):
            print path
        else:
            mylister(path)                           # recur into subdirs

if __name__ == '__main__':
    mylister(sys.argv[1])                            # dir name in cmdline
This version is packaged as a script too (this is definitely too much code to type at the interactive prompt); its output is identical when run as a script:
C:\...\PP3E\System\Filetools>python lister_recur.py C:\Temp
[C:\Temp]
C:\Temp\about-pp.html
C:\Temp\python1.5.tar.gz
C:\Temp\about-pp2e.html
C:\Temp\about-ppr2e.html
[C:\Temp\newdir]
C:\Temp\newdir\temp1
C:\Temp\newdir\temp2
C:\Temp\newdir\temp3
[C:\Temp\newdir\more]
C:\Temp\newdir\more\xxx.txt
C:\Temp\newdir\more\yyy.txt
But this file is just as useful when imported and called elsewhere:
C:\temp>python
>>>from PP3E.System.Filetools.lister_recur import mylister
>>>mylister('.')
[.]
.\about-pp.html
.\python1.5.tar.gz
.\about-pp2e.html
.\about-ppr2e.html
[.\newdir]
.\newdir\temp1
.\newdir\temp2
.\newdir\temp3
[.\newdir\more]
.\newdir\more\xxx.txt
.\newdir\more\yyy.txt
We will make better use of most of this section’s techniques
in later examples in Chapter
7 and in this book at large. For example, scripts for
copying and comparing directory trees use the tree-walker
techniques listed previously. Watch for these tools in action
along the way. If you are interested in directory processing, also
see the discussion of Python’s old grep
module in Chapter 7; it searches files and
can be applied to all files in a directory when combined with the
glob
module, but it simply
prints results and does not traverse directory trees by
itself.
Another way to go hierarchical is to collect files
into a flat list all at once. In the second edition of this book, I
included a section on the now-defunct find
standard library module, which was
used to collect a list of matching filenames in an entire directory
tree (much like a Unix find
command). Like the single-directory tools described earlier, find
returned a flat list; unlike them, its list included the pathnames of
matching files nested in subdirectories all the way to the bottom of
a tree.
This module is now gone; the os.walk
and os.path.walk
tools described earlier are
recommended as easier-to-use alternatives. On the other hand, it’s
not completely clear why the standard find
module fell into deprecation; it’s a
useful tool. In fact, I used it often—it is nice to be able to grab
a simple linear list of matching files in a single function call and
step through it in a for
loop.
The alternatives still seem a bit more code-y and tougher for
beginners to digest.
Not to worry though, because instead of lamenting the loss of
a module, I decided to spend 10 minutes whipping up a custom
equivalent. In fact, one of the nice things about Python is that it
is usually easy to do by hand what a built-in tool does for you;
many built-ins are just conveniences. The module in Example 4-6 uses the standard
os.path.walk
call described
earlier to reimplement a find
operation for use in Python scripts.
Example 4-6. PP3E\PyTools\find.py
#!/usr/bin/python
##############################################################################
# custom version of the now deprecated find module in the standard library:
# import as "PyTools.find"; equivalent to the original, but uses os.path.walk,
# has no support for pruning subdirs in the tree, and is instrumented to be
# runnable as a top-level script; uses tuple unpacking in function arguments;
##############################################################################

import fnmatch, os

def find(pattern, startdir=os.curdir):
    matches = []
    os.path.walk(startdir, findvisitor, (matches, pattern))
    matches.sort()
    return matches

def findvisitor((matches, pattern), thisdir, nameshere):
    for name in nameshere:
        if fnmatch.fnmatch(name, pattern):
            fullpath = os.path.join(thisdir, name)
            matches.append(fullpath)

if __name__ == '__main__':
    import sys
    namepattern, startdir = sys.argv[1], sys.argv[2]
    for name in find(namepattern, startdir): print name
There’s not much to this file; but calling its find
function provides the same utility as
the deprecated find
standard
module and is noticeably easier than rewriting all of this file’s
code every time you need to perform a find-type search. Because this
file is instrumented to be both a script and a library, it can be
run or called.
For instance, to process every Python file in the directory
tree rooted in the current working directory, I simply run the
following command line from a system console window. I’m piping the
script’s standard output into the more
command to page it here, but it can
be piped into any processing program that reads its input from the
standard input stream:
python find.py *.py . | more
For more control, run the following sort of Python code from a script or interactive prompt (you can also pass in an explicit start directory if you prefer). In this mode, you can apply any operation to the found files that the Python language provides:
from PP3E.PyTools import find
for name in find.find('*.py'):
    ...do something with name...
Notice how this avoids the nested loop structure you wind up
coding with os.walk
and the
callback functions you implement for os.path.walk
(see the earlier examples),
making it seem conceptually simpler. Its only obvious downside is
that your script must wait until all matching files have been found
and collected; os.walk
yields
results as it goes, and os.path.walk
calls your function along the
way.
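If the wait matters, one possible variation is a generator that yields each match as it is found. This is a sketch only, not the book's find module: the ifind name is hypothetical, it is written in Python 3 syntax on top of os.walk rather than os.path.walk, and it does not sort its results. Like the original, it matches directory names as well as filenames. The demonstration tree is created in a scratch directory:

```python
import fnmatch
import os
import tempfile

def ifind(pattern, startdir=os.curdir):
    """Yield the path of each matching name below startdir as it is found."""
    for dirname, dirshere, fileshere in os.walk(startdir):
        for name in dirshere + fileshere:      # directories can match too
            if fnmatch.fnmatch(name, pattern):
                yield os.path.join(dirname, name)

with tempfile.TemporaryDirectory() as root:
    # Hypothetical files: two .py files at different levels, one .txt file.
    os.mkdir(os.path.join(root, 'sub'))
    for name in ('top.py', os.path.join('sub', 'nested.py'), 'readme.txt'):
        open(os.path.join(root, name), 'w').close()

    gen = ifind('*.py', root)
    first = next(gen)        # available before the rest of the tree is walked
    rest = list(gen)         # drain the remaining matches

found = sorted(os.path.relpath(path, root) for path in [first] + rest)
print(found)
```

A caller that needs the flat list can still get it with list(ifind(...)), at the cost of the same wait as before.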
Here’s a more concrete example of our find
module at work: the following system
command line lists all Python files in directory
D:\PP3E whose names begin with the letter
c or t (it’s being run in
the same directory as the find.py file). Notice
that find
returns full directory
paths that begin with the start directory specification.
C:\Python24>python find.py [ct]*.py D:\PP3E
D:\PP3E\NewExamples\calculator.py
D:\PP3E\NewExamples\clock.py
D:\PP3E\NewExamples\commas.py
D:\PP3E\dev\examples\PP3E\Preview\tkinter001.py
D:\PP3E\dev\examples\PP3E\Preview\tkinter101.py
D:\PP3E\dev\examples\PP3E\Preview\tkinter102.py
And here’s some Python code that does the same find but also extracts base names and file sizes for each file found:
>>> import os
>>> from find import find
>>> for name in find('[ct]*.py', r'D:\PP3E'):
...     print os.path.basename(name), '=>', os.path.getsize(name)
...
calculator.py => 14101
clock.py => 11000
commas.py => 2508
tkinter001.py => 62
tkinter101.py => 235
tkinter102.py => 421
As a more useful example, I use the following simple script to
clean out any old output text files located anywhere in the book
examples tree. I usually run this script from the example’s root
directory. I don’t really need the full path to the find
module in the import here because it
is in the same directory as this script itself; if I ever move this
script, though, the full path will be required:
C:\...\PP3E>type PyTools\cleanoutput.py
import os                                   # delete old output files in tree
from PP3E.PyTools.find import find          # only need full path if I'm moved

for filename in find('*.out.txt'):
    print filename
    if raw_input('View?') == 'y':
        print open(filename).read()
    if raw_input('Delete?') == 'y':
        os.remove(filename)

C:\temp\examples>python %X%\PyTools\cleanoutput.py
.\Internet\Cgi-Web\Basics\languages.out.txt
View?
Delete?
.\Internet\Cgi-Web\PyErrata\AdminTools\dbaseindexed.out.txt
View?
Delete?y
To achieve such code economy, the custom find
module calls os.path.walk
to register a function to be
called per directory in the tree and simply adds matching filenames
to the result list along the way.
New here, though, is the fnmatch module, yet another Python standard
library module, which performs Unix-like pattern matching against
filenames. This module supports common operators in name pattern
strings: * (to match any number of characters), ? (to match any
single character), and [...] and [!...] (to match any one character
that is inside the brackets, or any one that is not); other characters
match themselves.[*] If you haven’t already noticed, the standard
library is a fairly amazing collection of tools.
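The operators are easy to try on a plain list of names, with no files involved. The following sketch uses Python 3 syntax and hypothetical filenames; fnmatch.filter and fnmatch.fnmatchcase are standard fnmatch functions, the latter always comparing case-sensitively regardless of platform:

```python
import fnmatch

# Hypothetical names; fnmatch works on strings, so no files are needed.
names = ['calc.py', 'clock.py', 'tk1.py', 'Notes.txt', 'a.out.txt']

# [...] matches one character from the set, * any run of characters.
c_or_t = fnmatch.filter(names, '[ct]*.py')

# [!...] matches one character that is NOT in the set.
not_c = [n for n in names if fnmatch.fnmatchcase(n, '[!c]*.py')]

# ? matches exactly one character.
one_char = fnmatch.filter(names, 'tk?.py')

print(c_or_t)
print(not_c)
print(one_char)
```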
Incidentally, find.find is also roughly equivalent to
platform-specific shell commands such as find -print on Unix and
Linux, and dir /B /S on DOS and Windows. Since we can usually run
such shell commands in a Python script with os.popen, the following
does the same work as find.find but is inherently nonportable and
must start up a separate program along the way:
>>> import os
>>> for line in os.popen('dir /B /S').readlines(): print line,
...
C:\temp\about-pp.html
C:\temp\about-pp2e.html
C:\temp\about-ppr2e.html
C:\temp\newdir
C:\temp\newdir\temp1
C:\temp\newdir\temp2
C:\temp\newdir\more
C:\temp\newdir\more\xxx.txt
The equivalent Python metaphors, however, work unchanged across platforms—one of the implicit benefits of writing system utilities in Python:
C:\...>python find.py * .

>>> from find import find
>>> for name in find(pattern='*', startdir='.'): print name
Finally, if you come across older Python code that fails
because there is no standard library find to be found, simply change
find-module imports in the source code to, say:
from PP3E.PyTools import find
rather than:
import find
The former form will find the custom find
module in the book’s example package
directory tree. And if you are willing to add the
PP3E\PyTools directory to your PYTHONPATH
setting, all original import find
statements will continue to
work unchanged.
Better still, do nothing at all: most find-based examples in this
book automatically pick the alternative by catching import exceptions,
just in case they are run on a more modern Python and their
top-level files aren’t located in the PyTools directory:
try:
    import find
except ImportError:
    from PP3E.PyTools import find
The find
module may be
gone, but it need not be forgotten.
[*] In fact, glob
just
uses the standard fnmatch
module to match name patterns; see the fnmatch
description later in this
chapter for more details.
[*] Unlike the re
module,
fnmatch
supports only common
Unix shell matching operators, not full-blown regular expression
patterns; to understand why this matters, see Chapter 18 for more
details.