External files are at the heart of much of what we do with shell utilities. For instance, a testing system may read its inputs from one file, store program results in another file, and check expected results by loading yet another file. Even user interface and Internet-oriented programs may load binary images and audio clips from files on the underlying computer. It’s a core programming concept.
In Python, the built-in open
function is the primary tool scripts use to access the files on the
underlying computer system. Since this function is an inherent part of
the Python language, you may already be familiar with its basic
workings. Technically, open
gives
direct access to the stdio
filesystem calls in the system’s C library—it returns a new file
object that is connected to the external file and has methods that map
more or less directly to file calls on your machine. The open
function also provides a portable
interface to the underlying filesystem—it works the same way on every
platform on which Python runs.
Other file-related interfaces in Python allow us to do things
such as manipulate lower-level descriptor-based files (os
module), store objects away in files by
key (anydbm
and shelve
modules), and access SQL databases.
Most of these are larger topics addressed in Chapter 19.
In this chapter, we’ll take a brief tutorial look at the
built-in file object and explore a handful of more advanced
file-related topics. As usual, you should consult the library manual’s
file object entry for further details and methods we don’t have space
to cover here. Remember, for quick interactive help, you can also run
dir(file)
for an attributes list
with methods, help(file)
for
general help, and help(file.read)
for help on a specific method such as read
. The built-in name file
identifies
the file datatype in recent Python releases.[*]
For most purposes, the open
function is all you need to remember
to process files in your scripts. The file object returned by
open
has methods for reading data
(read
, readline
, readlines
), writing data (write
, writelines
), freeing system resources
(close
), moving about in the file
(seek
), forcing data to be
transferred out of buffers (flush
), fetching the underlying file
handle (fileno
), and more. Since
the built-in file object is so easy to use, though, let’s jump right
into a few interactive examples.
To make a new file, call open
with two arguments: the external
name of the file to be created and a
mode string w
(short for
write). To store data on the file, call the
file object’s write
method with
a string containing the data to store, and then call the close
method to close the file if you
wish to open it again within the same program or session:
C:\temp>python
>>> file = open('data.txt', 'w')          # open output file object: creates
>>> file.write('Hello file world!\n')     # writes strings verbatim
>>> file.write('Bye file world.\n')
>>> file.close()                          # closed on gc and exit too
And that’s it—you’ve just generated a brand-new text file on your computer, regardless of the computer on which you type this code:
C:\temp>dir data.txt /B
data.txt

C:\temp>type data.txt
Hello file world!
Bye file world.
There is nothing unusual about the new file; here, I use the
DOS dir
and type
commands to list and display the
new file, but it shows up in a file explorer GUI too.
In the open
function
call shown in the preceding example, the first argument can
optionally specify a complete directory path as part of the
filename string. If we pass just a simple filename without a
path, the file will appear in Python’s current working
directory. That is, it shows up in the place where the code is
run. Here, the directory C:\temp on my
machine is implied by the bare filename
data.txt, so this actually creates a file
at C:\temp\data.txt. More accurately, the
filename is relative to the current working directory if it does
not include a complete absolute directory path. See the section
"Current Working
Directory,” in Chapter
3, for a refresher on this topic.
Also note that when opening in w
mode, Python either creates the
external file if it does not yet exist or erases the file’s
current contents if it is already present on your machine (so be
careful out there—you’ll delete whatever was in the file
before).
Notice that we added an explicit \n
end-of-line character to lines
written to the file; unlike the print
statement, file write
methods write exactly what they
are passed without adding any extra formatting. The string
passed to write
shows up byte
for byte on the external file.
Output files also sport a writelines
method, which simply writes
all of the strings in a list one at a time without adding any
extra formatting. For example, here is a writelines
equivalent to the two
write
calls shown
earlier:
file.writelines(['Hello file world!\n', 'Bye file world.\n'])
This call isn’t as commonly used (and can be emulated with
a simple for
loop), but it is
convenient in scripts that save output in a list to be written
later.
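For instance, the writelines call above could be emulated with an explicit loop like the following (a minimal sketch; the lines list name is just for illustration):

lines = ['Hello file world!\n', 'Bye file world.\n']
for line in lines:              # what writelines does for us:
    file.write(line)            # write each string, no formatting added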
The file close
method
used earlier finalizes file contents and frees up system
resources. For instance, closing forces buffered output data to
be flushed out to disk. Normally, files are automatically closed
when the file object is garbage collected by the interpreter
(i.e., when it is no longer referenced) and when the Python
session or program exits. Because of that, close
calls are often optional. In
fact, it’s common to see file-processing code in Python like
this:
open('somefile.txt', 'w').write("G'day Bruce\n")
Since this expression makes a temporary file object,
writes to it immediately, and does not save a reference to it,
the file object is reclaimed and closed right away without ever
having called the close
method explicitly.
Tip
But note that this automatic close-on-reclaim behavior may
change in future Python
releases. Moreover, the Jython Java-based Python
implementation discussed later does not reclaim files as
immediately as the standard Python system (it uses Java’s
garbage collector). If your script makes many files and your
platform limits the number of open files per program, explicit
close
calls are a robust
habit to form.
Also note that some IDEs, such as Python’s standard IDLE GUI, may hold on to your file objects longer than you expect, and thus prevent them from being garbage collected. If you write to an output file in IDLE, be sure to explicitly close (or flush) your file if you need to read it back in the same IDLE session. Otherwise, output buffers won’t be flushed to disk and your file may be incomplete when read.
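When explicit closes matter, one robust pattern is to wrap file processing in a try/finally statement so that the close call runs even if an exception is raised along the way; a minimal sketch:

file = open('data.txt', 'w')
try:
    file.write('Hello file world!\n')   # any processing may raise here
finally:
    file.close()                        # runs on success and on errors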
Reading data from external files is just as easy as
writing, but there are more methods that let us load data in a
variety of modes. Input text files are opened with either a mode
flag of r
(for “read”) or no
mode flag at all—it defaults to r
if omitted, and it commonly is. Once
opened, we can read the lines of a text file with the readlines
method:
>>> file = open('data.txt', 'r')       # open input file object
>>> for line in file.readlines():      # read into line string list
...     print line,                    # lines have '\n' at end
...
Hello file world!
Bye file world.
The readlines
method
loads the entire contents of the file into memory and gives it to
our scripts as a list of line strings that we can step through in
a loop. In fact, there are many ways to read an input file:
file.read()
Returns a string containing all the bytes stored in the file

file.read(N)
Returns a string containing the next N bytes from the file

file.readline()
Reads through the next \n and returns a line string

file.readlines()
Reads the entire file and returns a list of line strings
Let’s run these method calls to read files, lines, and bytes
(more on the seek
call, used
here to rewind the file, in a moment):
>>> file.seek(0)                   # go back to the front of file
>>> file.read()                    # read entire file into string
'Hello file world!\nBye file world.\n'
>>> file.seek(0)
>>> file.readlines()
['Hello file world!\n', 'Bye file world.\n']
>>> file.seek(0)
>>> file.readline()                # read one line at a time
'Hello file world!\n'
>>> file.readline()
'Bye file world.\n'
>>> file.readline()                # empty string at end-of-file
''
>>> file.seek(0)
>>> file.read(1), file.read(8)
('H', 'ello fil')
All of these input methods let us be specific about how much to fetch. Here are a few rules of thumb about which to choose:
read() and readlines() load the entire file into memory all at once. That makes them handy for grabbing a file’s contents with as little code as possible. It also makes them very fast, but costly for huge files—loading a multigigabyte file into memory is not generally a good thing to do.

On the other hand, because the readline() and read(N) calls fetch just part of the file (the next line, or an N-byte block), they are safer for potentially big files but a bit less convenient and usually much slower. Both return an empty string when they reach end-of-file. If speed matters and your files aren’t huge, read or readlines may be a better choice.

See also the discussion of the newer file iterators in the next section. Iterators provide the convenience of readlines() with the space efficiency of readline().
By the way, the seek(0)
call used repeatedly here means “go back to the start of the
file.” In our example, it is an alternative to reopening the file
each time. In files, all read and write operations take place at
the current position; files normally start at offset 0 when opened
and advance as data is transferred. The seek
call simply lets us move to a new
position for the next transfer operation.
Python’s seek
method also
accepts an optional second argument that has one of three values—0
for absolute file positioning (the default), 1 to seek relative to
the current position, and 2 to seek relative to the file’s end.
When seek
is passed only an
offset argument of 0, as shown earlier, it’s roughly a file
rewind operation.
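For instance, given the two-line data.txt file created earlier, the whence argument lets us move around without reading everything. Here is a minimal sketch, assuming the file holds Unix-style \n line ends (binary mode keeps the byte offsets exact):

>>> file = open('data.txt', 'rb')
>>> file.seek(6, 1)          # 6 bytes forward from the current position
>>> file.read(4)
'file'
>>> file.seek(-16, 2)        # 16 bytes back from the end of the file
>>> file.read()
'Bye file world.\n'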
The traditional way to read a file line by line that you saw in the prior section:
>>> file = open('data.txt')            # open input file object
>>> for line in file.readlines():      # read into line string list
...     print line,
is actually more work than is needed today. In recent
Pythons, the file object includes an iterator which is smart
enough to grab just one more line per request in iteration
contexts such as for
loops and
list comprehensions. Iterators are simply objects with next
methods. The practical benefit of
this extension is that you no longer need to call .readlines
in a for
loop to scan line by line; the
iterator reads lines on request:
>>> file = open('data.txt')
>>> for line in file:                   # no need to call readlines
...     print line,                     # iterator reads next line each time
...
Hello file world!
Bye file world.

>>> for line in open('data.txt'):       # even shorter: temporary file object
...     print line,
...
Hello file world!
Bye file world.
Moreover, the iterator form does not load the entire file
into a list of lines all at once, so it will be more space efficient
for large text files. Because of that, this is the prescribed way
to read line by line today; when in doubt, let Python do your work
automatically. If you want to see what really happens inside the
for
loop, you can use the
iterator manually; it’s similar to calling the readline
method each time through, but
read methods return an empty string at end-of-file (EOF
), whereas the iterator raises an
exception to end the iteration:
>>> file = open('data.txt')             # read methods: empty at EOF
>>> file.readline()
'Hello file world!\n'
>>> file.readline()
'Bye file world.\n'
>>> file.readline()
''

>>> file = open('data.txt')             # iterators: exception at EOF
>>> file.next()
'Hello file world!\n'
>>> file.next()
'Bye file world.\n'
>>> file.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
StopIteration
Interestingly, iterators are automatically used in all
iteration contexts, including the list
constructor call, list
comprehension expressions, map
calls, and in
membership
checks:
>>> open('data.txt').readlines()
['Hello file world!\n', 'Bye file world.\n']
>>> list(open('data.txt'))
['Hello file world!\n', 'Bye file world.\n']
>>> lines = [line.rstrip() for line in open('data.txt')]   # or [:-1]
>>> lines
['Hello file world!', 'Bye file world.']
>>> lines = [line.upper() for line in open('data.txt')]
>>> lines
['HELLO FILE WORLD!\n', 'BYE FILE WORLD.\n']
>>> map(str.split, open('data.txt'))
[['Hello', 'file', 'world!'], ['Bye', 'file', 'world.']]
>>> line = 'Hello file world!\n'
>>> line in open('data.txt')
True
Iterators may seem somewhat implicit at first glance, but they represent the ways that Python makes developers’ lives easier over time.[*]
Besides w
and r
, most platforms support an a
open mode string, meaning “append.” In
this output mode, write
methods
add data to the end of the file, and the open
call will not erase the current
contents of the file:
>>> file = open('data.txt', 'a')        # open in append mode: doesn't erase
>>> file.write('The Life of Brian')     # added at end of existing data
>>> file.close()
>>>
>>> open('data.txt').read()             # open and read entire file
'Hello file world!\nBye file world.\nThe Life of Brian'
Most files are opened using the sorts of calls we just ran,
but open
actually allows up to
three arguments for more specific processing needs—the filename,
the open mode, and a buffer size. All but the first of these are
optional: if omitted, the open mode argument defaults to r
(input), and the buffer size policy is
to enable buffering on most platforms. Here are a few things you
should know about all three open
arguments:
- Filename

As mentioned earlier, filenames can include an explicit directory path to refer to files in arbitrary places on your computer; if they do not, they are taken to be names relative to the current working directory (described in the prior chapter). In general, any filename form you can type in your system shell will work in an open call. For instance, a filename argument r'..\temp\spam.txt' on Windows means spam.txt in the temp subdirectory of the current working directory’s parent—up one, and down to directory temp.

- Open mode

The open function accepts other modes too, some of which are not demonstrated in this book (e.g., r+, w+, and a+ to open for updating, and any mode string with a b to designate binary mode). For instance, mode r+ means both reads and writes are allowed on an existing file; w+ allows reads and writes but creates the file anew, erasing any prior content; and wb writes data in binary mode (more on this in the next section; see also the sketch after this list). Generally, whatever you could use as a mode string in the C language’s fopen call on your platform will work in the Python open function, since it really just calls fopen internally. (If you don’t know C, don’t sweat this point.) Notice that the contents of files are always strings in Python programs, regardless of mode: read methods return a string, and we pass a string to write methods.

- Buffer size

The open call also takes an optional third buffer size argument, which lets you control stdio buffering for the file—the way that data is queued up before being transferred to boost performance. If passed, 0 means file operations are unbuffered (data is transferred immediately), 1 means they are line buffered, any other positive value means to use a buffer of approximately that size, and a negative value means to use the system default (which you get if no third argument is passed and which generally means buffering is enabled). The buffer size argument works on most platforms, but it is currently ignored on platforms that don’t provide the setvbuf system call.
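Here is how the extra arguments might look in practice (a minimal sketch; the filenames are just for illustration, and whether a size of 1 yields true line buffering depends on your platform's stdio):

file = open('log.txt', 'a', 1)        # append mode, line-buffered writes
file = open('data.bin', 'rb', 0)      # binary input, unbuffered

file = open('data.txt', 'r+')         # read and write an existing file
file.seek(0)                          # position before switching to writes
file.write('X')                       # overwrite the first byte in place
file.close()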
All of the preceding examples process simple text
files. Python scripts can also open and process files containing
binary data—JPEG images, audio clips, packed
binary data produced by FORTRAN and C programs, and anything else
that can be stored in files. The primary difference in terms of
your code is the mode argument passed to the
built-in open
function:
>>> file = open('data.txt', 'wb')      # open binary output file
>>> file = open('data.txt', 'rb')      # open binary input file
Once you’ve opened binary files in this way, you may read
and write their contents using the same methods just illustrated:
read
, write
, and so on. (readline
and readlines
don’t make sense here, though:
binary data isn’t line oriented.)
In all cases, data transferred between files and your
programs is represented as Python strings
within scripts, even if it is binary data. This works because
Python string objects can always contain character bytes of any
value (though some may look odd if printed). Interestingly, even a
byte of value zero can be embedded in a Python string; it’s called
\0
in escape-code notation and
does not terminate strings in Python as it typically does in C.
For instance:
>>> data = 'a\0b\0c'
>>> data
'a\x00b\x00c'
>>> len(data)
5
Instead of relying on a terminator character, Python keeps
track of a string’s length explicitly. Here, data
references a string of length 5
that happens to contain two zero-value bytes; they print in
hexadecimal escape sequence form as \x00
(Python uses escapes to display all
nonprintable characters). Because no character codes are reserved,
it’s OK to read binary data with zero bytes (and other values)
into a string in Python.
Strictly speaking, on some platforms you may not
need the b
at the end of the
open mode argument to process binary files; the b
is simply ignored, so modes r
and w
work just as well. In fact, the
b
in mode flag strings is
usually required only for binary files on Windows. To understand
why, though, you need to know how lines are terminated in text
files.
For historical reasons, the end of a line of text in a file
is represented by different characters on different platforms:
it’s a single \n
character on
Unix and Linux, but the two-character sequence \r\n
on Windows.[*] That’s why files moved between Linux and Windows may
look odd in your text editor after transfer—they may still be
stored using the original platform’s end-of-line convention. For
example, most Windows editors handle text in Unix format, but
Notepad is a notable exception—text files copied from Unix or
Linux usually look like one long line when viewed in Notepad, with
strange characters inside (\n
).
Similarly, transferring a file from Windows to Unix in binary mode
retains the \r
characters
(which usually appear as ^M
in
text editors).
Python scripts don’t normally have to care, because the
Windows port (actually, the underlying C compiler on Windows)
automatically maps the DOS \r\n
sequence to a single \n
. It
works like this—when scripts are run on Windows:
- For files opened in text mode, \r\n is translated to \n when input.

- For files opened in text mode, \n is translated to \r\n when output.

- For files opened in binary mode, no translation occurs on input or output.

On Unix-like platforms, no translations occur, regardless of open modes.
You should keep in mind two important consequences of all of
these rules. First, the end-of-line character is almost always
represented as a single \n
in
all Python scripts, regardless of how it is stored in external
files on the underlying platform. By mapping to and from \n
on input and output, the Windows port
hides the platform-specific difference.
The second consequence of the mapping is subtler: if you
mean to process binary data files on Windows,
you generally must be careful to open those files in binary mode
(rb
, wb
), not in text mode (r
, w
). Otherwise, the translations listed
previously could very well corrupt data as it is input or output.
It’s not impossible that binary data would by chance contain bytes
with values the same as the DOS end-of-line characters, \r
and \n
. If you process such binary files in
text mode on Windows, \r
bytes may be incorrectly discarded
when read and \n
bytes may be
erroneously expanded to \r\n
when written. The net effect is that your binary data will be
trashed when read and written—probably not quite what you want!
For example, on Windows:
>>> len('a\0b\rc\r\nd')                              # 4 escape code bytes
8
>>> open('temp.bin', 'wb').write('a\0b\rc\r\nd')     # write binary data to file
>>> open('temp.bin', 'rb').read()                    # intact if read as binary
'a\x00b\rc\r\nd'
>>> open('temp.bin', 'r').read()                     # loses a \r in text mode!
'a\x00b\rc\nd'
>>> open('temp.bin', 'w').write('a\0b\rc\r\nd')      # adds a \r in text mode!
>>> open('temp.bin', 'rb').read()
'a\x00b\rc\r\r\nd'
This is an issue only when running on Windows, but using
binary open modes rb
and
wb
for binary files everywhere
won’t hurt on other platforms and will help make your scripts more
portable (you never know when a Unix utility may wind up seeing
action on your Windows machine).
You may want to use binary file open modes at other times as
well. For instance, in Chapter
7, we’ll meet a script called fixeoln_one
that translates between DOS
and Unix end-of-line character conventions in text files. Such a
script also has to open text files in
binary mode to see what end-of-line
characters are truly present on the file; in text mode, they would
already be translated to \n
by
the time they reached the script.
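A minimal sketch of that idea follows (this helper is hypothetical, not the book's fixeoln_one script): by reading in binary mode, DOS \r\n sequences are left intact and can be counted directly.

def count_eols(filename):
    data = open(filename, 'rb').read()    # binary mode: no translations
    dos = data.count('\r\n')              # DOS-style line ends
    unix = data.count('\n') - dos         # bare Unix-style line ends
    return dos, unix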
By using the letter b in the
open
call, you can open binary
datafiles in a platform-neutral way and read and write their
content with normal file object methods. But how do you process
binary data once it has been read? It will be returned to your
script as a simple string of bytes, most of which are not
printable characters (that’s why Python displays them with
\xNN
hexadecimal escape
sequences).
If you just need to pass binary data along to another file
or program, your work is done. And if you just need to extract a
number of bytes from a specific position, string slicing will do
the job. To get at the deeper contents of binary data, though, as
well as to construct its contents, the standard library struct
module is more powerful.
The struct
module
provides calls to pack and unpack binary data, as though the data
was laid out in a C-language struct
declaration. It is also capable
of composing and decomposing using any endian-ness you desire
(endian-ness determines whether the most significant bits are on
the left or on the right). Building a binary datafile, for
instance, is straightforward: pack Python values into a string and
write them to a file. The format string here in the pack
call means big-endian (>
), with an integer, four-character
string, half integer, and float:
>>> import struct
>>> data = struct.pack('>i4shf', 2, 'spam', 3, 1.234)
>>> data
'\x00\x00\x00\x02spam\x00\x03?\x9d\xf3\xb6'
>>> file = open('data.bin', 'wb')
>>> file.write(data)
>>> file.close()
As usual, Python displays here most of the packed binary
data’s bytes with \xNN
hexadecimal escape sequences, because the bytes are not printable
characters. To parse data like that which we just produced, read
it off the file and pass it to the struct
module with the same format
string; you get back a tuple containing the values parsed out of
the string and converted to Python objects:
>>> import struct
>>> file = open('data.bin', 'rb')
>>> bytes = file.read()
>>> values = struct.unpack('>i4shf', bytes)
>>> values
(2, 'spam', 3, 1.2339999675750732)
For more details, see the struct
module’s entry in the Python
library manual. Also note that slicing comes in handy in this
domain; to grab just the four-character string in the middle of
the packed binary data we just read, we can simply slice it out.
Numeric values could similarly be sliced out and then passed to
struct.unpack
for
conversion:
>>> bytes
'\x00\x00\x00\x02spam\x00\x03?\x9d\xf3\xb6'
>>> string = bytes[4:8]
>>> string
'spam'
>>> number = bytes[8:10]
>>> number
'\x00\x03'
>>> struct.unpack('>h', number)
(3,)
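Rather than counting bytes by hand, you can also compute slice offsets with the struct module's calcsize function, which returns the size in bytes of a format string; a short sketch:

>>> struct.calcsize('>i')          # bytes before the 4-char string field
4
>>> struct.calcsize('>i4s')        # bytes before the half integer field
8
>>> bytes[4:4 + struct.calcsize('4s')]
'spam'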
The os
module
contains an additional set of file-processing functions that are
distinct from the built-in file object tools
demonstrated in previous examples. For instance, here is a very
partial list of os
file-related
calls:
os.open(path, flags, mode)
Opens a file and returns its descriptor

os.read(descriptor, N)
Reads at most N bytes and returns a string

os.write(descriptor, string)
Writes bytes in string to the file

os.lseek(descriptor, position, how)
Moves to position in the file (how gives the reference point, like the file seek method's second argument)
Technically, os
calls
process files by their descriptors, which are integer codes or “handles” that identify
files in the operating system. Because the descriptor-based file
tools in os
are lower
level and more complex than the built-in file objects created with
the built-in open
function, you
should generally use the latter for all but very special
file-processing needs.[*]
To give you the general flavor of this tool set, though, let’s
run a few interactive experiments. Although built-in file objects
and os
module descriptor files
are processed with distinct tool sets, they are in fact related—the
stdio
filesystem used by file
objects simply adds a layer of logic on top of descriptor-based
files.
In fact, the fileno
file
object method returns the integer descriptor associated with a
built-in file object. For instance, the standard stream file objects
have descriptors 0, 1, and 2; calling the os.write
function to send data to stdout
by descriptor has the same effect
as calling the sys.stdout.write
method:
>>> import sys
>>> for stream in (sys.stdin, sys.stdout, sys.stderr):
...     print stream.fileno(),
...
0 1 2
>>> sys.stdout.write('Hello stdio world\n')      # write via file method
Hello stdio world
>>> import os
>>> os.write(1, 'Hello descriptor world\n')      # write via os module
Hello descriptor world
23
Because file objects we open explicitly behave the same way,
it’s also possible to process a given real external file on the
underlying computer through the built-in open
function, tools in the os
module, or both:
>>> file = open(r'C:\temp\spam.txt', 'w')        # create external file
>>> file.write('Hello stdio file\n')             # write via file method
>>>
>>> fd = file.fileno()
>>> print fd
3
>>> os.write(fd, 'Hello descriptor file\n')      # write via os module
22
>>> file.close()
>>>

C:\WINDOWS>type c:\temp\spam.txt                 # both writes show up
Hello descriptor file
Hello stdio file
So why the extra file tools in os
? In short, they give more low-level
control over file processing. The built-in open
function is easy to use but is
limited by the underlying stdio
filesystem that it wraps; buffering, open modes, and so on, are all per-stdio
defaults.[*] The os
module
lets scripts be more specific—for example, the following opens a
descriptor-based file in read-write and binary modes by performing
a binary “or” on two mode flags exported by os
:
>>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> os.read(fdfile, 20)
'Hello descriptor fil'
>>> os.lseek(fdfile, 0, 0)               # go back to start of file
0
>>> os.read(fdfile, 100)                 # binary mode retains "\r\n"
'Hello descriptor file\r\nHello stdio file\r\n'
>>> os.lseek(fdfile, 0, 0)
0
>>> os.write(fdfile, 'HELLO')            # overwrite first 5 bytes
5
On some systems, such open flags let us specify more
advanced things like exclusive access
(O_EXCL
) and
nonblocking modes (O_NONBLOCK
) when a file is opened. Some
of these flags are not portable across platforms (another reason
to use built-in file objects most of the time); see the library
manual or run a dir(os)
call on
your machine for an exhaustive list of other open flags
available.
We saw earlier how to go from file object to file
descriptor with the fileno file
file
method; we can also go the other way—the os.fdopen
call wraps a file descriptor
in a file object. Because conversions work both ways, we can
generally use either tool set—file object or os
module:
>>> objfile = os.fdopen(fdfile)
>>> objfile.seek(0)
>>> objfile.read()
'HELLO descriptor file\r\nHello stdio file\r\n'
Tip
Using os.open
with the
O_EXCL
flag is the most
portable way to lock files for concurrent updates or other
process synchronization in Python today. Another module,
fcntl
, also provides
file-locking tools but is not as widely available across
platforms. As of this writing, locking with os.open
is supported in Windows, Unix,
and Macintosh; fcntl
works
only on Unix.
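A minimal sketch of that locking idiom (the lock filename and helper names here are illustrative assumptions): because O_EXCL makes os.open fail if the file already exists, only one process at a time can create it.

import os

def acquire_lock(lockname='app.lock'):
    try:
        return os.open(lockname, os.O_CREAT | os.O_EXCL | os.O_RDWR)
    except OSError:
        return None                   # another process holds the lock

def release_lock(fd, lockname='app.lock'):
    os.close(fd)
    os.remove(lockname)               # allow the next process in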
The os
module also
includes an assortment of file tools that accept a file pathname
string and accomplish file-related tasks such as renaming
(os.rename
), deleting (os.remove
), and changing the file’s
owner and permission settings (os.chown
, os.chmod
). Let’s step through a few
examples of these tools in action:
>>> os.chmod('spam.txt', 0777)           # enable all accesses
This os.chmod
file
permissions call passes a 9-bit string composed of three sets of
three bits each. From left to right, the three sets represent the
file’s owning user, the file’s group, and all others. Within each
set, the three bits reflect read, write, and execute access
permissions. When a bit is “1” in this string, it means that the
corresponding operation is allowed for the accessor. For instance,
octal 0777 is a string of nine “1” bits in binary, so it enables
all three kinds of accesses for all three user groups; octal 0600
means that the file can be read and written only by the user that
owns it (when written in binary, 0600 octal is really bits 110 000
000).
This scheme stems from Unix file permission settings, but it
works on Windows as well. If it’s puzzling, either check a Unix
manpage for chmod or see the fixreadonly
example in Chapter 7 for a practical
application (it makes read-only files that are copied off a CD-ROM
writable).
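If octal strings seem cryptic, the stat module also exports symbolic permission constants that can be combined with a bitwise or and passed to os.chmod instead; a short sketch:

>>> import os, stat
>>> os.chmod('spam.txt', stat.S_IRUSR | stat.S_IWUSR)   # same as octal 0600
>>> os.chmod('spam.txt', 0777)                          # back to all accesses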
>>> os.rename(r'C:\temp\spam.txt', r'C:\temp\eggs.txt')   # (from, to)
>>>
>>> os.remove(r'C:\temp\spam.txt')                        # delete file
Traceback (innermost last):
  File "<stdin>", line 1, in ?
OSError: [Errno 2] No such file or directory: 'C:\\temp\\spam.txt'
>>>
>>> os.remove(r'C:\temp\eggs.txt')
The os.rename
call used
here changes a file’s name; the os.remove
file deletion call deletes a
file from your system and is synonymous with os.unlink
(the latter reflects the
call’s name on Unix but was obscure to users of other platforms).
The os
module also exports the
stat
system call:
>>> import os
>>> info = os.stat(r'C:\temp\spam.txt')
>>> info
(33206, 0, 2, 1, 0, 0, 41, 968133600, 968176258, 968176193)
>>> import stat
>>> info[stat.ST_MODE], info[stat.ST_SIZE]
(33206, 41)
>>> mode = info[stat.ST_MODE]
>>> stat.S_ISDIR(mode), stat.S_ISREG(mode)
(0, 1)
The os.stat
call returns
a tuple of values giving low-level information about the named
file, and the stat
module
exports constants and functions for querying this information in a
portable way. For instance, indexing an os.stat
result on offset stat.ST_SIZE
returns the file’s size,
and calling stat.S_ISDIR
with
the mode item from an os.stat
result checks whether the file is a directory. As shown earlier,
though, both of these operations are available in the os.path
module too, so it’s rarely
necessary to use os.stat
except
for low-level file queries:
>>> path = r'C:\temp\spam.txt'
>>> os.path.isdir(path), os.path.isfile(path), os.path.getsize(path)
(0, 1, 41)
Unlike some shell-tool languages, Python doesn’t have an implicit file-scanning loop procedure, but it’s simple to write a general one that we can reuse for all time. The module in Example 4-1 defines a general file-scanning routine, which simply applies a passed-in Python function to each line in an external file.
Example 4-1. PP3E\System\Filetools\scanfile.py
def scanner(name, function):
    file = open(name, 'r')              # create a file object
    while 1:
        line = file.readline()          # call file methods
        if not line: break              # until end-of-file
        function(line)                  # call a function object
    file.close()
The scanner
function
doesn’t care what line-processing function is passed in, and that
accounts for most of its generality—it is happy to apply
any single-argument function that exists now or
in the future to all of the lines in a text file. If we code this
module and put it in a directory on PYTHONPATH
, we can use it any time we need
to step through a file line by line. Example 4-2 is a client script
that does simple line translations.
Example 4-2. PP3E\System\Filetools\commands.py
#!/usr/local/bin/python
from sys import argv
from scanfile import scanner

class UnknownCommand(Exception): pass

def processLine(line):                      # define a function
    if line[0] == '*':                      # applied to each line
        print "Ms.", line[1:-1]
    elif line[0] == '+':
        print "Mr.", line[1:-1]             # strip first and last char: \n
    else:
        raise UnknownCommand, line          # raise an exception

filename = 'data.txt'
if len(argv) == 2: filename = argv[1]       # allow filename cmd arg
scanner(filename, processLine)              # start the scanner
The text file hillbillies.txt contains the following lines:
*Granny
+Jethro
*Elly May
+"Uncle Jed"
and our commands script could be run as follows:
C:\...\PP3E\System\Filetools>python commands.py hillbillies.txt
Ms. Granny
Mr. Jethro
Ms. Elly May
Mr. "Uncle Jed"
Notice that we could also code the command processor in the
following way; especially if the number of command options starts to
become large, such a data-driven approach may be more concise and
easier to maintain than a large if
statement with essentially redundant
actions (if you ever have to change the way output lines print,
you’ll have to change it in only one place with this form):
commands = {'*': 'Ms.', '+': 'Mr.'}         # data is easier to expand than code?

def processLine(line):
    try:
        print commands[line[0]], line[1:-1]
    except KeyError:
        raise UnknownCommand, line
As a rule of thumb, we can also usually speed things up by
shifting processing from Python code to built-in tools. For
instance, if we’re concerned with speed (and memory space isn’t
tight), we can make our file scanner faster by using the readlines
method to load the file into a
list all at once instead of using the manual readline
loop in Example 4-1:
def scanner(name, function):
    file = open(name, 'r')                  # create a file object
    for line in file.readlines():           # get all lines at once
        function(line)                      # call a function object
    file.close()
A file iterator will do the same work but will not load the entire file into memory all at once:
def scanner(name, function):
    for line in open(name, 'r'):            # scan line by line
        function(line)                      # call a function object
And if we have a list of lines, we can work more magic with
the map
built-in function or list
comprehension expression. Here are two minimalist’s versions; the
for
loop is replaced by map
or a comprehension, and we let Python
close the file for us when it is garbage collected or the script
exits (both of these build a temporary list of results along the
way, which is likely trivial for all but the largest of
files):
def scanner(name, function):
    map(function, open(name, 'r'))

def scanner(name, function):
    [function(line) for line in open(name, 'r')]
But what if we also want to change a file while scanning it? Example 4-3 shows two approaches: one uses explicit files, and the other uses the standard input/output streams to allow for redirection on the command line.
Example 4-3. PP3E\System\Filetools\filters.py
def filter_files(name, function):           # filter file through function
    input = open(name, 'r')                 # create file objects
    output = open(name + '.out', 'w')       # explicit output file too
    for line in input:
        output.write(function(line))        # write the modified line
    input.close()
    output.close()                          # output has a '.out' suffix

def filter_stream(function):
    import sys                              # no explicit files
    while 1:                                # use standard streams
        line = sys.stdin.readline()         # or: raw_input()
        if not line: break
        print function(line),               # or: sys.stdout.write()

if __name__ == '__main__':
    filter_stream(lambda line: line)        # copy stdin to stdout if run
Since the standard streams are preopened for us, they’re often
easier to use. This module is more useful when imported as a library
(clients provide the line-processing function); when run standalone
it simply parrots stdin
to
stdout
:
C:\...\PP3E\System\Filetools>python filters.py < ..\System.txt
This directory contains operating system interface examples.
Many of the examples in this unit appear elsewhere in the examples
distribution tree, because they are actually used to manage other
programs. See the README.txt files in the subdirectories here
for pointers.
Tip
Brutally observant readers may notice that this last file is
named filters.py (with an
s), not filter.py. I
originally named it the latter but changed its name when I
realized that a simple import of the filename (e.g., “import
filter”) assigns the module to a local name “filter,” thereby
hiding the built-in filter
function. This is a built-in functional programming tool that is
not used very often in typical scripts. And as mentioned earlier,
redefining built-in names this way is not an issue unless you
really need to use the built-in version of the name. But as a
general rule of thumb, be careful to avoid picking built-in names
for module files. I will if you will.
[*] Technically, you can use the name file
anywhere you use open
, though open
is still the generally preferred
call unless you are subclassing to customize files. We’ll use
open
in most of our examples.
As for all built-in names, it’s OK to use the name file
for your own variables as long as
you don’t need direct access to the built-in file datatype (your
file
name will hide the
built-in scope’s file
). In
fact, this is such a common practice that we’ll frequently follow
it here. This is not a sin, but you should generally be careful
about reusing built-in names in this way.
[*] This is so useful that I was able to remove an entire
section from this chapter in this edition, which wrapped a
file object in a class to allow iteration over lines in a
for
loop. In fact, that
example became completely superfluous and no longer worked as
described after the second edition of this book. Technically,
its __getitem__ indexing
overload method was never called anymore because for
loops now look for a file
object’s __iter__
iteration method first. You don’t have to know what that
means, because iteration is a core feature of file objects
today.
[*] Actually, it gets worse: on the classic Mac, lines in
text files are terminated with a single \r
(not \n
or \r\n
). The more modern Mac is a
Unix-based machine and normally follows that platform’s
conventions instead. Whoever said proprietary software was
good for the consumer probably wasn’t speaking about users of
multiple platforms, and certainly wasn’t talking about
programmers.
[*] For instance, to process pipes,
described in Chapter 5. The
Python pipe call returns two file descriptors, which can be
processed with os
module
tools or wrapped in a file object with os.fdopen
.
[*] To be fair to the built-in file object, the open
function accepts an rb+
mode, which is equivalent to the
combined mode flags used here and can also be made nonbuffered
with a buffer size argument. Whenever possible, use open
, not os.open
.