Python Cookbook

by Alex Martelli, David Ascher

July 2002

Intermediate to advanced

608 pages

15h 46m

English

O'Reilly Media, Inc.

Processing Every Word in a File

Credit: Luther Blissett

Problem

You need to do something to every word in a file, similar to the foreach function of csh.

Solution

This is best handled by two nested loops, one on lines and one on the words in each line:

for line in open(thefilepath).xreadlines():
    for word in line.split():
        dosomethingwith(word)

This implicitly defines words as sequences of non-whitespace characters separated by runs of whitespace (just as the Unix program wc does). For other definitions of words, you can use regular expressions. For example:

import re
re_word = re.compile(r'[\w-]+')

for line in open(thefilepath).xreadlines():
    for word in re_word.findall(line):
        dosomethingwith(word)

In this case, a word is defined as a maximal sequence of alphanumerics, underscores, and hyphens (the \w class matches underscores as well as letters and digits).
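To make the difference between the two definitions concrete, here is a minimal sketch that applies both to the same string. It is written for current Python 3 rather than the Python 2.2 of the recipe, and the sample string is made up purely for illustration.

```python
import re

# Same pattern as in the recipe: runs of word characters and hyphens.
re_word = re.compile(r'[\w-]+')

# Hypothetical sample text, chosen to show where the two definitions differ.
sample = "well-known words, e.g. foo_bar; end."

# Whitespace splitting keeps punctuation attached to the words.
by_whitespace = sample.split()
# → ['well-known', 'words,', 'e.g.', 'foo_bar;', 'end.']

# The regular expression drops punctuation but keeps hyphens and underscores.
by_regex = re_word.findall(sample)
# → ['well-known', 'words', 'e', 'g', 'foo_bar', 'end']

print(by_whitespace)
print(by_regex)
```

Note that the regular expression splits "e.g." into two one-letter words, which may or may not be what you want.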

Discussion

Other definitions of words require different regular expressions. The outer loop, over all lines in the file, can be written in several ways. The xreadlines method is good, but you can also loop over the list returned by the readlines method (which reads the whole file into memory at once), use the standard library module fileinput, or, in Python 2.2, even just:

for line in open(thefilepath):

which is simplest and fastest.
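As a rough illustration of these alternatives, the following sketch reads the same file through readlines, fileinput, and direct iteration and collects the lines each one produces. It assumes current Python 3 (where xreadlines no longer exists and the with statement closes the file for you); the helper names and the temporary demo file are hypothetical, introduced only for this example.

```python
import fileinput
import os
import tempfile

def lines_via_readlines(path):
    # Reads the entire file into a list in one go.
    with open(path) as f:
        return f.readlines()

def lines_via_fileinput(path):
    # fileinput can chain several files; here it is given just one.
    with fileinput.input(files=[path]) as f:
        return list(f)

def lines_via_iteration(path):
    # Iterating over the file object yields one line at a time.
    with open(path) as f:
        return [line for line in f]

# Create a small throwaway file so the sketch is self-contained.
fd, demo_path = tempfile.mkstemp(text=True)
with os.fdopen(fd, "w") as f:
    f.write("one two\nthree\n")

try:
    results = [
        reader(demo_path)
        for reader in (lines_via_readlines, lines_via_fileinput, lines_via_iteration)
    ]
finally:
    os.remove(demo_path)

print(results)
```

All three approaches yield the same lines; they differ mainly in memory use and in how much setup they need.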

In Python 2.2, it’s often a good idea to wrap iterations as iterator objects, most commonly by simple generators:

from __future__ import generators

def words_of_file(thefilepath):
    for line in open(thefilepath):
        for word in line.split():
            yield word

for word in words_of_file(thefilepath):
    dosomethingwith(word)

This approach lets you ...
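One way to see what the generator buys you is to hand it to code that accepts any iterable. The sketch below is an assumption-laden illustration, not part of the recipe: it targets current Python 3 (no __future__ import needed), uses a with statement to close the file, and feeds the words to collections.Counter over a made-up temporary file.

```python
import os
import tempfile
from collections import Counter

def words_of_file(thefilepath):
    # Same nested loops as the recipe, wrapped as a generator.
    with open(thefilepath) as f:
        for line in f:
            for word in line.split():
                yield word

# Hypothetical demo file, created only so the sketch runs on its own.
fd, demo_path = tempfile.mkstemp(text=True)
with os.fdopen(fd, "w") as f:
    f.write("the cat and the hat\nthe end\n")

try:
    # Counter consumes the generator one word at a time.
    counts = Counter(words_of_file(demo_path))
finally:
    os.remove(demo_path)

print(counts.most_common(1))  # → [('the', 3)]
```

The caller never sees lines at all; the word-level loop is the only one left in client code.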