Processing Every Word in a File
Credit: Luther Blissett
Problem
You need to do something to every word in a
file, similar to the foreach function of
csh.
Solution
This is best handled by two nested loops, one on lines and one on the words in each line:
    for line in open(thefilepath).xreadlines():
        for word in line.split():
            dosomethingwith(word)
This implicitly defines words as sequences of non-whitespace characters separated by
runs of whitespace (just as the Unix program wc
does). For other definitions of words, you can use regular
expressions. For example:
    import re
    re_word = re.compile(r'[\w-]+')
    for line in open(thefilepath).xreadlines():
        for word in re_word.findall(line):
            dosomethingwith(word)
In this case, a word is defined as a maximal sequence of alphanumerics, underscores, and hyphens.
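To see how the two definitions of a word differ, here is a minimal sketch, written for Python 3, that applies both to the same line. The sample string and the helper names split_words and regex_words are made up for illustration and are not part of the recipe:

```python
import re

# Same pattern as in the recipe: runs of word characters and hyphens.
re_word = re.compile(r'[\w-]+')

def split_words(line):
    """Words as runs of non-whitespace characters (hypothetical helper)."""
    return line.split()

def regex_words(line):
    """Words as maximal matches of re_word (hypothetical helper)."""
    return re_word.findall(line)

# Illustrative input, not taken from the recipe.
sample = "A well-known fact: words, like these, vary!\n"

print(split_words(sample))
print(regex_words(sample))
```

With plain split, punctuation stays attached to the neighboring word (fact:, words,), while the regular expression drops the punctuation but keeps the hyphen inside well-known.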
Discussion
For other definitions of words, you will need different
regular expressions. The outer loop, on all lines in the file, can
be written in several ways. The xreadlines method
is good, but you can also use the list obtained by the
readlines method, the standard library module
fileinput, or, in Python 2.2, even just:
    for line in open(thefilepath):
which is simplest and fastest.
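As a rough illustration of these alternatives, the sketch below runs the same word loop over three different line sources. It assumes Python 3, where xreadlines no longer exists, so that variant is left out. The function names, the temporary file, and its contents are invented for the example:

```python
import fileinput
import os
import tempfile

def words_via_iteration(path):
    """Iterate directly over the file object, one line at a time."""
    words = []
    with open(path) as f:
        for line in f:
            words.extend(line.split())
    return words

def words_via_readlines(path):
    """Read all lines into a list first, then loop over the list."""
    with open(path) as f:
        lines = f.readlines()
    return [word for line in lines for word in line.split()]

def words_via_fileinput(path):
    """Let the fileinput module supply the lines."""
    words = []
    with fileinput.input(files=(path,)) as lines:
        for line in lines:
            words.extend(line.split())
    return words

# Demonstration on a throwaway file (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w") as f:
        f.write("one two\nthree  four five\n")
    results = [fn(demo_path) for fn in
               (words_via_iteration, words_via_readlines, words_via_fileinput)]

print(results[0])
```

All three produce the same list of words. The readlines variant holds the whole file in memory at once, which matters only for large files.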
In Python 2.2, it’s often a good idea to wrap iterations as iterator objects, most commonly by simple generators:
    from __future__ import generators
    def words_of_file(thefilepath):
        for line in open(thefilepath):
            for word in line.split():
                yield word
    for word in words_of_file(thefilepath):
        dosomethingwith(word)
This approach lets you ...