Retrieving a Line at Random from a File of Unknown Size
Credit: Richard Papworth
Problem
You have a file of unknown (but potentially very large) size, and you need to select one line at random from it, with a single pass on the file.
Solution
We do need to read the whole file, but we don’t have to read it all at once:
import random
def randomLine(file_object):
"Retrieve a random line from a file, reading through the file once"
lineNum = 0
selected_line = ''
while 1:
aLine = file_object.readline( )
if not aLine: break
lineNum = lineNum + 1
# How likely is it that this is the last line of the file?
if random.uniform(0,lineNum)<1:
selected_line = aLine
file_object.close( )
return selected_lineDiscussion
Of course, a more obvious approach would be:
random.choice(file_object.readlines( ))
But that requires reading the whole file into memory at once, which can be a problem for truly enormous files.
This recipe works thanks to an unobvious but not terribly deep
theorem: when we have seen the first N lines in
the file, there is a probability of exactly 1/N
that each of them is the one selected so far (i.e., the one to which
the selected_line variable refers at this point).
This is easy to see for the last line (the one just read into the
aLine variable), because we explicitly choose it
with a probability of 1.0/lineNum. The general
validity of the theorem follows by induction. If it was true after
the first N-1 lines were read,
it’s clearly still true after the
Nth one is read. As we select ...