Chapter 1. Scaling with Generators
This for loop seems simple:

for item in items:
    do_something_with(item)
And yet, miracles hide here. As you probably know, going through a collection one element at a time is called iteration. Few understand how Python’s iteration system really works and appreciate how deep and well-thought-out it is. This chapter makes you one of those people. You gain the ability to write highly scalable Python applications, which can handle ever-larger data sets in performant, memory-efficient ways.
Iteration is also core to one of Python’s most powerful tools: the generator function. Generator functions are not just a convenient way to create useful iterators. They enable some exquisite patterns of code organization, in a way that—by their very nature—intrinsically encourage excellent coding habits.
This chapter is special, because understanding it threatens to make you a permanently better programmer in every language. Mastering Python generators tends to do that, because of the distinctions and insights you gain along the way. Let’s dive in.
Iteration in Python
Python has a built-in function called iter()
. When you pass it a
collection, you get back an iterator object:
>>> numbers = [7, 4, 11, 3]
>>> iter(numbers)
<list_iterator object at 0x10219dc50>
An iterator is an object producing the values in a sequence, one at a time:
>>> numbers_iter = iter(numbers)
>>> for num in numbers_iter:
...     print(num)
7
4
11
3
You don’t normally need to use iter()
. If you instead write for num
in numbers
, what Python effectively does under the hood is call
iter()
on that collection. This happens automatically. Whatever
object it gets back is used as the iterator for that for
loop:
# This...
for num in numbers:
    print(num)

# ... is effectively just like this:
numbers_iter = iter(numbers)
for num in numbers_iter:
    print(num)
An iterator over a collection is a separate object, with its own
identity—which you can verify with id()
:
>>> # id() returns a unique number for each object.
... # Different objects will always have different IDs.
>>> id(numbers)
4330133896
>>> id(numbers_iter)
4330216640
How does iter()
actually get the iterator? It can do this in several
ways, but one relies on a magic method called __iter__()
. This is a
method any class (including yours) may define, and it is called with
no arguments. Each time, it produces a fresh iterator object. Lists have this method, for example:
>>> numbers.__iter__
<method-wrapper '__iter__' of list object at 0x10130e4c8>
>>> numbers.__iter__()
<list_iterator object at 0x1013180f0>
Python makes a distinction between objects
which are iterators, and objects which are iterable. We say an
object is iterable if and only if you can pass it to iter()
and
get back a ready-to-use iterator. If that object has an __iter__()
method, iter()
will call it to get the iterator. Python lists and
tuples are iterable. So are strings, which is why you can write for
char in my_str:
to iterate over my_str
’s characters. Any
container you might use in a for
loop is iterable.
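To see the distinction at the prompt, here is a small illustrative transcript of my own (the hexadecimal addresses will differ on your machine): passing an iterable like a string to iter() hands back a ready-to-use iterator, while passing a non-iterable object raises a TypeError.

>>> iter("hat")
<str_iterator object at 0x101309e80>
>>> iter(42)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'int' object is not iterable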
A for
loop is the most common way to
step through a sequence. But sometimes your code needs to step through
in a finer-grained way. For this, use the built-in function
next()
. You normally call it with a single argument, which is an
iterator. Each time you call it, next(my_iterator)
fetches and
returns the next element:
>>> names = ["Tom", "Shelly", "Garth"]
>>> # Create a fresh iterator...
>>> names_it = iter(names)
>>> next(names_it)
'Tom'
>>> next(names_it)
'Shelly'
>>> next(names_it)
'Garth'
What happens if you call next(names_it)
again? next()
will raise a
special built-in exception, called StopIteration
:
>>> next(names_it)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
Raising this specific exception is how an iterator signals that the
sequence is done. You rarely have to raise or catch this exception
yourself, though we’ll see some patterns later where it is useful to
do so. A good mental model for how a for
loop works is to imagine it
calling next()
each time through the loop, exiting when
StopIteration
gets raised.
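To make that mental model concrete, here is a rough sketch of what a for loop does behind the scenes. (do_something_with() is a stand-in for the loop body, as at the start of the chapter; this is a simplification, not the interpreter's actual machinery.)

items_iter = iter(items)
while True:
    try:
        item = next(items_iter)
    except StopIteration:
        break
    do_something_with(item)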
When using next()
yourself, you can provide a second argument, for
the default value. If you do, next()
will return that instead of
raising StopIteration
at the end:
>>> names = ["Tom", "Shelly", "Garth"]
>>> new_names_it = iter(names)
>>> next(new_names_it, "Rick")
'Tom'
>>> next(new_names_it, "Rick")
'Shelly'
>>> next(new_names_it, "Rick")
'Garth'
>>> next(new_names_it, "Rick")
'Rick'
>>> next(new_names_it)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>> next(new_names_it, "Jane")
'Jane'
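One way to put that default argument to work, in a sketch of my own rather than from the text above, is to drain an iterator in a while loop, using a unique sentinel object as the default so no legitimate value can be mistaken for "end of sequence":

names_it = iter(["Tom", "Shelly", "Garth"])
_END = object()   # sentinel: no real element will ever be this exact object
while True:
    name = next(names_it, _END)
    if name is _END:
        break
    print(name)

We will meet this sentinel idea again near the end of the chapter, when reimplementing next() itself.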
Consider a different situation. What if you aren’t working with a simple sequence of numbers? What if you are calculating or reading or otherwise obtaining the sequence elements as you go along?
Let’s start with a simple example. Suppose you need to write a function creating a list of square numbers which will be processed by other code:
def fetch_squares(max_root):
    squares = []
    for n in range(max_root):
        squares.append(n ** 2)
    return squares

MAX = 5
for square in fetch_squares(MAX):
    do_something_with(square)
This works. But there are potential problems lurking here. Can you spot any?
Here’s one: what if MAX
is not 5, but 10,000,000? Or
10,000,000,000? Or more? Your memory footprint will be pointlessly
dreadful: the code here creates a massive list, uses it once, then
throws it away. On top of that, the consuming for
loop cannot even
start until the entire list of squares has been fully calculated.
If some poor human is using this program, they’ll wonder if it got
stuck.
Even worse: What if you are not doing arithmetic to get each element—which is fast and cheap—but making a truly expensive calculation? Or making an API call over the network? Or reading from a database? Your program will be sluggish, perhaps unresponsive, and might even crash with an out-of-memory error. Its users will think you’re a terrible programmer.
The solution is to create an iterator to start with, lazily computing each value only when needed. Then each cycle through the loop happens just in time.
So how do you do that? It turns out there are several ways. But the best way is called a generator function. And you’re going to love it!
Generator Functions
Python provides a tool called the generator function, which…well, it’s hard to describe everything it gives you in one sentence. Of its many talents, I’ll first focus on how it is a very useful shortcut for creating iterators.
A generator function looks a lot like a regular function. But instead
of saying return
, it uses a new and different keyword:
yield
. Here’s a simple example:
def gen_nums():
    n = 0
    while n < 4:
        yield n
        n += 1
Use it in a for
loop like this:
>>> for num in gen_nums():
...     print(num)
0
1
2
3
Let’s go through and understand this. When you call gen_nums()
like
a function, it immediately returns a generator object:
>>> sequence = gen_nums()
>>> type(sequence)
<class 'generator'>
The generator function is
gen_nums()
—what we define and then call. A function is a generator
function if and only if it uses “yield” instead of “return”. The
generator object is what that generator function returns when called—sequence
, in this case.
Memorize This Fact
A generator function will always return a generator object. It can never return anything else.
This generator object is an iterator, which means you can iterate
through it using next()
or a for
loop:
>>> sequence = gen_nums()
>>> next(sequence)
0
>>> next(sequence)
1
>>> next(sequence)
2
>>> next(sequence)
3
>>> next(sequence)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>> # Or in a for loop:
... for num in gen_nums():
...     print(num)
0
1
2
3
The flow of code works like this: when next()
is called the first
time, or the for
loop first starts, the body of gen_nums()
starts
executing at the beginning, returning the value to the right of the
yield
.
Advancing next()
So far, this is much like a regular function. But the next time
next()
is called—or, equivalently, the next time through the for
loop—the function does not start at the beginning again. It starts
on the line after the yield
statement. Look at the source of
gen_nums()
again:
def gen_nums():
    n = 0
    while n < 4:
        yield n
        n += 1
gen_nums()
is more general than a function or subroutine. It is
actually a coroutine. You see, a regular function can have several
exit points (otherwise known as return
statements). But it has only
one entry point: each time you call a function, it always starts on
the first line of the function body.
A coroutine is like a function, except it has several possible
entry points. It starts with the first line, like a normal
function. But when it “returns”, the coroutine is not exiting, so much
as pausing. Subsequent calls with next()
—or equivalently, the
next time through the for
loop—start at that yield
statement
again, right where it left off; the reentry point is the line after
the yield
statement.
And that’s the key: Each yield
statement simultaneously defines an exit
point, AND a reentry point.
For generator objects, each time a new value is requested,
the flow of control picks up on the line after the yield
statement. In this case, the next line increments the variable n
,
then continues with the while
loop.
Notice we do not raise StopIteration
anywhere in the body of
gen_nums()
. When the function body finally exits—after it exits
the while
loop, in this case—the generator object automatically
raises StopIteration
.
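You can watch this pause-and-resume behavior directly by putting print() calls around the yield statements. Here is a small experiment you might try at the prompt (the function name and messages are my own):

>>> def noisy_gen():
...     print("body starts")
...     yield 1
...     print("resumed after first yield")
...     yield 2
...     print("resumed after second yield; body about to end")
>>> it = noisy_gen()   # Nothing prints yet: the body has not started running.
>>> next(it)
body starts
1
>>> next(it)
resumed after first yield
2
>>> next(it)
resumed after second yield; body about to end
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration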
Again: each yield
statement simultaneously defines an exit point
and a reentry point. In fact, you can have multiple yield
statements in a generator:
def gen_extra_nums():
    n = 0
    while n < 4:
        yield n
        n += 1
    yield 42  # Second yield
Here’s the output you will get when you use it:
>>> for num in gen_extra_nums():
...     print(num)
0
1
2
3
42
The second yield
is reached after the while
loop exits. When the
function reaches the implicit return at the end, the iteration
stops. Reason through the code above, and convince yourself it makes
sense.
Converting to a Generator Function
Let’s revisit the earlier example of cycling through a sequence of squares. This is how we first did it:
def fetch_squares(max_root):
    squares = []
    for num in range(max_root):
        squares.append(num ** 2)
    return squares

MAX = 5
for square in fetch_squares(MAX):
    do_something_with(square)
As an exercise, pause here, open up a new Python file, and see if you
can write a gen_squares()
generator function that accomplishes the
same thing.
Done? Great. Here’s what it looks like:
def gen_squares(max_root):
    for num in range(max_root):
        yield num ** 2

>>> MAX = 5
>>> for square in gen_squares(MAX):
...     print(square)
0
1
4
9
16
Notice something important. gen_squares()
includes the built-in
range()
function. This returns an iterable
object. That is important.
Because imagine range()
returned a list. If that’s the case, and MAX
is huge, that creates a huge list inside your generator
function, completely destroying its scalability.
The larger point: Generator functions are only as scalable as their least scalable line of code. Generator functions potentially have a small memory footprint. But only if you code them intelligently. When writing generator functions, watch out for hidden bottlenecks like this.
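As a quick sanity check of that scalability claim, you might try something like the following sketch, which asks an enormous gen_squares() for just its first five values. itertools.islice() is a standard-library helper that takes a bounded slice of any iterator, so nothing beyond those five squares is ever computed.

from itertools import islice

huge = gen_squares(10_000_000_000)   # returns instantly; no work done yet
for square in islice(huge, 5):       # only five values are ever computed
    print(square)                    # prints 0, 1, 4, 9, 16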
Do You Need Generators?
Strictly speaking, we don’t need generator functions for iteration. We just want them, because they make useful patterns of scalability far easier.
For example: can you create an iterator without writing a generator function? Yes, you can. For creating a list of square numbers, you can do it like this:
class SquaresIterator:
    def __init__(self, max_root_value):
        self.max_root_value = max_root_value
        self.current_root_value = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.current_root_value >= self.max_root_value:
            raise StopIteration
        square_value = self.current_root_value ** 2
        self.current_root_value += 1
        return square_value

# You can use it like this:
for square in SquaresIterator(5):
    print(square)
Each value is obtained by invoking its __next__()
method, until it
raises StopIteration
. This produces the same output; but take a look at the
source for the SquaresIterator
class and compare it to the source
for the generator above. Which is easier to read? Which is easier to
maintain? And when requirements change, which is easier to modify
without introducing errors? Most people find the generator solution
easier and more natural.
Authors often use the word "generator" by itself, to mean either the generator function, or the generator object returned when you call it. Typically the writer thinks the intended meaning is obvious from the context. Sometimes it is, but often not. Sometimes the writer is not fully clear on the distinction to begin with. But it is an important distinction to get. Just as there is a big difference between a function and the value it returns when you call it, so is there a big difference between the generator function and the generator object it returns.
In your own thought and speech, I encourage you to only use the phrases “generator function” and “generator object”, so you are always clear inside yourself, and in your communication. (This also helps your teammates be more clear.) The only exception: when you truly mean “generator functions and objects”, referring to the language feature as a broad concept, then it’s okay to just say “generators”. I’ll lead by example in this book.
Generator Patterns and Scalable Composability
Here’s a little generator function:
def matching_lines_from_file(path, pattern):
    with open(path) as handle:
        for line in handle:
            if pattern in line:
                yield line.rstrip('\n')
This function, matching_lines_from_file()
, demonstrates several
important practices for modern Python, and is worth studying. It does
simple substring matching on lines of a text file, yielding lines
containing that substring.
The first line opens a read-only file object, called handle
. If you
haven’t been opening your file objects using with
statements, start
today. The main benefit is that once the with
block is exited, the
file object is automatically closed—even if an exception causes a
premature exit. It’s similar to:
try:
    handle = open(path)
    # read from handle
finally:
    handle.close()
(The try
/finally
is explained in Chapter 5.) Next we have for line in handle
. This useful idiom—which
not many people seem to know about—works in a particular way for text
files. With each iteration through the for
loop, a new line of text will
be read from the underlying text file, and placed in the line
variable.
Sometimes people foolishly take another approach, which I have to warn you about:
# Don't do this!!
for line in handle.readlines():
The readlines()
(plural) method reads in the entire file, parses
it into lines, and returns a list of strings—one string per line. By
now, you realize how this can destroy scalability.
It bears repeating: a generator function is only as scalable as its least scalable line. So code carefully, lest you create some memory bottleneck that renders the generator function pointless.
Another approach you will sometimes see, which is scalable, is to
use the file object’s .readline()
method (singular), which manually
returns lines one at a time:
# .readline() reads and returns a single line of text,
# or returns the empty string at end-of-file.
line = handle.readline()
while line != '':
    # do something with line
    # ...
    # At the end of the loop, read the next line.
    line = handle.readline()
But simply writing for line in handle
is clearer and
easier.
After that, it’s straightforward: matching lines have any trailing
\n
-character stripped, and are yielded to the consumer. When writing
generator functions, ask yourself: “What is the maximum
memory footprint of this function, and how can I minimize it?” You
can think of scalability as inversely proportional to this
footprint. For matching_lines_from_file()
, it will be about equal to
the size of the longest line in the text file. So it is appropriate for
the typical human-readable text file, whose lines are small.
(It’s also possible to point it to, say, a ten-terabyte text file consisting of exactly one line. If you expect something like that, you’ll need a different approach.)
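If you do expect pathological input like that, one option is to read fixed-size chunks instead of lines. The following is a rough sketch of that idea (the function name and chunk size are my own choices); its memory footprint is bounded by chunk_size no matter how the file is laid out.

def chunks_from_file(path, chunk_size=64 * 1024):
    with open(path) as handle:
        while True:
            chunk = handle.read(chunk_size)   # read at most chunk_size characters
            if chunk == '':                   # empty string signals end-of-file
                break
            yield chunk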
Text Lines to Dicts
Now, suppose a log file contains lines like these:
WARNING: Disk usage exceeding 85%
DEBUG: User 'tinytim' upgraded to Pro version
INFO: Sent email campaign, completed normally
WARNING: Almost out of beer
Say you exercise matching_lines_from_file()
like so:
for line in matching_lines_from_file("log.txt", "WARNING:"):
    print(line)
That yields these records:
WARNING: Disk usage exceeding 85%
WARNING: Almost out of beer
Suppose your application needs that data in dict form:
{"level": "WARNING", "message": "Disk usage exceeding 85%"} {"level": "DEBUG", "message": "User 'tinytim' upgraded to Pro version"}
We want to scalably transform the records from one form to another—from strings (lines of the log file) to Python dicts. So let’s make a new generator function to connect them:
def parse_log_records(lines):
    for line in lines:
        level, message = line.split(": ", 1)
        yield {"level": level, "message": message}
Now we can connect the two:
# log_lines is a generator object
log_lines = matching_lines_from_file("log.txt", "WARNING:")
for record in parse_log_records(log_lines):
    # record is a dict
    print(record)
Of course, parse_log_records()
can be used on its own:
with open("log.txt") as handle:
    for record in parse_log_records(handle):
        print(record)
matching_lines_from_file()
and parse_log_records()
are like building
blocks. Properly designed, they can be used to build different data
processing streams. I call this scalable composability.
It goes beyond designing composable functions and types. Ask yourself how you can make the components scalable, so that whatever is assembled out of them is scalable too.
Composable Interfaces
Let’s discuss a particular design point. Both
matching_lines_from_file()
and parse_log_records()
produce an
iterator. (Or, more specifically, a generator
object.)1 But they have a discrepancy
on the input side: parse_log_records()
accepts an iterator, but
matching_lines_from_file()
requires a path to a file to read from.
This means matching_lines_from_file()
combines two functions:
reading lines of text from a file, then filtering those lines based on
some criteria.
Combining functions like this is often what you want in realistic
code. But when designing components to flexibly compose together,
inconsistent interfaces like this can be limiting. Let’s break up the
services in matching_lines_from_file()
into two generator functions:
def lines_from_file(path):
    with open(path) as handle:
        for line in handle:
            yield line.rstrip('\n')

def matching_lines(lines, pattern):
    for line in lines:
        if pattern in line:
            yield line
You can compose these like so:
lines = lines_from_file("log.txt")
matching = matching_lines(lines, 'WARNING:')
for line in matching:
    print(line)
Or even redefine matching_lines_from_file()
in terms of them:
def matching_lines_from_file(pattern, path):
    lines = lines_from_file(path)
    matching = matching_lines(lines, pattern)
    for line in matching:
        yield line
Conceptually, this factored-out matching_lines
does a filtering
operation; all lines are read in, and a subset of them are
yielded. parse_log_records()
is different. One input record (a str
line) maps to exactly one output record (a dict
). Mathematically,
it’s a mapping operation. Think of it as a transformer or
adapter. lines_from_file()
is in a third category; instead of taking
a stream as input, it takes a completely different parameter. Since it
still returns an iterator of records, think of it as a source. And
a real program will eventually want to do something with that
stream, consuming it without producing another iterator; call that a
sink.
You need all these pieces to make a working program. When designing a chainable set of generator functions like this—think of it as a toolkit for constructing internal data pipelines—ask yourself whether each component is a sink or a source; whether it does filtering or mapping; or whether it’s some combination of these. Just asking yourself this question leads to a more usable, readable, and maintainable codebase. And if you’re making a library which others will use, you’re more likely to end up with a toolkit so powerfully flexible, people will use it to build programs you never imagined.
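To make those roles concrete, here is one way the pieces defined in this chapter might be wired into a complete pipeline. The collect_messages() sink at the end is my own illustrative addition, not something defined earlier:

def collect_messages(records):
    # Sink: consume the stream of dicts, returning a plain list of messages.
    return [record["message"] for record in records]

lines = lines_from_file("log.txt")                 # source
warnings = matching_lines(lines, "WARNING:")       # filter
records = parse_log_records(warnings)              # mapping
print(collect_messages(records))                   # sink drains the stream

Each stage holds only one record at a time until the sink finally consumes the stream, so the whole pipeline stays as scalable as its least scalable stage.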
Fanning Out
I want you to notice something about parse_log_records()
. As I said,
it fits in the “mapping” category. And notice its mapping is one input
(line of text) to one output (dictionary). In other words, each record
in the input (a str
) becomes exactly one record in the output (a dict
).
That isn’t always the case. Sometimes, your generator function needs to consume several input records to create one output record. Or the opposite: one input record yields several output records.
Here’s an example of the latter. Imagine a text file containing lines in a poem:2
all night our room was outer-walled with rain
drops fell and flattened on the tin roof
and rang like little disks of metal
Let’s create a generator function, words_in_text()
, producing the
words one at a time. Here is a first approach:
# lines is an iterator of text file lines,
# e.g. returned by lines_from_file()
def words_in_text(lines):
    for line in lines:
        for word in line.split():
            yield word
This generator function3 takes a fan out approach. No input records are dropped, which means it doesn’t do any filtering; it’s still purely in the “mapping” category of generator functions. But the mapping isn’t one-to-one. Rather, one input record produces one or more output records. Run the following code:
poem_lines = lines_from_file("poem.txt")
poem_words = words_in_text(poem_lines)
for word in poem_words:
    print(word)
It will produce this output:
all
night
our
room
was
outer-walled
...
That first input record—“all night our room was outer-walled with rain”—yields eight words (output records). Ignoring any blank lines in the poem, every input line will produce at least one word—probably several.
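As footnote 3 hints, line.split() builds a full list of each line's words before yielding any of them. For typical text that is fine, but if individual lines might be huge, one lower-footprint sketch (my own, using the standard re module) yields one word at a time straight from the regular-expression matcher:

import re

def words_in_text(lines):
    for line in lines:
        # re.finditer() yields one match object at a time, so no
        # per-line list of words is ever built.
        for match in re.finditer(r"\S+", line):
            yield match.group(0)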
Fanning In
The idea of fanning out is interesting, but simple enough. It’s more complex when we go the opposite direction: fanning in. That means the generator function consumes more than one input record to produce each output record. Doing this successfully requires awareness of the input’s structure, and you’ll typically need to encode some simple parsing logic.
Imagine a text file containing residential house sales data. Each record is a set of key-value pairs, one pair per line, with records separated by blank lines:
address: 1423 99th Ave
square_feet: 1705
price_usd: 340210

address: 24257 Pueblo Dr
square_feet: 2305
price_usd: 170210

address: 127 Cochran
square_feet: 2068
price_usd: 320500
To read this data into a form usable in our code, what we want is a
generator function—let’s name it house_records()
—which accepts
a sequence of strings (lines) and parses them into convenient dictionaries:
>>> lines_of_house_data = lines_from_file("housedata.txt")
>>> houses = house_records(lines_of_house_data)
>>> # Fetch the first record.
... house = next(houses)
>>> house['address']
'1423 99th Ave'
>>> house = next(houses)
>>> house['address']
'24257 Pueblo Dr'
How would you create this? If practical, try it: pause here, open up a code editor, and see if you can implement it.
Okay, time’s up. Here is one approach:
def house_records(lines):
    record = {}
    for line in lines:
        if line == '':
            yield record
            record = {}
            continue
        key, value = line.split(': ', 1)
        record[key] = value
    yield record
Notice where the yield
keywords appear. The last line of the for
loop reads individual key-value pairs. An empty record
dictionary is populated with data until lines
produces an empty
line. That signals the current record is complete, so it’s
yield
-ed, and a new record dictionary created. The end of the
very last record in housedata.txt
is signaled not by an empty line,
but by the end of the file; that’s why we need the final yield
statement.
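One detail to be aware of: as written, house_records() will yield an empty dict if the input contains consecutive blank lines, or a blank line at the very end. If your data might be messy in that way, a slightly more defensive variant (my own tweak, not the chapter's definition) only yields a record that actually has data:

def house_records(lines):
    record = {}
    for line in lines:
        if line == '':
            if record:          # skip empty records caused by extra blank lines
                yield record
                record = {}
            continue
        key, value = line.split(': ', 1)
        record[key] = value
    if record:                  # don't yield an empty dict for a trailing blank line
        yield record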
As defined, house_records()
is a bit clunky if we’re normally reading
from a text file. It makes sense to define a new generator function
accepting just the path to the file:
def house_records_from_file(path):
    lines = lines_from_file(path)
    for house in house_records(lines):
        yield house

# Then in your program:
for house in house_records_from_file("housedata.txt"):
    print(house["address"])
You may have noticed many of these examples have a bit of
boilerplate when one generator function internally calls another.
The last two lines of house_records_from_file
say:
for house in house_records(lines):
    yield house
Python provides a shortcut to accomplish this in one line, with the
yield from
statement:
def house_records_from_file(path):
    lines = lines_from_file(path)
    yield from house_records(lines)
Even though “yield from” is two words, semantically it’s a single
keyword, distinct from yield
. The yield from
statement is used
specifically in generator functions, when they yield values directly
from another generator object (or, equivalently, by calling another
generator function). Using it often simplifies your code, as you see
in house_records_from_file()
.
Going back a bit, here’s how it works with
matching_lines_from_file()
:
def matching_lines_from_file(pattern, path):
    lines = lines_from_file(path)
    yield from matching_lines(lines, pattern)
The formal name for what yield from
does is “delegating to a
subgenerator”, which instills a deeper connection between the
containing and inner generator objects. In particular, generator
objects have certain methods—send()
, throw()
, and close()
—for
passing information back into the context of the running generator
function. I won’t cover them here, as they are currently not widely
used; you can learn more by reading PEPs
342 and
380. If you do use them,
yield from
becomes necessary to enable the flow of information back
into the scope of the running coroutine.
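One more point worth knowing: yield from accepts any iterable, not just a generator object. Here is a tiny sketch (the function is my own, similar in spirit to itertools.chain()):

def chained(*iterables):
    for iterable in iterables:
        yield from iterable

>>> list(chained([1, 2], "ab", range(3)))
[1, 2, 'a', 'b', 0, 1, 2]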
Python Is Filled with Iterators
Iteration has snuck into many places in Python. The built-in range()
returns an iterable:
>>> seq = range(3)
>>> type(seq)
<class 'range'>
>>> for n in seq:
...     print(n)
0
1
2
The built-in map
, filter
, zip
, and enumerate
functions all return
iterators:
>>> numbers = [1, 2, 3]
>>> big_numbers = [100, 200, 300]
>>> def double(n):
...     return 2 * n
>>> def is_even(n):
...     return n % 2 == 0
>>> mapped = map(double, numbers)
>>> mapped
<map object at 0x1013ac518>
>>> for num in mapped:
...     print(num)
2
4
6
>>> filtered = filter(is_even, numbers)
>>> filtered
<filter object at 0x1013ac668>
>>> for num in filtered:
...     print(num)
2
>>> zipped = zip(numbers, big_numbers)
>>> zipped
<zip object at 0x1013a9608>
>>> for pair in zipped:
...     print(pair)
(1, 100)
(2, 200)
(3, 300)
Notice that mapped
is something called a “map object”, rather than a
list of the results of the calculation; filtered
and zipped
are
similar. These are all iterators—giving you all the benefits of
iteration, built into the language.
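One consequence worth remembering: like generator objects, these are single-use iterators. Once a map object (say) has been run through, it is exhausted, and a second pass produces nothing. You can verify that at the prompt:

>>> mapped = map(double, numbers)
>>> list(mapped)
[2, 4, 6]
>>> list(mapped)   # the iterator is already exhausted
[]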
The Iterator Protocol
This optional section explains Python’s iterator protocol in formal detail, giving you a precise and low-level understanding of how generators, iterators, and iterables all work. For the day-to-day coding of most programmers, it’s not nearly as important as everything else in this chapter. That said, you need this information to implement custom iterable collection types. Personally, I also find knowing the protocol helps me reason through iteration-related issues and edge cases; by knowing these details, I’m able to quickly troubleshoot and fix certain bugs that might otherwise eat up my afternoon.
If this all sounds valuable to you, keep reading; otherwise, feel free to skip to the next chapter. You can always come back to read it later.
As mentioned, Python makes a distinction between iterators, versus objects that are iterable. The difference is subtle to begin with, and frankly it doesn’t help that the two words sound nearly identical. Keep clear in your mind that iterator and iterable are distinct but related concepts, and the following will be easier to understand.
Informally, an iterator is something you can pass to next()
, or use
exactly once in a for
loop. More formally, an object in Python is
an iterator if it follows the iterator protocol. And an object follows
the iterator protocol if it meets the following criteria:
- It defines a method named __next__(), called with no arguments.
- Each time __next__() is called, it produces the next item in the sequence, until all items have been produced. Then, subsequent calls to __next__() raise StopIteration.
- It also defines a boilerplate method named __iter__(), called with no arguments, and returning the same iterator. Its body is literally return self.
Any object with these methods can call itself a Python iterator. You
are not intended to call the __next__()
method directly. Instead,
you will use the built-in next()
function.
To understand better, here is how you might write your own next()
function:
_NO_DEFAULT = object()

def next(it, default=_NO_DEFAULT):
    try:
        return it.__next__()
    except StopIteration:
        if default is _NO_DEFAULT:
            raise
        return default
(As a side note, notice how this function creates a unique sentinel value,
_NO_DEFAULT
, rather than defaulting to a built-in value like None
. A
sentinel value is a value that exists solely for signaling in the
algorithm, and is meant to never overlap with a possible value for
real data. This way, you can pass any value to default
that you like
without conflict.)
Now, all the above is for the iterator. Let’s explain the other word, “iterable”. Informally, an
object is iterable if you can use it in a for
loop. More formally,
a Python object is iterable if it meets one of these two
criteria:
- It defines a method called __iter__(), which creates and returns an iterator over the elements in the container; or
- It defines __getitem__()—the magic method for square brackets—and lets you reference foo[0], foo[1], etc., raising an IndexError once you go past the last element.4
(Notice “iterator” is a noun, while “iterable” is usually an adjective. This can help you remember which is which.)
When implementing your own container type, you probably want to make
it iterable, so you and others can use it in a for
loop. Depending
on the nature of the container, it’s often easiest to implement the
sequence protocol. As an example, consider this UniqueList
type, which is a kind of hybrid between a list and a set:
class UniqueList:
    def __init__(self, items):
        self.items = []
        for item in items:
            self.append(item)

    def append(self, item):
        if item not in self.items:
            self.items.append(item)

    def __getitem__(self, index):
        return self.items[index]
Use it like this:
>>> u = UniqueList([3, 7, 2, 9, 3, 4, 2])
>>> u.items
[3, 7, 2, 9, 4]
>>> u[3]
9
>>> u[42]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 10, in __getitem__
IndexError: list index out of range
The __getitem__()
method implements square-bracket access;
basically, Python translates u[3]
into u.__getitem__(3)
. We’ve
programmed this object’s square brackets to operate much like a normal
list, in that the initial element is at index 0, and you get
subsequent elements with subsequent integers, not skipping any. And
when you go past the end, it raises IndexError
. If an object has a
__getitem__()
method behaving in just this way, iter()
knows how
to create an iterator over it:
>>> u_iter = iter(u)
>>> type(u_iter)
<class 'iterator'>
>>> for num in u_iter:
...     print(num)
3
7
2
9
4
Notice we get a lot of this behavior for free, simply because we’re
using an actual list internally (and thus delegating much of the
__getitem__()
logic to it). That’s a clue for you, whenever you make
a custom collection that acts like a list—or one of the other
standard collection types. If your object internally stores its data
in one of the standard data types, you’ll often have an easier time
mimicking its behavior.
Sometimes you can shortcut even more by inheriting from something
which is iterable. Another (and perhaps better5) way to implement
UniqueList
is to inherit from list
:
class UniqueList(list):
    def __init__(self, items):
        for item in items:
            self.append(item)

    def append(self, item):
        if item not in self:
            super().append(item)
Then you can treat it as you would any Python list, with its full iterable qualities.
Writing a __getitem__()
method which acts like a list’s is one way
to make your class iterable. (And optionally adding __len__()
.)
The other involves writing an __iter__()
method. When called with no
arguments, it must return some object which follows the iterator
protocol, described above. In the worst case, you’ll need to implement
something like the SquaresIterator
class from earlier in this
chapter, with its own __next__()
and __iter__()
methods. But
usually you don’t have to work that hard—you can simply
return a generator object instead. That means __iter__()
is a
generator function itself, or it internally calls some other generator
function, returning its value.
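Here is a sketch of that approach (the class is my own example): because __iter__() contains a yield, calling it returns a fresh generator object, which is itself a perfectly good iterator.

class Countdown:
    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # A generator function: calling it returns a generator object.
        n = self.start
        while n > 0:
            yield n
            n -= 1

>>> for num in Countdown(3):
...     print(num)
3
2
1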
Iterators must always have an __iter__()
method, as do some
iterables. Both are called with no argument, and both return an
iterator object. The only difference: the __iter__()
for the
iterator returns its self
, while an iterable’s __iter__()
will
create and return a new iterator. And if you call it twice, you get
two different iterators.
This similarity is intentional, to simplify control code that can
accept either iterators or iterables. Here’s the mental model you can
safely follow: when Python’s runtime encounters a for
loop, it will
start by invoking iter(sequence)
. This always returns an iterator:
either sequence
itself, or (if sequence
is only iterable) the
iterator created by sequence.__iter__()
.
Iterables are everywhere in Python. Almost all built-in collection
types are iterable: list
, tuple
, and set
, and even dict
.
(Though you’ll probably want to use dict.items()
—a simple for x
in some_dict
will iterate just over the keys).
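For example, a quick sketch at the prompt (the dict contents are mine):

>>> config = {"host": "localhost", "port": 8080}
>>> for key in config:
...     print(key)
host
port
>>> for key, value in config.items():
...     print(key, value)
host localhost
port 8080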
In your own custom collection classes, sometimes the easiest way to
implement __iter__()
actually involves using iter()
. For
instance, this will not work:
class BrokenInLoops:
    def __init__(self):
        self.items = [7, 3, 9]

    def __iter__(self):
        return self.items
If you try it, you get a TypeError
:
>>> items = BrokenInLoops()
>>> for item in items:
...     print(item)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: iter() returned non-iterator of type 'list'
It doesn’t work because __iter__()
is supposed to return an iterator, but a
list object is not an iterator; it is simply iterable. You can fix
this with a one-line change: use iter()
to create an iterator object
inside of __iter__()
, and return that object:
class WorksInLoops:
    def __init__(self):
        self.items = [7, 3, 9]

    def __iter__(self):
        # This class is identical to BrokenInLoops,
        # except for this next line.
        return iter(self.items)
This makes WorksInLoops itself iterable, because __iter__() now returns an actual iterator object, which is exactly what the protocol requires of an iterable. That __iter__() method generates a fresh iterator each time it is called:
>>> items = WorksInLoops()
>>> for item in items:
...     print(item)
7
3
9
>>> for another_item in items:
...     print(another_item)
7
3
9
Conclusion
Infusing your Python code with generators has a profound effect. All the code you write becomes more memory-efficient, more responsive, and more robust. Your programs are automatically able to gracefully handle larger input sizes than you anticipated. And this naturally boosts your reputation as someone who consistently creates high-quality software.
1 Remember: every generator object is an iterator. But not every iterator is a generator object.
2 From “Summer Rain” by Amy Lowell, https://www.poets.org/poetsorg/poem/summer-rain
3 line.split()
returns a list of word strings. A lower-memory-footprint approach would be to create an iterator producing one word at a time.
4 This is a subset of what is called the sequence protocol. A Python object conforms to the sequence protocol if it defines __getitem__()
, and also __len__()
, which returns the length of the sequence. But iter()
is smart enough to work even if only the __getitem__()
method is defined.
5 This inheritance-based version demonstrates an “is-a” relationship, while the previous version demonstrates a “has-a” relationship. Both will work, but because a UniqueList can be considered a kind of list, one can argue that the inheritance-based version makes more logical sense, and it will work in more intuitive ways with functions like isinstance().