Chapter 1. Scaling with Generators
This for loop seems simple:

for item in items:
    do_something_with(item)
And yet, miracles hide here. As you probably know, going through a collection one element at a time is called iteration. Few understand how Python’s iteration system really works and appreciate how deep and well-thought-out it is. This chapter makes you one of those people. You gain the ability to write highly scalable Python applications, which can handle ever-larger data sets in performant, memory-efficient ways.
Iteration is also core to one of Python’s most powerful tools: the generator function. Generator functions are not just a convenient way to create useful iterators. They enable some exquisite patterns of code organization, in a way that—by their very nature—intrinsically encourage excellent coding habits.
This chapter is special, because understanding it threatens to make you a permanently better programmer in every language. Mastering Python generators tends to do that, because of the distinctions and insights you gain along the way. Let’s dive in.
Iteration in Python
Python has a built-in function called iter()
. When you pass it a
collection, you get back an iterator object:
>>> numbers = [7, 4, 11, 3]
>>> iter(numbers)
<list_iterator object at 0x10219dc50>
An iterator is an object producing the values in a sequence, one at a time:
>>> numbers_iter = iter(numbers)
>>> for num in numbers_iter:
...     print(num)
7
4
11
3
You don’t normally need to use iter()
. If you instead write for num
in numbers
, what Python effectively does under the hood is call
iter()
on that collection. This happens automatically. Whatever
object it gets back is used as the iterator for that for
loop:
# This...
for num in numbers:
    print(num)

# ... is effectively just like this:
numbers_iter = iter(numbers)
for num in numbers_iter:
    print(num)
An iterator over a collection is a separate object, with its own
identity—which you can verify with id()
:
>>> # id() returns a unique number for each object.
... # Different objects will always have different IDs.
>>> id(numbers)
4330133896
>>> id(numbers_iter)
4330216640
How does iter()
actually get the iterator? It can do this in several
ways, but one relies on a magic method called __iter__()
. This is a
method any class (including yours) may define, and it is called with
no arguments. Each time, it produces a fresh iterator object. Lists have this method, for example:
>>> numbers.__iter__
<method-wrapper '__iter__' of list object at 0x10130e4c8>
>>> numbers.__iter__()
<list_iterator object at 0x1013180f0>
Python makes a distinction between objects
which are iterators, and objects which are iterable. We say an
object is iterable if and only if you can pass it to iter()
and
get back a ready-to-use iterator. If that object has an __iter__()
method, iter()
will call it to get the iterator. Python lists and
tuples are iterable. So are strings, which is why you can write for
char in my_str:
to iterate over my_str
’s characters. Any
container you might use in a for
loop is iterable.
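To see the distinction at the prompt, here is a small illustrative transcript of my own (the hexadecimal addresses will differ on your machine): passing an iterable like a string to iter() hands back a ready-to-use iterator, while passing a non-iterable object raises a TypeError.

>>> iter("hat")
<str_iterator object at 0x101309e80>
>>> iter(42)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'int' object is not iterable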
A for
loop is the most common way to
step through a sequence. But sometimes your code needs to step through
in a finer-grained way. For this, use the built-in function
next()
. You normally call it with a single argument, which is an
iterator. Each time you call it, next(my_iterator)
fetches and
returns the next element:
>>> names = ["Tom", "Shelly", "Garth"]
>>> # Create a fresh iterator...
>>> names_it = iter(names)
>>> next(names_it)
'Tom'
>>> next(names_it)
'Shelly'
>>> next(names_it)
'Garth'
What happens if you call next(names_it)
again? next()
will raise a
special built-in exception, called StopIteration
:
>>> next(names_it)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
Raising this specific exception is how an iterator signals that the
sequence is done. You rarely have to raise or catch this exception
yourself, though we’ll see some patterns later where it is useful to
do so. A good mental model for how a for
loop works is to imagine it
calling next()
each time through the loop, exiting when
StopIteration
gets raised.
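To make that mental model concrete, here is a rough sketch of what a for loop does behind the scenes. (do_something_with() is a stand-in for the loop body, as at the start of the chapter; this is a simplification, not the interpreter's actual machinery.)

items_iter = iter(items)
while True:
    try:
        item = next(items_iter)
    except StopIteration:
        break
    do_something_with(item)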
When using next()
yourself, you can provide a second argument, for
the default value. If you do, next()
will return that instead of
raising StopIteration
at the end:
>>> names = ["Tom", "Shelly", "Garth"]
>>> new_names_it = iter(names)
>>> next(new_names_it, "Rick")
'Tom'
>>> next(new_names_it, "Rick")
'Shelly'
>>> next(new_names_it, "Rick")
'Garth'
>>> next(new_names_it, "Rick")
'Rick'
>>> next(new_names_it)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>> next(new_names_it, "Jane")
'Jane'
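One way to put that default argument to work, in a sketch of my own rather than from the text above, is to drain an iterator in a while loop, using a unique sentinel object as the default so no legitimate value can be mistaken for "end of sequence":

names_it = iter(["Tom", "Shelly", "Garth"])
_END = object()   # sentinel: no real element will ever be this exact object
while True:
    name = next(names_it, _END)
    if name is _END:
        break
    print(name)

We will meet this sentinel idea again near the end of the chapter, when reimplementing next() itself.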
Consider a different situation. What if you aren’t working with a simple sequence of numbers? What if you are calculating or reading or otherwise obtaining the sequence elements as you go along?
Let’s start with a simple example. Suppose you need to write a function creating a list of square numbers which will be processed by other code:
def fetch_squares(max_root):
    squares = []
    for n in range(max_root):
        squares.append(n ** 2)
    return squares

MAX = 5
for square in fetch_squares(MAX):
    do_something_with(square)
This works. But there are potential problems lurking here. Can you spot any?
Here’s one: what if MAX
is not 5, but 10,000,000? Or
10,000,000,000? Or more? Your memory footprint will be pointlessly
dreadful: the code here creates a massive list, uses it once, then
throws it away. On top of that, the consuming for
loop cannot even
start until the entire list of squares has been fully calculated.
If some poor human is using this program, they’ll wonder if it got
stuck.
Even worse: What if you are not doing arithmetic to get each element—which is fast and cheap—but making a truly expensive calculation? Or making an API call over the network? Or reading from a database? Your program will be sluggish, perhaps unresponsive, and might even crash with an out-of-memory error. Its users will think you’re a terrible programmer.
The solution is to create an iterator to start with, lazily computing each value only when needed. Then each cycle through the loop happens just in time.
So how do you do that? It turns out there are several ways. But the best way is called a generator function. And you’re going to love it!
Generator Functions
Python provides a tool called the generator function, which…well, it’s hard to describe everything it gives you in one sentence. Of its many talents, I’ll first focus on how it is a very useful shortcut for creating iterators.
A generator function looks a lot like a regular function. But instead
of saying return
, it uses a new and different keyword:
yield
. Here’s a simple example:
def gen_nums():
    n = 0
    while n < 4:
        yield n
        n += 1
Use it in a for
loop like this:
>>> for num in gen_nums():
...     print(num)
0
1
2
3
Let’s go through and understand this. When you call gen_nums()
like
a function, it immediately returns a generator object:
>>> sequence = gen_nums()
>>> type(sequence)
<class 'generator'>
The generator function is
gen_nums()
—what we define and then call. A function is a generator
function if and only if it uses “yield” instead of “return”. The
generator object is what that generator function returns when called—sequence
, in this case.
Memorize This Fact
A generator function will always return a generator object. It can never return anything else.
This generator object is an iterator, which means you can iterate
through it using next()
or a for
loop:
>>> sequence = gen_nums()
>>> next(sequence)
0
>>> next(sequence)
1
>>> next(sequence)
2
>>> next(sequence)
3
>>> next(sequence)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>> # Or in a for loop:
... for num in gen_nums():
...     print(num)
0
1
2
3
The flow of code works like this: when next()
is called the first
time, or the for
loop first starts, the body of gen_nums()
starts
executing at the beginning, returning the value to the right of the
yield
.
Advancing next()
So far, this is much like a regular function. But the next time
next()
is called—or, equivalently, the next time through the for
loop—the function does not start at the beginning again. It starts
on the line after the yield
statement. Look at the source of
gen_nums()
again:
def gen_nums():
    n = 0
    while n < 4:
        yield n
        n += 1
gen_nums()
is more general than a function or subroutine. It is
actually a coroutine. You see, a regular function can have several
exit points (otherwise known as return
statements). But it has only
one entry point: each time you call a function, it always starts on
the first line of the function body.
A coroutine is like a function, except it has several possible
entry points. It starts with the first line, like a normal
function. But when it “returns”, the coroutine is not exiting, so much
as pausing. Subsequent calls with next()
—or equivalently, the
next time through the for
loop—start at that yield
statement
again, right where it left off; the reentry point is the line after
the yield
statement.
And that’s the key: Each yield
statement simultaneously defines an exit
point, AND a reentry point.
For generator objects, each time a new value is requested,
the flow of control picks up on the line after the yield
statement. In this case, the next line increments the variable n
,
then continues with the while
loop.
Notice we do not raise StopIteration
anywhere in the body of
gen_nums()
. When the function body finally exits—after it exits
the while
loop, in this case—the generator object automatically
raises StopIteration
.
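You can watch this pause-and-resume behavior directly by putting print() calls around the yield statements. Here is a small experiment you might try at the prompt (the function name and messages are my own):

>>> def noisy_gen():
...     print("body starts")
...     yield 1
...     print("resumed after first yield")
...     yield 2
...     print("resumed after second yield; body about to end")
>>> it = noisy_gen()   # Nothing prints yet: the body has not started running.
>>> next(it)
body starts
1
>>> next(it)
resumed after first yield
2
>>> next(it)
resumed after second yield; body about to end
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration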
Again: each yield
statement simultaneously defines an exit point
and a reentry point. In fact, you can have multiple yield
statements in a generator:
def gen_extra_nums():
    n = 0
    while n < 4:
        yield n
        n += 1
    yield 42  # Second yield
Here’s the output you will get when you use it:
>>> for num in gen_extra_nums():
...     print(num)
0
1
2
3
42
The second yield
is reached after the while
loop exits. When the
function reaches the implicit return at the end, the iteration
stops. Reason through the code above, and convince yourself it makes
sense.
Converting to a Generator Function
Let’s revisit the earlier example of cycling through a sequence of squares. This is how we first did it:
def fetch_squares(max_root):
    squares = []
    for num in range(max_root):
        squares.append(num ** 2)
    return squares

MAX = 5
for square in fetch_squares(MAX):
    do_something_with(square)
As an exercise, pause here, open up a new Python file, and see if you
can write a gen_squares()
generator function that accomplishes the
same thing.
Done? Great. Here’s what it looks like:
def gen_squares(max_root):
    for num in range(max_root):
        yield num ** 2

>>> MAX = 5
>>> for square in gen_squares(MAX):
...     print(square)
0
1
4
9
16
Notice something important. gen_squares()
includes the built-in
range()
function. This returns an iterable
object. That is important.
Because imagine range()
returned a list. If that’s the case, and MAX
is huge, that creates a huge list inside your generator
function, completely destroying its scalability.
The larger point: Generator functions are only as scalable as their least scalable line of code. Generator functions potentially have a small memory footprint. But only if you code them intelligently. When writing generator functions, watch out for hidden bottlenecks like this.
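As a quick sanity check of that scalability claim, you might try something like the following sketch, which asks an enormous gen_squares() for just its first five values. itertools.islice() is a standard-library helper that takes a bounded slice of any iterator, so nothing beyond those five squares is ever computed.

from itertools import islice

huge = gen_squares(10_000_000_000)   # returns instantly; no work done yet
for square in islice(huge, 5):       # only five values are ever computed
    print(square)                    # prints 0, 1, 4, 9, 16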
Do You Need Generators?
Strictly speaking, we don’t need generator functions for iteration. We just want them, because they make useful patterns of scalability far easier.
For example: can you create an iterator without writing a generator function? Yes, you can. For creating a list of square numbers, you can do it like this:
class SquaresIterator:
    def __init__(self, max_root_value):
        self.max_root_value = max_root_value
        self.current_root_value = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.current_root_value >= self.max_root_value:
            raise StopIteration
        square_value = self.current_root_value ** 2
        self.current_root_value += 1
        return square_value

# You can use it like this:
for square in SquaresIterator(5):
    print(square)
Each value is obtained by invoking its __next__()
method, until it
raises StopIteration
. This produces the same output; but take a look at the
source for the SquaresIterator
class and compare it to the source
for the generator above. Which is easier to read? Which is easier to
maintain? And when requirements change, which is easier to modify
without introducing errors? Most people find the generator solution
easier and more natural.
Authors often use the word "generator" by itself, to mean either the generator function, or the generator object returned when you call it. Typically the writer thinks the intended meaning is obvious from the context. Sometimes it is, but often not. Sometimes the writer is not fully clear on the distinction to begin with. But it is an important distinction to get. Just as there is a big difference between a function and the value it returns when you call it, so is there a big difference between the generator function and the generator object it returns.
In your own thought and speech, I encourage you to only use the phrases “generator function” and “generator object”, so you are always clear inside yourself, and in your communication. (This also helps your teammates be more clear.) The only exception: when you truly mean “generator functions and objects”, referring to the language feature as a broad concept, then it’s okay to just say “generators”. I’ll lead by example in this book.
Generator Patterns and Scalable Composability
Here’s a little generator function:
def matching_lines_from_file(path, pattern):
    with open(path) as handle:
        for line in handle:
            if pattern in line:
                yield line.rstrip('\n')
This function, matching_lines_from_file()
, demonstrates several
important practices for modern Python, and is worth studying. It does
simple substring matching on lines of a text file, yielding lines
containing that substring.
The first line opens a read-only file object, called handle
. If you
haven’t been opening your file objects using with
statements, start
today. The main benefit is that once the with
block is exited, the
file object is automatically closed—even if an exception causes a
premature exit. It’s similar to:
try:
    handle = open(path)
    # read from handle
finally:
    handle.close()
(The try
/finally
is explained in Chapter 5.) Next we have for line in handle
. This useful idiom—which
not many people seem to know about—works in a particular way for text
files. With each iteration through the for
loop, a new line of text will
be read from the underlying text file, and placed in the line
variable.
Sometimes people foolishly take another approach, which I have to warn you about:
# Don't do this!!
for line in handle.readlines():
The readlines()
(plural) method reads in the entire file, parses
it into lines, and returns a list of strings—one string per line. By
now, you realize how this can destroy scalability.
It bears repeating: a generator function is only as scalable as its least scalable line. So code carefully, lest you create some memory bottleneck that renders the generator function pointless.
Another approach you will sometimes see, which is scalable, is to
use the file object’s .readline()
method (singular), which manually
returns lines one at a time:
# .readline() reads and returns a single line of text,
# or returns the empty string at end-of-file.
line = handle.readline()
while line != '':
    # do something with line
    # ...
    # At the end of the loop, read the next line.
    line = handle.readline()
But simply writing for line in handle
is clearer and
easier.
After that, it’s straightforward: matching lines have any trailing
\n
-character stripped, and are yielded to the consumer. When writing
generator functions, ask yourself: “What is the maximum
memory footprint of this function, and how can I minimize it?” You
can think of scalability as inversely proportional to this
footprint. For matching_lines_from_file()
, it will be about equal to
the size of the longest line in the text file. So it is appropriate for
the typical human-readable text file, whose lines are small.
(It’s also possible to point it to, say, a ten-terabyte text file consisting of exactly one line. If you expect something like that, you’ll need a different approach.)
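If you do expect pathological input like that, one option is to read fixed-size chunks instead of lines. The following is a rough sketch of that idea (the function name and chunk size are my own choices); its memory footprint is bounded by chunk_size no matter how the file is laid out.

def chunks_from_file(path, chunk_size=64 * 1024):
    with open(path) as handle:
        while True:
            chunk = handle.read(chunk_size)   # read at most chunk_size characters
            if chunk == '':                   # empty string signals end-of-file
                break
            yield chunk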
Text Lines to Dicts
Now, suppose a log file contains lines like these:
WARNING: Disk usage exceeding 85%
DEBUG: User 'tinytim' upgraded to Pro version
INFO: Sent email campaign, completed normally
WARNING: Almost out of beer
Say you exercise matching_lines_from_file()
like so:
for line in matching_lines_from_file("log.txt", "WARNING:"):
    print(line)
That yields these records:
WARNING: Disk usage exceeding 85%
WARNING: Almost out of beer
Suppose your application needs that data in dict form:
{"level": "WARNING", "message": "Disk usage exceeding 85%"} {"level": "DEBUG", "message": "User 'tinytim' upgraded to Pro version"}
We want to scalably transform the records from one form to another—from strings (lines of the log file) to Python dicts. So let’s make a new generator function to connect them:
def parse_log_records(lines):
    for line in lines:
        level, message = line.split(": ", 1)
        yield {"level": level, "message": message}
Now we can connect the two:
# log_lines is a generator object
log_lines = matching_lines_from_file("log.txt", "WARNING:")
for record in parse_log_records(log_lines):
    # record is a dict
    print(record)
Of course, parse_log_records()
can be used on its own:
with open("log.txt") as handle:
    for record in parse_log_records(handle):
        print(record)
matching_lines_from_file()
and parse_log_records()
are like building
blocks. Properly designed, they can be used to build different data
processing streams. I call this scalable composability.
It goes beyond designing composable functions and types. Ask yourself how you can make the components scalable, so that whatever is assembled out of them is scalable too.
Composable Interfaces
Let’s discuss a particular design point. Both
matching_lines_from_file()
and parse_log_records()
produce an
iterator. (Or, more specifically, a generator
object.)1 But they have a discrepancy
on the input side: parse_log_records()
accepts an iterator, but
matching_lines_from_file()
requires a path to a file to read from.
This means matching_lines_from_file()
combines two functions:
reading lines of text from a file, then filtering those lines based on
some criteria.
Combining functions like this is often what you want in realistic
code. But when designing components to flexibly compose together,
inconsistent interfaces like this can be limiting. Let’s break up the
services in matching_lines_from_file()
into two generator functions:
def lines_from_file(path):
    with open(path) as handle:
        for line in handle:
            yield line.rstrip('\n')

def matching_lines(lines, pattern):
    for line in lines:
        if pattern in line:
            yield line
You can compose these like so:
lines = lines_from_file("log.txt")
matching = matching_lines(lines, 'WARNING:')
for line in matching:
    print(line)
Or even redefine matching_lines_from_file()
in terms of them:
def matching_lines_from_file(pattern, path):
    lines = lines_from_file(path)
    matching = matching_lines(lines, pattern)
    for line in matching:
        yield line
Conceptually, this factored-out matching_lines
does a filtering
operation; all lines are read in, and a subset of them are
yielded. parse_log_records()
is different. One input record (a str
line) maps to exactly one output record (a dict
). Mathematically,
it’s a mapping operation. Think of it as a transformer or
adapter. lines_from_file()
is in a third category; instead of taking
a stream as input, it takes a completely different parameter. Since it
still returns an iterator of records, think of it as a source. And
a real program will eventually want to do something with that
stream, consuming it without producing another iterator; call that a
sink.
You need all these pieces to make a working program. When designing a chainable set of generator functions like this—think of it as a toolkit for constructing internal data pipelines—ask yourself whether each component is a sink or a source; whether it does filtering or mapping; or whether it’s some combination of these. Just asking yourself this question leads to a more usable, readable, and maintainable codebase. And if you’re making a library which others will use, you’re more likely to end up with a toolkit so powerfully flexible, people will use it to build programs you never imagined.
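To make those roles concrete, here is one way the pieces defined in this chapter might be wired into a complete pipeline. The collect_messages() sink at the end is my own illustrative addition, not something defined earlier:

def collect_messages(records):
    # Sink: consume the stream of dicts, returning a plain list of messages.
    return [record["message"] for record in records]

lines = lines_from_file("log.txt")                 # source
warnings = matching_lines(lines, "WARNING:")       # filter
records = parse_log_records(warnings)              # mapping
print(collect_messages(records))                   # sink drains the stream

Each stage holds only one record at a time until the sink finally consumes the stream, so the whole pipeline stays as scalable as its least scalable stage.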
Fanning Out
I want you to notice something about parse_log_records()
. As I said,
it fits in the “mapping” category. And notice its mapping is one input
(line of text) to one output (dictionary). In other words, each record
in the input (a str
) becomes exactly one record in the output (a dict
).
That isn’t always the case. Sometimes, your generator function needs to consume several input records to create one output record. Or the opposite: one input record yields several output records.
Here’s an example of the latter. Imagine a text file containing lines in a poem:2
all night our room was outer-walled with rain
drops fell and flattened on the tin roof
and rang like little disks of metal
Let’s create a generator function, words_in_text()
, producing the
words one at a time. Here is a first approach:
# lines is an iterator of text file lines,
# e.g. returned by lines_from_file()
def words_in_text(lines):
    for line in lines:
        for word in line.split():
            yield word
This generator function3 takes a fan out approach. No input records are dropped, which means it doesn’t do any filtering; it’s still purely in the “mapping” category of generator functions. But the mapping isn’t one-to-one. Rather, one input record produces one or more output records. Run the following code:
poem_lines = lines_from_file("poem.txt")
poem_words = words_in_text(poem_lines)
for word in poem_words:
    print(word)
It will produce this output:
all
night
our
room
was
outer-walled
...
That first input record—“all night our room was outer-walled with rain”—yields eight words (output records). Ignoring any blank lines in the poem, every input line will produce at least one word—probably several.
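As footnote 3 hints, line.split() builds a full list of each line's words before yielding any of them. For typical text that is fine, but if individual lines might be huge, one lower-footprint sketch (my own, using the standard re module) yields one word at a time straight from the regular-expression matcher:

import re

def words_in_text(lines):
    for line in lines:
        # re.finditer() yields one match object at a time, so no
        # per-line list of words is ever built.
        for match in re.finditer(r"\S+", line):
            yield match.group(0)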
Fanning In
The idea of fanning out is interesting, but simple enough. It’s more complex when we go the opposite direction: fanning in. That means the generator function consumes more than one input record to produce each output record. Doing this successfully requires awareness of the input’s structure, and you’ll typically need to encode some simple parsing logic.
Imagine a text file containing residential house sales data. Each record is a set of key-value pairs, one pair per line, with records separated by blank lines:
address: 1423 99th Ave
square_feet: 1705
price_usd: 340210

address: 24257 Pueblo Dr
square_feet: 2305
price_usd: 170210

address: 127 Cochran
square_feet: 2068
price_usd: 320500
To read this data into a form usable in our code, what we want is a
generator function—let’s name it house_records()
—which accepts
a sequence of strings (lines) and parses them into convenient dictionaries:
>>> lines_of_house_data = lines_from_file("housedata.txt")
>>> houses = house_records(lines_of_house_data)
>>> # Fetch the first record.
... house = next(houses)
>>> house['address']
'1423 99th Ave'
>>> house = next(houses)
>>> house['address']
'24257 Pueblo Dr'
How would you create this? If practical, try it: pause here, open up a code editor, and see if you can implement it.
Okay, time’s up. Here is one approach:
def house_records(lines):
    record = {}
    for line in lines:
        if line == '':
            yield record
            record = {}
            continue
        key, value = line.split(': ', 1)
        record[key] = value
    yield record
Notice where the yield
keywords appear. The last line of the for
loop reads individual key-value pairs. An empty record
dictionary is populated with data until lines
produces an empty
line. That signals the current record is complete, so it’s
yield
-ed, and a new record dictionary created. The end of the
very last record in housedata.txt
is signaled not by an empty line,
but by the end of the file; that’s why we need the final yield
statement.
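One detail to be aware of: as written, house_records() will yield an empty dict if the input contains consecutive blank lines, or a blank line at the very end. If your data might be messy in that way, a slightly more defensive variant (my own tweak, not the chapter's definition) only yields a record that actually has data:

def house_records(lines):
    record = {}
    for line in lines:
        if line == '':
            if record:          # skip empty records caused by extra blank lines
                yield record
                record = {}
            continue
        key, value = line.split(': ', 1)
        record[key] = value
    if record:                  # don't yield an empty dict for a trailing blank line
        yield record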
As defined, house_records()
is a bit clunky if we’re normally reading
from a text file. It makes sense to define a new generator function
accepting just the path to the file:
def house_records_from_file(path):
    lines = lines_from_file(path)
    for house in house_records(lines):
        yield house

# Then in your program:
for house in house_records_from_file("housedata.txt"):
    print(house["address"])
You may have noticed many of these examples have a bit of
boilerplate when one generator function internally calls another.
The last two lines of house_records_from_file
say:
for house in house_records(lines):
    yield house
Python provides a shortcut to accomplish this in one line, with the
yield from
statement:
def house_records_from_file(path):
    lines = lines_from_file(path)
    yield from house_records(lines)
Even though “yield from” is two words, semantically it’s a single
keyword, distinct from yield
. The yield from
statement is used
specifically in generator functions, when they yield values directly
from another generator object (or, equivalently, by calling another
generator function). Using it often simplifies your code, as you see
in house_records_from_file()
.
Going back a bit, here’s how it works with
matching_lines_from_file()
:
def matching_lines_from_file(pattern, path):
    lines = lines_from_file(path)
    yield from matching_lines(lines, pattern)
The formal name for what yield from
does is “delegating to a
subgenerator”, which instills a deeper connection between the
containing and inner generator objects. In particular, generator
objects have certain methods—send()
, throw()
, and close()
—for
passing information back into the context of the running generator
function. I won’t cover them here, as they are currently not widely
used; you can learn more by reading PEPs
342 and
380. If you do use them,
yield from
becomes necessary to enable the flow of information back
into the scope of the running coroutine.
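One more point worth knowing: yield from accepts any iterable, not just a generator object. Here is a tiny sketch (the function is my own, similar in spirit to itertools.chain()):

def chained(*iterables):
    for iterable in iterables:
        yield from iterable

>>> list(chained([1, 2], "ab", range(3)))
[1, 2, 'a', 'b', 0, 1, 2]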
Python Is Filled with Iterators
Iteration has snuck into many places in Python. The built-in range()
returns an iterable:
>>> seq = range(3)
>>> type(seq)
<class 'range'>
>>> for n in seq:
...     print(n)
0
1
2
The built-in map
, filter
, zip
, and enumerate
functions all return
iterators:
>>> numbers = [1, 2, 3]
>>> big_numbers = [100, 200, 300]
>>> def double(n):
...     return 2 * n
>>> def is_even(n):
...     return n % 2 == 0
>>> mapped = map(double, numbers)
>>> mapped
<map object at 0x1013ac518>
>>> for num in mapped:
...     print(num)
2
4
6
>>> filtered = filter(is_even, numbers)
>>> filtered
<filter object at 0x1013ac668>
>>> for num in filtered:
...     print(num)
2
>>> zipped = zip(numbers, big_numbers)
>>> zipped
<zip object at 0x1013a9608>
>>> for pair in zipped:
...     print(pair)
(1, 100)
(2, 200)
(3, 300)
Notice that mapped
is something called a “map object”, rather than a
list of the results of the calculation; filtered
and zipped
are
similar. These are all iterators—giving you all the benefits of
iteration, built into the language.
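One consequence worth remembering: like generator objects, these are single-use iterators. Once a map object (say) has been run through, it is exhausted, and a second pass produces nothing. You can verify that at the prompt:

>>> mapped = map(double, numbers)
>>> list(mapped)
[2, 4, 6]
>>> list(mapped)   # the iterator is already exhausted
[]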
The Iterator Protocol
This optional section explains Python’s iterator protocol in formal detail, giving you a precise and low-level understanding of how generators, iterators, and iterables all work. For the day-to-day coding of most programmers, it’s not nearly as important as everything else in this chapter. That said, you need this information to implement custom iterable collection types. Personally, I also find knowing the protocol helps me reason through iteration-related issues and edge cases; by knowing these details, I’m able to quickly troubleshoot and fix certain bugs that might otherwise eat up my afternoon.
If this all sounds valuable to you, keep reading; otherwise, feel free to skip to the next chapter. You can always come back to read it later.
As mentioned, Python makes a distinction between iterators, versus objects that are iterable. The difference is subtle to begin with, and frankly it doesn’t help that the two words sound nearly identical. Keep clear in your mind that iterator and iterable are distinct but related concepts, and the following will be easier to understand.
Informally, an iterator is something you can pass to next()
, or use
exactly once in a for
loop. More formally, an object in Python is
an iterator if it follows the iterator protocol. And an object follows
the iterator protocol if it meets the following criteria:
- It defines a method named __next__(), called with no arguments.
- Each time __next__() is called, it produces the next item in the sequence, until all items have been produced. Then, subsequent calls to __next__() raise StopIteration.
- It also defines a boilerplate method named __iter__(), called with no arguments, and returning the same iterator. Its body is literally return self.
Any object with these methods can call itself a Python iterator. You
are not intended to call the __next__()
method directly. Instead,
you will use the built-in next()
function.
To understand better, here is how you might write your own next()
function:
_NO_DEFAULT = object()

def next(it, default=_NO_DEFAULT):
    try:
        return it.__next__()
    except StopIteration:
        if default is _NO_DEFAULT:
            raise
        return default
(As a side note, notice how this function creates a unique sentinel value,
_NO_DEFAULT
, rather than defaulting to a built-in value like None
. A
sentinel value is a value that exists solely for signaling in the
algorithm, and is meant to never overlap with a possible value for
real data. This way, you can pass any value to default
that you like
without conflict.)
Now, all the above is for the iterator. Let’s explain the other word, “iterable”. Informally, an
object is iterable if you can use it in a for
loop. More formally,
a Python object is iterable if it meets one of these two
criteria:
- It defines a method called __iter__(), which creates and returns an iterator over the elements in the container; or
- It defines __getitem__()—the magic method for square brackets—and lets you reference foo[0], foo[1], etc., raising an IndexError once you go past the last element.4
(Notice “iterator” is a noun, while “iterable” is usually an adjective. This can help you remember which is which.)
When implementing your own container type, you probably want to make
it iterable, so you and others can use it in a for
loop. Depending
on the nature of the container, it’s often easiest to implement the
sequence protocol. As an example, consider this UniqueList
type, which is a kind of hybrid between a list and a set:
class UniqueList:
    def __init__(self, items):
        self.items = []
        for item in items:
            self.append(item)

    def append(self, item):
        if item not in self.items:
            self.items.append(item)

    def __getitem__(self, index):
        return self.items[index]
Use it like this:
>>> u = UniqueList([3, 7, 2, 9, 3, 4, 2])
>>> u.items
[3, 7, 2, 9, 4]
>>> u[3]
9
>>> u[42]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 10, in __getitem__
IndexError: list index out of range
The __getitem__()
method implements square-bracket access;
basically, Python translates u[3]
into u.__getitem__(3)
. We’ve
programmed this object’s square brackets to operate much like a normal
list, in that the initial element is at index 0, and you get
subsequent elements with subsequent integers, not skipping any. And
when you go past the end, it raises IndexError
. If an object has a
__getitem__()
method behaving in just this way, iter()
knows how
to create an iterator over it:
>>> u_iter = iter(u)
>>> type(u_iter)
<class 'iterator'>
>>> for num in u_iter:
...     print(num)
3
7
2
9
4
Notice we get a lot of this behavior for free, simply because we’re
using an actual list internally (and thus delegating much of the
__getitem__()
logic to it). That’s a clue for you, whenever you make
a custom collection that acts like a list—or one of the other
standard collection types. If your object internally stores its data
in one of the standard data types, you’ll often have an easier time
mimicking its behavior.
Sometimes you can shortcut even more by inheriting from something
which is iterable. Another (and perhaps better5) way to implement
UniqueList
is to inherit from list
:
class UniqueList(list):
    def __init__(self, items):
        for item in items:
            self.append(item)

    def append(self, item):
        if item not in self:
            super().append(item)
Then you can treat it as you would any Python list, with its full iterable qualities.
Writing a __getitem__()
method which acts like a list’s is one way
to make your class iterable. (And optionally adding __len__()
.)
The other involves writing an __iter__()
method. When called with no
arguments, it must return some object which follows the iterator
protocol, described above. In the worst case, you’ll need to implement
something like the SquaresIterator
class from earlier in this
chapter, with its own __next__()
and __iter__()
methods. But
usually you don’t have to work that hard—you can simply
return a generator object instead. That means __iter__()
is a
generator function itself, or it internally calls some other generator
function, returning its value.
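Here is a sketch of that approach (the class is my own example): because __iter__() contains a yield, calling it returns a fresh generator object, which is itself a perfectly good iterator.

class Countdown:
    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # A generator function: calling it returns a generator object.
        n = self.start
        while n > 0:
            yield n
            n -= 1

>>> for num in Countdown(3):
...     print(num)
3
2
1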
Iterators must always have an __iter__()
method, as do some
iterables. Both are called with no argument, and both return an
iterator object. The only difference: the __iter__()
for the
iterator returns its self
, while an iterable’s __iter__()
will
create and return a new iterator. And if you call it twice, you get
two different iterators.
This similarity is intentional, to simplify control code that can
accept either iterators or iterables. Here’s the mental model you can
safely follow: when Python’s runtime encounters a for
loop, it will
start by invoking iter(sequence)
. This always returns an iterator:
either sequence
itself, or (if sequence
is only iterable) the
iterator created by sequence.__iter__()
.
Iterables are everywhere in Python. Almost all built-in collection
types are iterable: list
, tuple
, and set
, and even dict
.
(Though you’ll probably want to use dict.items()
—a simple for x
in some_dict
will iterate just over the keys).
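For example, a quick sketch at the prompt (the dict contents are mine):

>>> config = {"host": "localhost", "port": 8080}
>>> for key in config:
...     print(key)
host
port
>>> for key, value in config.items():
...     print(key, value)
host localhost
port 8080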
In your own custom collection classes, sometimes the easiest way to
implement __iter__()
actually involves using iter()
. For
instance, this will not work:
class BrokenInLoops:
    def __init__(self):
        self.items = [7, 3, 9]

    def __iter__(self):
        return self.items
If you try it, you get a TypeError
:
>>> items = BrokenInLoops()
>>> for item in items:
...     print(item)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: iter() returned non-iterator of type 'list'
It doesn’t work because __iter__()
is supposed to return an iterator, but a
list object is not an iterator; it is simply iterable. You can fix
this with a one-line change: use iter()
to create an iterator object
inside of __iter__()
, and return that object:
class WorksInLoops:
    def __init__(self):
        self.items = [7, 3, 9]

    def __iter__(self):
        # This class is identical to BrokenInLoops,
        # except for this next line.
        return iter(self.items)
This makes WorksInLoops itself iterable, because __iter__() now returns an actual iterator object, which is exactly what the protocol requires of an iterable. That __iter__() method generates a fresh iterator each time it is called:
>>> items = WorksInLoops()
>>> for item in items:
...     print(item)
7
3
9
>>> for another_item in items:
...     print(another_item)
7
3
9
Conclusion
Infusing your Python code with generators has a profound effect. All the code you write becomes more memory-efficient, more responsive, and more robust. Your programs are automatically able to gracefully handle larger input sizes than you anticipated. And this naturally boosts your reputation as someone who consistently creates high-quality software.
1 Remember: every generator object is an iterator. But not every iterator is a generator object.
2 From “Summer Rain” by Amy Lowell, https://www.poets.org/poetsorg/poem/summer-rain
3 line.split()
returns a list of word strings. A lower-memory-footprint approach would be to create an iterator producing one word at a time.
4 This is a subset of what is called the sequence protocol. A Python object conforms to the sequence protocol if it defines __getitem__()
, and also __len__()
, which returns the length of the sequence. But iter()
is smart enough to work even if only the __getitem__()
method is defined.
5 This inheritance-based version demonstrates an “is-a” relationship, while the previous version demonstrates a “has-a” relationship. Both will work, but because a UniqueList can be considered a kind of list, one can argue that the inheritance-based version makes more logical sense, and it will work in more intuitive ways with functions like isinstance().