Influenzmaschine
Influenzmaschine (source: Hans via Pixabay).

I had no idea how important it would turn out to be. It changed the way I wrote software forever, in every language.

By "it," I mean a Python construct called generators. It’s hard to convey everything they can do for you in a few words; you’ll have to keep reading to find out. In a nutshell: they help your program efficiently scale, while at the same time providing delightful encapsulation patterns. Let me explain.

A generator looks a lot like a function, but uses the keyword "yield" instead of "return" (you can skip ahead a bit if you're already familiar):

def gen_nums():
    n = 0
    while n < 4:
        yield n
        n += 1

That's a generator function. When you call it, it returns a generator object:

>>> nums = gen_nums()
>>> type(nums)
<class 'generator'>

A generator object is a Python iterator, so you can use it in a for loop, like this:

>>> for num in nums:
...     print(num)
0
1
2
3

(Notice how state is encapsulated within the body of the generator function. That has interesting implications, as you read to the end.) You can also step through one by one, using the built-in next() function:

>>> more_nums = gen_nums()
>>> next(more_nums)
0
>>> next(more_nums)
1
>>> next(more_nums)
2
>>> next(more_nums)
3

A for-loop effectively calls this each time through the loop, to get the next value. Repeatedly calling next() ourselves makes it easier to see what's going on. Each time the generator yields, it pauses at that point in the "function." And when next() is called again, it picks up on the next line. (In this case, that means incrementing the counter, then going back to the top of the while loop.)

What happens if you call next() past the end?

>>> next(more_nums)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

StopIteration is a built-in exception type, automatically raised once the generator stops yielding. It's the signal to the for loop to stop, well, looping. Alternatively, you can pass a default value as the second argument to next():

>>> next(more_nums, 42)
42

Translating

My favorite way to use generators is translating one stream into another. Suppose your application is reading records of data from a large file, in this format:

article: Buzz_Aldrin
requests: 2183
bytes_served: 71437760

article: Isaac_Newton
requests: 25810
bytes_served: 1680833779

article: Main_Page
requests: 559944
bytes_served: 9458944804

article: Olympic_Games
requests: 1173
bytes_served: 147591514

...

This Wikipedia page data shows the total requests for different pages over a single hour. Imagine you want to access this in your program as a sequence of Python dictionaries:

# records is a list of dicts.
for record in records:
    print("{} had {} requests in the past hour".format(
        record["article"], record["requests"]))

And ideally without slurping in the entire large file. Each record is split over several lines, so you have to do something more sophisticated than "for line in myfile:". The nicest way I know to solve this problem in Python is using a generator:

def gen_records(path):
    with open(path) as handle:
        record = {}
        for line in handle:
            if line == "\n":
                yield record
                record = {}
                continue
            key, value = line.rstrip("\n").split(": ", 1)
            record[key] = value

That lets us do things like:

for record in gen_records('data.txt'):
    print("{} had {} requests in the past hour".format(
        record["article"], record["requests"]))

Someone using gen_records doesn't have to know, or care, that the source is a multiline-record format. There is complexity, yes, because that's necessary in this situation. But all that complexity is very nicely encapsulated in the generator function!

This pattern is especially valuable with continuous streams, like from a socket connection, or tailing a log file. Imagine a long-running program that listens on some source and continually working with the produced records - generators are perfect for this.

The benefits of generators

On one level, you can think of a Python generator as (among other things) a readable shortcut for creating iterators. Here's gen_nums again:

def gen_nums():
    n = 0
    while n < 4:
        yield n
        n += 1

If you couldn't use "yield," how would you make an equivalent iterator? Just define a class like this:

class Nums:
    MAX = 4
    def __init__(self):
        self.current = 0
    def __iter__(self):
        return self
    def __next__(self):
        next_value = self.current
        if next_value >= self.MAX:
            raise StopIteration
        self.current += 1
        return next_value

Yikes. An instance of this works just like the generator object above…

nums = Nums()
for num in nums:
    print(num)

>>> nums = Nums()
>>> for num in nums:
...     print(num)
0
1
2
3

…but what a price we paid. Look at how complex that class is, compared to the gen_nums function. Which of these two is easier to read? Which is easier to modify without screwing up your code? Which has the key logic all in one place? For me, the generator is immensely preferable.

And it illustrates the other benefit of generators mentioned earlier: encapsulation. It provides new and useful ways for you to package and isolate internal code dependencies. You can do the same thing with classes, but only by spreading state across several methods, in a way that's not nearly as easy to understand.

The complexity of the class-based approach may be why I never really learned about iterators until I learned about Python's yield. That's embarrassing to admit, because iteration is a critically important concept in software development, independent of language. Compared to using (say) a simple list, iterators tremendously reduce memory footprint; improve scalability; and make your code more responsive to the end user. Through this, Python taught me to be a better programmer in every language.

If you don't consciously use generators yet, learn to do so. You'll be so glad you did. That's why they are a core part of my upcoming Python Beyond the Basics in-person training in Boston on Oct. 10-12, and my Python—The Next Level online training being held Nov. 2-3.

Article image: Influenzmaschine (source: Hans via Pixabay).