Chapter 4. Dictionaries and Sets

Sets and dictionaries are ideal data structures to be used when your data has no intrinsic order (except for insertion order) but does have a unique object that can be used to reference it (the reference object is normally a string, but it can be any hashable type). This reference object is called the key, while the data is the value. Dictionaries and sets are almost identical, except that sets do not actually contain values: a set is simply a collection of unique keys. As the name implies, sets are very useful for doing set operations.

Note

A hashable type is one that implements both the __hash__ magic function and either __eq__ or __cmp__. All native types in Python already implement these, and any user classes have default values. See “Hash Functions and Entropy” for more details.

While we saw in the previous chapter that we are restricted to, at best, O(log n) lookup time on lists/tuples with no intrinsic order (through a search operation), dictionaries and sets give us O(1) lookups based on the arbitrary index. In addition, like lists/tuples, dictionaries and sets have O(1) insertion time.¹ As we will see in “How Do Dictionaries and Sets Work?”, this speed is accomplished through the use of an open address hash table as the underlying data structure.

However, there is a cost to using dictionaries and sets. First, they generally take up a larger footprint in memory. Also, although the complexity for insertions/lookups is O(1), the actual speed depends greatly on the hashing function that is in use. If the hash function is slow to evaluate, any operations on dictionaries or sets will be similarly slow.

Let’s look at an example. Say we want to store contact information for everyone in the phone book. We would like to store this in a form that will make it simple to answer the question “What is John Doe’s phone number?” in the future. With lists, we would store the phone numbers and names sequentially and scan through the entire list to find the phone number we required, as shown in Example 4-1.

Example 4-1. Phone book lookup with a list

def find_phonenumber(phonebook, name):
    for n, p in phonebook:
        if n == name:
            return p
    return None

phonebook = [
    ("John Doe", "555-555-5555"),
    ("Albert Einstein", "212-555-5555"),
]
print(f"John Doe's phone number is {find_phonenumber(phonebook, 'John Doe')}")

Note

We could also do this by sorting the list and using the bisect module (from Example 3-4) in order to get O(log n) performance.

With a dictionary, however, we can simply have the “index” be the names and the “values” be the phone numbers, as shown in Example 4-2. This allows us to simply look up the value we need and get a direct reference to it, instead of having to read every value in our dataset.

Example 4-2. Phone book lookup with a dictionary

phonebook = {
    "John Doe": "555-555-5555",
    "Albert Einstein" : "212-555-5555",
}
print(f"John Doe's phone number is {phonebook['John Doe']}")

For large phone books, the difference between the O(1) lookup of the dictionary and the O(n) time for linear search over the list (or, at best, the O(log n) complexity with the bisect module) is quite substantial.

Tip

Create a script that times the performance of the list-bisect method versus a dictionary for finding a number in a phone book. How does the timing scale as the size of the phone book grows?

If, on the other hand, we wanted to answer the question “How many unique first names are there in my phone book?” we could use the power of sets. Recall that a set is simply a collection of unique keys—this is the exact property we would like to enforce in our data. This is in stark contrast to a list-based approach, where that property needs to be enforced separately from the data structure by comparing all names with all other names. Example 4-3 illustrates.

Example 4-3. Finding unique names with lists and sets

def list_unique_names(phonebook):
    unique_names = []
    for name, phonenumber in phonebook: 
        first_name, last_name = name.split(" ", 1)
        for unique in unique_names: 
            if unique == first_name:
                break
        else:
            unique_names.append(first_name)
    return len(unique_names)

def set_unique_names(phonebook):
    unique_names = set()
    for name, phonenumber in phonebook: 
        first_name, last_name = name.split(" ", 1)
        unique_names.add(first_name) 
    return len(unique_names)

phonebook = [
    ("John Doe", "555-555-5555"),
    ("Albert Einstein", "212-555-5555"),
    ("John Murphey", "202-555-5555"),
    ("Albert Rutherford", "647-555-5555"),
    ("Guido van Rossum", "301-555-5555"),
]

print("Number of unique names from set method:", set_unique_names(phonebook))
print("Number of unique names from list method:", list_unique_names(phonebook))

,: We must go over all the items in our phone book, and thus this loop costs O(n).
: Here, we must check the current name against all the unique names we have already seen. If it is a new unique name, we add it to our list of unique names. We then continue through the list, performing this step for every item in the phone book.
: For the set method, instead of iterating over all unique names we have already seen, we can simply add the current name to our set of unique names. Because sets guarantee the uniqueness of the keys they contain, if you try to add an item that is already in the set, that item simply won’t be added. Furthermore, this operation costs O(1).

The list algorithm’s inner loop iterates over unique_names, which starts out as empty and then grows, in the worst case, when all names are unique, to be the size of phonebook. This can be seen as performing a linear search for each name in the phone book over a list that is constantly growing. Thus, the complete algorithm performs as O(n^2).

On the other hand, the set algorithm has no inner loop; the set.add operation is an O(1) process that completes in a fixed number of operations regardless of how large the phone book is (there are some minor caveats to this, which we will cover while discussing the implementation of dictionaries and sets). Thus, the only nonconstant contribution to the complexity of this algorithm is the loop over the phone book, making this algorithm perform in O(n).

When timing these two algorithms using a phonebook with 10,000 entries and 7,412 unique first names, we see how drastic the difference between O(n) and O(n^2) can be:

>>> %timeit list_unique_names(large_phonebook)
1.13 s ± 26.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit set_unique_names(large_phonebook)
4.48 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In other words, the set algorithm gave us a 252× speedup! In addition, as the size of the phonebook grows, the speed gains increase (we get a 557× speedup with a phonebook with 100,000 entries and 15,574 unique first names).

How Do Dictionaries and Sets Work?

Dictionaries and sets use hash tables to achieve their O(1) lookups and insertions. This efficiency is the result of a very clever usage of a hash function to turn an arbitrary key (i.e., a string or object) into an index for a list. The hash function and list can later be used to determine where any particular piece of data is right away, without a search. By turning the data’s key into something that can be used like a list index, we can get the same performance as with a list. In addition, instead of having to refer to data by a numerical index, which itself implies some ordering to the data, we can refer to it by this arbitrary key.

Inserting and Retrieving

To create a hash table from scratch, we start with some allocated memory, similar to what we started with for arrays. For an array, if we want to insert data, we simply find the smallest unused bucket and insert our data there (and resize if necessary). For hash tables, we must first figure out the placement of the data in this contiguous chunk of memory.

The placement of the new data is contingent on two properties of the data we are inserting: the hashed value of the key and how the value compares to other objects. This is because when we insert data, the key is first hashed and masked so that it turns into an effective index in an array.² The mask makes sure that the hash value, which can take the value of any integer, fits within the allocated number of buckets. So if we have allocated 8 blocks of memory and our hash value is 28975, we consider the bucket at index 28975 & 0b111 = 7. If, however, our dictionary has grown to require 512 blocks of memory, the mask becomes 0b111111111 (and in this case, we would consider the bucket at index 28975 & 0b11111111).

Now we must check if this bucket is already in use. If it is empty, we can insert the key and the value into this block of memory. We store the key so that we can make sure we are retrieving the correct value on lookups. If it is in use and the value of the bucket is equal to the value we wish to insert (a comparison done with the cmp built-in), then the key/value pair is already in the hash table and we can return. However, if the values don’t match, we must find a new place to put the data.

As an extra optimization, Python first appends the key/value data into a standard array and then stores only the index into this array in the hash table. This allows us to reduce the amount of memory used by 30–95%.³ In addition, this gives us the interesting property that we keep a record of the order which new items were added into the dictionary (which, since Python 3.7, is a guarantee that all dictionaries give).

To find the new index, we compute it using a simple linear function, a method called probing. Python’s probing mechanism adds a contribution from the higher-order bits of the original hash (recall that for a table of length 8 we considered only the last three bits of the hash for the initial index, through the use of a mask value of mask = 0b111 = bin(8 - 1)). Using these higher-order bits gives each hash a different sequence of next possible hashes, which helps to avoid future collisions.

There is a lot of freedom when picking the algorithm to generate a new index; however, it is quite important that the scheme visits every possible index in order to evenly distribute the data in the table. How well distributed the data is throughout the hash table is called the load factor and is related to the entropy of the hash function. The pseudocode in Example 4-4 illustrates the calculation of hash indices used in CPython 3.7. This also shows an interesting fact about hash tables: most of the storage space they have is empty!

Example 4-4. Dictionary lookup sequence

def index_sequence(key, mask=0b111, PERTURB_SHIFT=5):
    perturb = hash(key) 
    i = perturb & mask
    yield i
    while True:
        perturb >>= PERTURB_SHIFT
        i = (i * 5 + perturb + 1) & mask
        yield i

: hash returns an integer, while the actual C code in CPython uses an unsigned integer. Because of that, this pseudocode doesn’t replicate exactly the behavior in CPython; however, it is a good approximation.

This probing is a modification of the naive method of linear probing. In linear probing, we simply yield the values i = (i * 5 + perturb + 1) & mask, where i is initialized to the hash value of the key.⁴ An important thing to note is that linear probing deals only with the last several bits of the hash and disregards the rest (i.e., for a dictionary with eight elements, we look only at the last three bits since at that point the mask is 0x111). This means that if hashing two items gives the same last three binary digits, we will not only have a collision, but also the sequence of probed indices will be the same. The perturbed scheme that Python uses will start taking into consideration more bits from the items’ hashes to resolve this problem.

A similar procedure is done when we are performing lookups on a specific key: the given key is transformed into an index, and that index is examined. If the key in that index matches (recall that we also store the original key when doing insert operations), then we can return that value. If it doesn’t, we keep creating new indices using the same scheme, until we either find the data or hit an empty bucket. If we hit an empty bucket, we can conclude that the data does not exist in the table.

Figure 4-1 illustrates the process of adding data into a hash table. Here, we chose to create a hash function that simply uses the first letter of the input. We accomplish this by using Python’s ord function on the first letter of the input to get the integer representation of that letter (recall that hash functions must return integers). As we’ll see in “Hash Functions and Entropy”, Python provides hashing functions for most of its types, so typically you won’t have to provide one yourself except in extreme situations.

Insertion of the key Barcelona causes a collision, and a new index is calculated using the scheme in Example 4-4. This dictionary can also be created in Python using the code in Example 4-5.

Example 4-5. Custom hashing function

class City(str):
    def __hash__(self):
        return ord(self[0])

# We create a dictionary where we assign arbitrary values to cities
data =  {
    City("Rome"): 'Italy',
    City("San Francisco"): 'USA',
    City("New York"): 'USA',
    City("Barcelona"): 'Spain',
}

In this case, Barcelona and Rome cause the hash collision (Figure 4-1 shows the outcome of this insertion). We see this because, for a dictionary with four elements, we have a mask value of 0b111. As a result, Barcelona and Rome will try to use the same index:

hash("Barcelona") = ord("B") & 0b111
                  = 66 & 0b111
                  = 0b1000010 & 0b111
                  = 0b010 = 2

hash("Rome") = ord("R") & 0b111
             = 82 & 0b111
             = 0b1010010 & 0b111
             = 0b010 = 2

Exercise

Work through the following problems. A discussion of hash collisions follows:

Finding an element—Using the dictionary created in Example 4-5, what would a lookup on the key Johannesburg look like? What indices would be checked?
Deleting an element—Using the dictionary created in Example 4-5, how would you handle the deletion of the key Rome? How would subsequent lookups for the keys Rome and Barcelona be handled?
Hash collisions—Considering the dictionary created in Example 4-5, how many hash collisions could you expect if 500 cities, with names all starting with an uppercase letter, were added into a hash table? How about 1,000 cities? Can you think of a way of lowering the number of collisions?

For 500 cities, there would be approximately 474 dictionary elements that collided with a previous value (500 – 26), with each hash having 500 / 26 = 19.2 cities associated with it. For 1,000 cities, 974 elements would collide, and each hash would have 1,000 / 26 = 38.4 cities associated with it. This is because the hash is based simply on the numerical value of the first letter, which can take only a value from A–Z, allowing for only 26 independent hash values. This means that a lookup in this table could require as many as 38 subsequent lookups to find the correct value. To fix this, we must increase the number of possible hash values by considering other aspects of the city in the hash. The default hash function on a string considers every character in order to maximize the number of possible values. See “Hash Functions and Entropy” for more explanation.

Deletion

When a value is deleted from a hash table, we cannot simply write a NULL to that bucket of memory. This is because we have used NULLs as a sentinel value while probing for hash collisions. As a result, we must write a special value that signifies that the bucket is empty, but there still may be values after it to consider when resolving a hash collision. So if “Rome” was deleted from the dictionary, subsequent lookups for “Barcelona” will first see this sentinel value where “Rome” used to be and instead of stopping, continue to check the next indices given by the index_sequence. These empty slots can be written to in the future and are removed when the hash table is resized.

Resizing

As more items are inserted into the hash table, the table itself must be resized to accommodate them. It can be shown that a table that is no more than two-thirds full will have optimal space savings while still having a good bound on the number of collisions to expect. Thus, when a table reaches this critical point, it is grown. To do this, a larger table is allocated (i.e., more buckets in memory are reserved), the mask is adjusted to fit the new table, and all elements of the old table are reinserted into the new one. This requires recomputing indices, since the changed mask will change the resulting index. As a result, resizing large hash tables can be quite expensive! However, since we do this resizing operation only when the table is too small, as opposed to doing it on every insert, the amortized cost of an insert is still O(1).⁵

By default, the smallest size of a dictionary or set is 8 (that is, if you are storing only three values, Python will still allocate eight elements), and it will resize by 3× if the dictionary is more than two-thirds full. So once the sixth item is being inserted into the originally empty dictionary, it will be resized to hold 18 elements. At this point, once the 13th element is inserted into the object, it will be resized to 39, then 81, and so on, always increasing the size by 3× (we will explain how to calculate a dictionary’s size in “Hash Functions and Entropy”). This gives the following number of values which trigger a dictionary resize:

6; 18; 39; 81; 165; 333; 669; 1,341; 2,685; 5,373; 10,749; 21,501; 43,005; ...

It is important to note that resizing can happen to make a hash table larger or smaller. That is, if sufficiently many elements of a hash table are deleted, the table can be scaled down in size. However, resizing happens only during an insert.

Hash Functions and Entropy

Objects in Python are generally hashable, since they already have built-in __hash__ and __cmp__ functions associated with them. For numerical types (int and float), the hash is based simply on the bit value of the number they represent. Tuples and strings have a hash value that is based on their contents, so a tuple will be unhashable if it contains unhashable items. Lists, on the other hand, do not support hashing because their values can change. Since a list’s values can change, so could the hash that represents the list, which would change the relative placement of that key in the hash table.⁶

User-defined classes also have default hash and comparison functions. The default __hash__ function simply returns the object’s placement in memory as given by the built-in id function. Similarly, the __cmp__ operator compares the numerical value of the object’s placement in memory.

This is generally acceptable, since two instances of a class are generally different and should not collide in a hash table. However, in some cases we would like to use set or dict objects to disambiguate between items. Take the following class definition:

class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

If we were to instantiate multiple Point objects with the same values for x and y, they would all be independent objects in memory and thus have different placements in memory, which would give them all different hash values. This means that putting them all into a set would result in all of them having individual entries:

>>> p1 = Point(1,1)
>>> p2 = Point(1,1)
>>> set([p1, p2])
set([<__main__.Point at 0x1099bfc90>, <__main__.Point at 0x1099bfbd0>])
>>> Point(1,1) in set([p1, p2])
False

We can remedy this by forming a custom hash function that is based on the actual contents of the object as opposed to the object’s placement in memory. The hash function can be any function as long as it consistently gives the same result for the same object (there are also considerations regarding the entropy of the hashing function, which we will discuss later.) The following redefinition of the Point class will yield the results we expect:

class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __hash__(self):
        return hash((self.x, self.y))

    def __eq__(self, other):
        return self.x == other.x and self.y == other.y

This allows us to create entries in a set or dictionary indexed by the properties of the Point object rather than the memory address of the instantiated object:

>>> p1 = Point(1,1)
>>> p2 = Point(1,1)
>>> set([p1, p2])
set([<__main__.Point at 0x109b95910>])
>>> Point(1, 1) in set([p1, p2])
True

As alluded to when we discussed hash collisions, a custom-selected hash function should be careful to evenly distribute hash values in order to avoid collisions. Having many collisions will degrade the performance of a hash table: if most keys have collisions, we need to constantly “probe” the other values, effectively walking a potentially large portion of the dictionary to find the key in question. In the worst case, when all keys in a dictionary collide, the performance of lookups in the dictionary is O(n) and thus the same as if we were searching through a list.

If we know that we are storing 5,000 values in a dictionary and we need to create a hashing function for the object we wish to use as a key, we must be aware that the dictionary will be stored in a hash table of size 16,384⁷ and thus only the last 14 bits of our hash are being used to create an index (for a hash table of this size, the mask is bin(16_384 - 1) = 0b11111111111111).

This idea of “how well distributed my hash function is” is called the entropy of the hash function. Entropy is defined as

S = - \sum_{i} p (i) \cdot log (p (i))

where p(i) is the probability that the hash function gives hash i. It is maximized when every hash value has equal probability of being chosen. A hash function that maximizes entropy is called an ideal hash function since it guarantees the minimal number of collisions.

For an infinitely large dictionary, the hash function used for integers is ideal. This is because the hash value for an integer is simply the integer itself! For an infinitely large dictionary, the mask value is infinite, and thus we consider all bits in the hash value. Therefore, given any two numbers, we can guarantee that their hash values will not be the same.

However, if we made this dictionary finite, we could no longer have this guarantee. For example, for a dictionary with four elements and thus size 8, the mask we use is 0b111. Thus the hash value for the number 5 is 5 & 0b111 = 5, and the hash value for 501 is 501 & 0b111 = 5, and so their entries will collide.

Note

To find the mask for a dictionary with an arbitrary number of elements, N, we first find the minimum number of buckets that dictionary must have to still be two-thirds full (N * (2 / 3 + 1)). Then we find the smallest dictionary size that will hold this number of elements (8; 32; 128; 512; 2,048; etc.) and find the number of bits necessary to hold this number. For example, if N=1039, then we must have at least 1,731 buckets, which means we need a dictionary with 2,048 buckets. Thus the mask is bin(2048 - 1) = 0b11111111111.

There is no single best hash function to use when using a finite dictionary. However, knowing up front what range of values will be used and how large the dictionary will be helps in making a good selection. For example, if we are storing all 676 combinations of two lowercase letters as keys in a dictionary (aa, ab, ac, etc.), a good hashing function would be the one shown in Example 4-6.

Example 4-6. Optimal two-letter hashing function

def twoletter_hash(key):
    offset = ord('a')
    k1, k2 = key
    return (ord(k2) - offset) + 26 * (ord(k1) - offset)

This gives no hash collisions for any combination of two lowercase letters, considering a mask of 0b1111111111 (a dictionary of 676 values will be held in a hash table of length 2,048, which has a mask of bin(2048 - 1) = 0b11111111111).

Example 4-7 very explicitly shows the ramifications of having a bad hashing function for a user-defined class—here, the cost of a bad hash function (in fact, it is the worst possible hash function!) is a 41.8× slowdown of lookups.

Example 4-7. Timing differences between good and bad hashing functions

import string
import timeit

class BadHash(str):
    def __hash__(self):
        return 42

class GoodHash(str):
    def __hash__(self):
        """
        This is a slightly optimized version of twoletter_hash
        """
        return ord(self[1]) + 26 * ord(self[0]) - 2619

baddict = set()
gooddict = set()
for i in string.ascii_lowercase:
    for j in string.ascii_lowercase:
        key = i + j
        baddict.add(BadHash(key))
        gooddict.add(GoodHash(key))

badtime = timeit.repeat(
    "key in baddict",
    setup = "from __main__ import baddict, BadHash; key = BadHash('zz')",
    repeat = 3,
    number = 1_000_000,
)
goodtime = timeit.repeat(
    "key in gooddict",
    setup = "from __main__ import gooddict, GoodHash; key = GoodHash('zz')",
    repeat = 3,
    number = 1_000_000,
)

print(f"Min lookup time for baddict: {min(badtime)}")
print(f"Min lookup time for gooddict: {min(goodtime)}")

# Results:
#   Min lookup time for baddict: 17.719061855008476
#   Min lookup time for gooddict: 0.42408075400453527

Dictionaries and Namespaces

Doing a lookup on a dictionary is fast; however, doing it unnecessarily will slow down your code, just as any extraneous lines will. One area where this surfaces is in Python’s namespace management, which heavily uses dictionaries to do its lookups.

Whenever a variable, function, or module is invoked in Python, there is a hierarchy that determines where it looks for these objects. First, Python looks inside the locals() array, which has entries for all local variables. Python works hard to make local variable lookups fast, and this is the only part of the chain that doesn’t require a dictionary lookup. If it doesn’t exist there, the globals() dictionary is searched. Finally, if the object isn’t found there, the __builtin__ object is searched. It is important to note that while locals() and globals() are explicitly dictionaries and __builtin__ is technically a module object, when searching __builtin__ for a given property, we are just doing a dictionary lookup inside its locals() map (this is the case for all module objects and class objects!).

To make this clearer, let’s look at a simple example of calling functions that are defined in different scopes (Example 4-8). We can disassemble the functions with the dis module (Example 4-9) to get a better understanding of how these namespace lookups are happening (see “Using the dis Module to Examine CPython Bytecode”).

Example 4-8. Namespace lookups

import math
from math import sin

def test1(x):
    """
    >>> %timeit test1(123_456)
    162 µs ± 3.82 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    """
    res = 1
    for _ in range(1000):
        res += math.sin(x)
    return res

def test2(x):
    """
    >>> %timeit test2(123_456)
    124 µs ± 6.77 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    """
    res = 1
    for _ in range(1000):
        res += sin(x)
    return res

def test3(x, sin=math.sin):
    """
    >>> %timeit test3(123_456)
    105 µs ± 3.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    """
    res = 1
    for _ in range(1000):
        res += sin(x)
    return res

Example 4-9. Namespace lookups disassembled

>>> dis.dis(test1)
    ...cut..
             20 LOAD_GLOBAL              1 (math)
             22 LOAD_METHOD              2 (sin)
             24 LOAD_FAST                0 (x)
             26 CALL_METHOD              1
    ...cut..

>>> dis.dis(test2)
    ...cut...
             20 LOAD_GLOBAL              1 (sin)
             22 LOAD_FAST                0 (x)
             24 CALL_FUNCTION            1
    ...cut...

>>> dis.dis(test3)
    ...cut...
             20 LOAD_FAST                1 (sin)
             22 LOAD_FAST                0 (x)
             24 CALL_FUNCTION            1
    ...cut...

The first function, test1, makes the call to sin by explicitly looking at the math library. This is also evident in the bytecode that is produced: first a reference to the math module must be loaded, and then we do an attribute lookup on this module until we finally have a reference to the sin function. This is done through two dictionary lookups: one to find the math module and one to find the sin function within the module.

On the other hand, test2 explicitly imports the sin function from the math module, and the function is then directly accessible within the global namespace. This means we can avoid the lookup of the math module and the subsequent attribute lookup. However, we still must find the sin function within the global namespace. This is yet another reason to be explicit about what functions you are importing from a module. This practice not only makes code more readable, because the reader knows exactly what functionality is required from external sources, but it also simplifies changing the implementation of specific functions and generally speeds up code!

Finally, test3 defines the sin function as a keyword argument, with its default value being a reference to the sin function within the math module. While we still do need to find a reference to this function within the module, this is necessary only when the test3 function is first defined. After this, the reference to the sin function is stored within the function definition as a local variable in the form of a default keyword argument. As mentioned previously, local variables do not need a dictionary lookup to be found; they are stored in a very slim array that has very fast lookup times. Because of this, finding the function is quite fast!

While these effects are an interesting result of the way namespaces in Python are managed, test3 is definitely not “Pythonic.” Luckily, these extra dictionary lookups start to degrade performance only when they are called a lot (i.e., in the innermost block of a very fast loop, such as in the Julia set example). With this in mind, a more readable solution would be to set a local variable with the global reference before the loop is started. We’ll still have to do the global lookup once whenever the function is called, but all the calls to that function in the loop will be made faster. This speaks to the fact that even minute slowdowns in code can be amplified if that code is being run millions of times. Even though a dictionary lookup itself may take only several hundred nanoseconds, if we are looping millions of times over this lookup, those nanoseconds can quickly add up.

Note

A message about microbenchmarks: it may seem confusing that in Example 4-8 we add in extra work with the for loop and the modification to the res variable. Originally, each of these functions simply had the relevant return sin(x) line and nothing else. As a result, we were also getting nanosecond runtimes and results that did not make any sense!

When we added a bigger workload within each function, as done through the loop and the modification of the res variable, we started seeing the results we expected. With a bigger workload inside the function, we can be more sure that we are not measuring overhead from the benchmarking/timing process. In general, when you are running benchmarks and have a difference in timing in the nanoseconds, it’s important to sit back for a second and think through whether the experiment you are running is valid or whether you are measuring noise or unrelated timings as a result of instrumentation.

Wrap-Up

Dictionaries and sets provide a fantastic way to store data that can be indexed by a key. The way this key is used, through the hashing function, can greatly affect the resulting performance of the data structure. Furthermore, understanding how dictionaries work gives you a better understanding not only of how to organize your data but also of how to organize your code, since dictionaries are an intrinsic part of Python’s internal functionality.

In the next chapter we will explore generators, which allow us to provide data to code with more control over ordering and without having to store full datasets in memory beforehand. This lets us sidestep many of the possible hurdles that we might encounter when using any of Python’s intrinsic data structures.

¹ As we will discuss in “Hash Functions and Entropy”, dictionaries and sets are very dependent on their hash functions. If the hash function for a particular datatype is not O(1), any dictionary or set containing that type will no longer have its O(1) guarantee.

² A mask is a binary number that truncates the value of a number. So 0b1111101 & 0b111 = 0b101 = 5 represents the operation of 0b111 masking the number 0b1111101. This operation can also be thought of as taking a certain number of the least-significant digits of a number.

³ The discussion that led to this improvement can be found at https://oreil.ly/Pq7Lm.

⁴ The value of 5 comes from the properties of a linear congruential generator (LCG), which is used in generating random numbers.

⁵ Amortized analysis looks at the average complexity of an algorithm. This means that some inserts will be much more expensive, but on average, inserts will be O(1).

⁶ More information about this can be found at https://oreil.ly/g4I5-.

⁷ 5,000 values need a dictionary that has at least 8,333 buckets. The first available size that can fit this many elements is 16,384.

Get High Performance Python, 2nd Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.

Start your free trial

High Performance Python, 2nd Edition by Micha Gorelick, Ian Ozsvald

Chapter 4. Dictionaries and Sets

Note

Example 4-1. Phone book lookup with a list

Note

Example 4-2. Phone book lookup with a dictionary

Tip

Example 4-3. Finding unique names with lists and sets

How Do Dictionaries and Sets Work?

Inserting and Retrieving

Example 4-4. Dictionary lookup sequence

Figure 4-1. The resulting hash table from inserting with collisions

Example 4-5. Custom hashing function

Deletion

Resizing

Hash Functions and Entropy

Note

Example 4-6. Optimal two-letter hashing function

Example 4-7. Timing differences between good and bad hashing functions

Dictionaries and Namespaces

Example 4-8. Namespace lookups

Example 4-9. Namespace lookups disassembled

Note

Wrap-Up

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly