As an example, we will look at a dataset consisting of anonymous transactions at a supermarket in Belgium. This dataset was made available by Tom Brijs at Hasselt University. Because of privacy concerns, the data has been anonymized, so we only have a number for each product, and each basket therefore consists of a set of numbers. The data file is available from several online sources (including this book's companion website).
We begin by loading the dataset and looking at some statistics (this is always a good idea):
from collections import defaultdict
from itertools import chain

# File is downloaded as a compressed file
import gzip

# file format is a line per transaction
# of the form '12 34 342 5...'
...
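To make the file format concrete, here is a minimal sketch of how such a file could be parsed and summarized. It is not the book's elided code: the helper names `load_transactions` and `product_counts` are hypothetical, and the three-line sample is made up so that the example runs without the real data file.

```python
import gzip
import io
from collections import defaultdict
from itertools import chain


def load_transactions(fileobj):
    """Parse one transaction per line; each line is a space-separated list of product ids."""
    return [[int(tok) for tok in line.split()] for line in fileobj if line.strip()]


def product_counts(dataset):
    """Count how many times each product id appears across all transactions."""
    counts = defaultdict(int)
    for elem in chain.from_iterable(dataset):
        counts[elem] += 1
    return counts


# Made-up sample standing in for the real compressed file (assumption).
sample = b"12 34 342 5\n12 5\n34\n"
compressed = gzip.compress(sample)

# gzip.open accepts a file object as well as a filename; "rt" reads text lines.
with gzip.open(io.BytesIO(compressed), "rt") as f:
    dataset = load_transactions(f)

counts = product_counts(dataset)
print(len(dataset), "transactions,", len(counts), "distinct products")
```

With the real data, the in-memory buffer would be replaced by the path to the downloaded compressed file; the rest of the sketch would stay the same.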