Chapter 6. Reproducibility Design Patterns

Software best practices such as unit testing assume that if we run a piece of code, it produces deterministic output:

import unittest

import numpy as np

def sigmoid(x):
    # Logistic function: maps any real input into the range (0, 1)
    return 1.0 / (1 + np.exp(-x))

class TestSigmoid(unittest.TestCase):
    def test_zero(self):
        self.assertAlmostEqual(sigmoid(0), 0.5)

    def test_neginf(self):
        self.assertAlmostEqual(sigmoid(float("-inf")), 0)

    def test_inf(self):
        self.assertAlmostEqual(sigmoid(float("inf")), 1)

This sort of reproducibility is difficult in machine learning. During training, machine learning models are initialized with random values and then adjusted based on training data, so two runs of the same code can produce different models. Even a simple k-means algorithm, as implemented by scikit-learn, requires setting random_state to ensure that the algorithm returns the same results each time:

from sklearn import cluster

def cluster_kmeans(X):
    # A fixed random_state makes the centroid initialization repeatable
    k_means = cluster.KMeans(n_clusters=10, random_state=10)
    return k_means.fit(X).labels_
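
The effect of random_state can be checked directly. The following is a minimal sketch, assuming synthetic data; the helper name cluster_kmeans_seeded and the explicit n_init value are illustrative choices, not part of the original example:

```python
import numpy as np
from sklearn import cluster

def cluster_kmeans_seeded(X, seed):
    # n_init is passed explicitly because its default differs across
    # scikit-learn versions (an assumption made for this sketch).
    k_means = cluster.KMeans(n_clusters=10, random_state=seed, n_init=10)
    return k_means.fit(X).labels_

# Synthetic data, made up purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))

# Two independent fits with the same seed on the same data.
labels_a = cluster_kmeans_seeded(X, seed=10)
labels_b = cluster_kmeans_seeded(X, seed=10)
same_seed_matches = np.array_equal(labels_a, labels_b)
```

With the seed held fixed, both fits should assign every point to the same cluster, which is what makes the result testable in the same way as sigmoid above.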

Beyond the random seed, there are many other artifacts that need to be fixed in order to ensure reproducibility during training. In addition, machine learning consists of different stages, such as training, deployment, and retraining. It is often important that some things are reproducible across these stages as well.
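
The paragraph above does not list these artifacts. As one hedged illustration, a training job might fix its seeds and also record the library versions and a hash of the training data alongside the model. The function names and manifest fields below are hypothetical, chosen only for this sketch:

```python
import hashlib
import json
import platform
import random

import numpy as np

def fix_seeds(seed):
    # Seed the generators this sketch uses. A deep learning framework
    # would have its own seed to set as well.
    random.seed(seed)
    np.random.seed(seed)

def training_manifest(X, seed):
    """Record items that may need to match in order to reproduce a run."""
    data_hash = hashlib.sha256(np.ascontiguousarray(X).tobytes()).hexdigest()
    return {
        "seed": seed,
        "python_version": platform.python_version(),
        "numpy_version": np.__version__,
        "data_sha256": data_hash,
    }

fix_seeds(10)
X = np.random.normal(size=(100, 3))  # stand-in for real training data
manifest = training_manifest(X, seed=10)
manifest_json = json.dumps(manifest, sort_keys=True)
```

Comparing two such manifests is a cheap way to find out whether a retraining run started from the same seed, environment, and data as the original.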

In this chapter, we’ll look at design patterns that address different aspects of reproducibility. The Transform design pattern captures data preparation dependencies from the model training pipeline to reproduce them during serving. ...
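
To make the idea behind Transform concrete, here is a rough sketch that uses a scikit-learn Pipeline as a stand-in. This is an assumption for illustration rather than the chapter's own implementation; the data, step names, and the use of pickle are all made up:

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data on a raw, unscaled range.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
y_train = (X_train[:, 0] > 50.0).astype(int)

# The scaler is fitted together with the classifier, so its learned
# mean and variance are stored as part of the model artifact.
model = Pipeline([
    ("scale", StandardScaler()),
    ("classify", LogisticRegression()),
])
model.fit(X_train, y_train)

# Simulate deployment: serialize the whole pipeline, then reload it.
serialized = pickle.dumps(model)
served_model = pickle.loads(serialized)

# Serving code passes raw inputs and does not reimplement the scaling.
raw_request = np.array([[65.0, 48.0]])
train_time_pred = model.predict(raw_request)
serve_time_pred = served_model.predict(raw_request)
```

Because the preprocessing travels with the model, the serving side cannot drift from the transformations used during training.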
