Chapter 6. Reproducibility Design Patterns
Software best practices such as unit testing assume that if we run a piece of code, it produces deterministic output:
    import unittest

    import numpy as np


    def sigmoid(x):
        return 1.0 / (1 + np.exp(-x))


    class TestSigmoid(unittest.TestCase):
        def test_zero(self):
            self.assertAlmostEqual(sigmoid(0), 0.5)

        def test_neginf(self):
            self.assertAlmostEqual(sigmoid(float("-inf")), 0)

        def test_inf(self):
            self.assertAlmostEqual(sigmoid(float("inf")), 1)
This sort of reproducibility is difficult in machine learning. During training, machine learning models are initialized with random values and then adjusted based on training data. Even the simple k-means algorithm implemented in scikit-learn requires setting random_state to ensure that it returns the same results each time:
    def cluster_kmeans(X):
        from sklearn import cluster
        k_means = cluster.KMeans(n_clusters=10, random_state=10)
        labels = k_means.fit(X).labels_[::]
        return labels
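To see why a fixed seed makes the result repeatable, it can help to look at a stripped-down version of the algorithm. The following is a minimal sketch in plain Python, not scikit-learn's implementation: the function name kmeans_labels, the one-dimensional sample points, and the fixed iteration count are all assumptions made for illustration. The only randomness is the choice of initial centroids, so seeding that one generator makes two runs agree.

```python
import random


def kmeans_labels(points, k, seed, n_iter=10):
    """Toy k-means on 1-D points; returns one cluster label per point.

    Hypothetical helper for illustration, not scikit-learn's algorithm.
    """
    rng = random.Random(seed)          # local RNG: the only source of randomness
    centroids = rng.sample(points, k)  # random initialization of the centroids
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: give each point the index of its nearest centroid.
        labels = [
            min(range(k), key=lambda j: abs(p - centroids[j])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = sum(members) / len(members)
    return labels


points = [0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 9.8, 9.9, 10.0]
run_a = kmeans_labels(points, k=3, seed=10)
run_b = kmeans_labels(points, k=3, seed=10)
print(run_a == run_b)  # True: same seed, same initialization, same labels
```

With a different seed, the initial centroids may differ, so the cluster numbering (and sometimes the clusters themselves) can change from run to run. That is the behavior random_state is meant to pin down in the scikit-learn call above.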
Beyond the random seed, there are many other artifacts that need to be fixed in order to ensure reproducibility during training. In addition, machine learning consists of different stages, such as training, deployment, and retraining. It is often important that some things are reproducible across these stages as well.
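The text here does not enumerate these artifacts, so the following is only an illustrative sketch of one way to make them explicit: recording what a training run depended on in a small manifest. The function names fingerprint and build_manifest, the manifest fields, and the inline sample data are assumptions for this example, not an established API or the patterns described later in the chapter.

```python
import hashlib
import json
import platform


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw training data."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(train_bytes: bytes, hyperparams: dict, seed: int) -> str:
    """Serialize what a training run depended on as canonical JSON.

    Hypothetical helper; the chosen fields are an assumption for illustration.
    """
    manifest = {
        "data_sha256": fingerprint(train_bytes),
        "hyperparams": hyperparams,
        "seed": seed,
        "python_version": platform.python_version(),
    }
    # sort_keys makes the JSON text itself stable across runs.
    return json.dumps(manifest, sort_keys=True)


train_bytes = b"x,y\n1,2\n3,4\n"  # stand-in for the contents of a training file
m1 = build_manifest(train_bytes, {"n_clusters": 10}, seed=10)
m2 = build_manifest(train_bytes, {"n_clusters": 10}, seed=10)
m3 = build_manifest(train_bytes + b"5,6\n", {"n_clusters": 10}, seed=10)
print(m1 == m2)  # True: identical inputs give an identical manifest
print(m1 != m3)  # True: changing the data changes the fingerprint
```

Comparing manifests from training and retraining is a cheap way to notice that an input changed, even when the seed did not.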
In this chapter, we’ll look at design patterns that address different aspects of reproducibility. The Transform design pattern captures data preparation dependencies from the model training pipeline to reproduce them during serving. ...