Chapter 6. Reproducibility Design Patterns
Software best practices such as unit testing assume that if we run a piece of code, it produces deterministic output:
import unittest

import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

class TestSigmoid(unittest.TestCase):
    def test_zero(self):
        self.assertAlmostEqual(sigmoid(0), 0.5)

    def test_neginf(self):
        self.assertAlmostEqual(sigmoid(float("-inf")), 0)

    def test_inf(self):
        self.assertAlmostEqual(sigmoid(float("inf")), 1)
This sort of reproducibility is difficult in machine learning. During training, machine learning models are initialized with random values and then adjusted based on training data. Even the simple k-means algorithm implemented by scikit-learn requires setting random_state to ensure that the algorithm returns the same results each time:
def cluster_kmeans(X):
    from sklearn import cluster
    k_means = cluster.KMeans(n_clusters=10, random_state=10)
    labels = k_means.fit(X).labels_[::]
    return labels
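To see why the seed matters, it can help to look at a stripped-down version of the same idea. The following is a minimal sketch, not scikit-learn's implementation: a tiny one-dimensional k-means written with only the standard library, where the function name kmeans_1d, its parameters, and the sample points are all made up for illustration. The only source of randomness is the choice of initial centroids, so passing the same seed yields the same labels on every run.

```python
import random

def kmeans_1d(points, k, seed, n_iter=10):
    """Tiny 1-D k-means (illustrative only); centroids start from a seeded RNG."""
    rng = random.Random(seed)          # private RNG, global random state is untouched
    centroids = rng.sample(points, k)  # random initialization, fixed by the seed
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for each point.
        labels = [min(range(k), key=lambda j: abs(p - centroids[j])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = sum(members) / len(members)
    return labels

points = [0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 9.8, 9.9, 10.0]
run_a = kmeans_1d(points, k=3, seed=10)
run_b = kmeans_1d(points, k=3, seed=10)
assert run_a == run_b  # same seed, same initialization, same labels
```

Using a dedicated random.Random instance rather than the module-level functions keeps the seed local to the function, so other code that draws random numbers cannot disturb the result.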
Beyond the random seed, there are many other artifacts that need to be fixed in order to ensure reproducibility during training. In addition, machine learning consists of different stages, such as training, deployment, and retraining. It is often important that some things are reproducible across these stages as well.
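One way to make "fixing" such artifacts concrete is to record them alongside each training run. The sketch below is only an illustration under stated assumptions: the choice of what to record (seed, hyperparameters, and a hash of the training data) is ours rather than a list taken from the text, and the helper names fingerprint and run_manifest are hypothetical.

```python
import hashlib
import json

def fingerprint(obj):
    """Return a SHA-256 hex digest of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def run_manifest(train_rows, hyperparams, seed):
    """Collect what a later run would need to match (hypothetical field names)."""
    return {
        "seed": seed,
        "hyperparams": hyperparams,
        "data_sha256": fingerprint(train_rows),
    }

rows = [[1.0, 2.0], [3.0, 4.0]]
m1 = run_manifest(rows, {"n_clusters": 10}, seed=10)
m2 = run_manifest(rows, {"n_clusters": 10}, seed=10)
m3 = run_manifest(rows + [[5.0, 6.0]], {"n_clusters": 10}, seed=10)

assert m1 == m2                                # identical inputs give an identical manifest
assert m1["data_sha256"] != m3["data_sha256"]  # changed data is detected
```

Comparing manifests from two runs gives a quick check on whether they started from the same inputs before anyone compares their outputs.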
In this chapter, we’ll look at design patterns that address different aspects of reproducibility. The Transform design pattern captures data preparation dependencies from the model training pipeline to reproduce them during serving. ...