Chapter 1. Sharing Information Across Disciplines in the Enterprise
Programs are meant to be read by humans and only incidentally for computers to execute.
Donald Knuth
This chapter will introduce you to the challenges of communicating ideas across multidisciplinary teams. While teams often have much in common in terms of skills and objectives, they may be composed of people from vastly different educational and cultural backgrounds, who bring different perspectives to bear on the same problem. In these environments, it is important to share information in a clear and consistent way. Notebooks provide an excellent way to do this, as they combine live code with formatted text so that programmers, data scientists, and even nontechnical members of the team can understand what is happening with various elements of the code being used.
The Overlap Between Data Scientist and Data Engineer
The modern data scientist on an enterprise team often has an intellectual ancestry in the academic world. The standard workflow in academic research is to measure something, compare the result to the predicted one, and report the findings in a peer-reviewed environment. The assumption in this environment is that “if you didn’t publish it, it didn’t happen,” which places a very heavy emphasis on careful documentation of work as the measure of success. It is not enough, however, to document and present your findings. As a data scientist, you must also be prepared to defend your position and persuade skeptics. This process requires diligence and determination if your idea is to be embraced.
On the other hand, for modern enterprise developers and engineers working in a fast-paced environment, the emphasis is on delivering code that provides the functionality required for the company’s success. The process of reporting findings is not typically as highly valued, and documentation is often considered a necessary evil that only the more diligent developer is committed to maintaining. Tracking progress is tied more to performance measures for time management than to explaining your reasoning and design choices. Furthermore, an aesthetic of compactness and brevity is more highly valued in a mature codebase. This more terse style, however, may be more difficult to read without additional documentation or explanation.
How, then, do we reconcile the two approaches in a coherent way? The data scientist may have a question about an algorithm that could affect performance, and will want to run tests. How do these tests translate into useful code? How does the data scientist persuade the development team that these tests open a path to a useful solution?
Conversely, how can an engineer or developer explain some of the more elegant but difficult-to-read pieces of code to a data scientist without creating unnecessarily verbose descriptions in the codebase?
Finally, how can management figure out what on earth their team is up to, beyond using a ticketing system (such as JIRA or GitHub Issues)?
Enter the notebook.
How Notebooks Bridge the Gap
The notebook is simply a collection of cells that run small snippets of code interactively. It also has features that allow you to display images, plot graphs, and display formatted text. This simple format allows you to provide a narrative structure around your code that enables you to describe the thinking behind it, while also providing all the necessary machinery to run it and explore the output. A simple notebook is shown in Figure 1-1.
The notebook environment has its heritage in more academic tools, including R, MATLAB, and Mathematica. Jupyter Notebook, for example, most closely resembles the Mathematica environment. The intent of these environments is to provide a fully functional numerical engine that is exploratory in nature; graphical capabilities and variable exploration are fundamental to the design. In the early stages of a notebook, the environment offers a low barrier to entry and lets users quickly investigate functions, numerical properties, and other questions that can be cumbersome to explore in a debugger.
In a later phase of the notebook, the functionality shifts more decidedly from exploratory to expository. The notebook is no longer a scratch pad, but a means of communicating your idea, discovery, or algorithm to your coworkers. The graphs you were using to understand the data can now be used to make your case to the reader. The notebook can be presented as a static document or as a living piece of code that can be run and explored, even when connected to a cluster or multiple APIs.
Notebooks as a Medium of Communication
A well-written notebook or two in a large code repository provides a quick overview of what the code can do. A README.md file typically covers installation details, and running a "Hello World" example can verify the installation, but a notebook can give a deeper understanding of the code's capabilities through formatted text and figures. A person who wants to quickly run examples can review a notebook tutorial, run the code, and be up to speed in minutes. This has important implications for code adoption. It is often said that adoption depends on whether the user can run examples within the first few minutes of interacting with the code; if not, the user may lose interest and never return.
The notebook is designed to engage not only the newcomer who wishes to learn to use the code, but also someone who is nontechnical but is nonetheless involved in evaluating the codebase. For these audiences, the persuasive element of the notebook is more critical, as the important ideas are illustrated with examples, descriptions, and figures.
Finally, the notebook may be used simply to educate members of your team on a particular algorithm that was developed and implemented. As shown in Figure 1-2, notebooks can be both generated and consumed by various members of the team in order to share interesting aspects of their work with one another.
Example: Validating Statistical Functions and Developing Unit Tests
For this example, we will be writing and running Python code. The code can be run either at the command line or in a notebook environment. The notebook for this example is provided at https://github.com/nilmeier/DSatEnterpriseScale/blob/master/multinomialSampler.ipynb.
Imagine that you are a data scientist who works in a data center, and you have been asked to help write code that will simulate process failures on a large cluster of computers. Each machine in your cluster will have a number of processes running, and each process will have an estimated failure probability. To create a sampler that will sample a number of failures over a time period, you determine that you will need to sample from a multinomial distribution. You also want this sampling procedure to run at very large scale (in both the number of processes and the time duration) and to compute a large set of statistics.
The multinomial distribution is given as:

$$P(x_1, \ldots, x_k \mid n, p_1, \ldots, p_k) = \frac{n!}{x_1! \cdots x_k!} \, p_1^{x_1} \cdots p_k^{x_k}, \qquad \sum_{i=1}^{k} x_i = n$$

where x_i is the number of times outcome i (which occurs with probability p_i) is observed in n trials.
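As a concrete but purely hypothetical illustration of the data-center scenario (the process count, failure rates, and window sizes below are ours, not from any real system), a single machine's failures over a few time windows could be sketched like this:

import numpy as np

np.random.seed(42)

# hypothetical relative failure rates for the processes on one machine
failureRates = np.array([0.5, 1.0, 0.25, 0.25])
p = failureRates / failureRates.sum()   # normalize to multinomial probabilities

nFailures = 20   # total failures observed in one time window
nWindows = 3     # number of time windows to simulate

# each row shows how the nFailures distribute across the processes
failures = np.random.multinomial(nFailures, p, size=nWindows)
print(failures)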
After looking through the Apache Spark documentation, you learn that it is not implemented as one of the standard random number generators. Fortunately, it is not too difficult to implement (which may explain why it is not in the core library), and you can contribute to your team project by offering a multinomial sampler that can be used at scale.
While our example is somewhat simplified, it is very typical of a scale-up process. It is often the case that an algorithm that runs well at desktop scale will not be trivial to run at large scale. Fortunately, with a framework like Apache Spark, writing scalable code is not as difficult as you might expect.
Evaluating a Validated Desktop-Scale Function
Our example may have a real-world application, but we are going to use a more accessible example for the multinomial distribution: the six-sided die. The built-in function for generating samples from a multinomial distribution is np.random.multinomial. The method is easy to call, as shown here:
import numpy as np

nTrials = 500    # number of rolls per round
nRounds = 100    # number of rounds
np.random.seed(10)

numFaces = 6
p = [1/6.] * numFaces
s = np.random.multinomial(nTrials, p, size=nRounds)
If you were running this at your terminal, you could verify that the first few samples are given as:
>> print(s[1:5])
[[70 86 92 91 65 96]
 [81 75 78 81 91 94]
 [78 91 91 89 73 78]
 [93 77 80 81 80 89]]
This is a pseudorandom sequence. It can be repeated if the same random seed is used. We can only truly understand the properties of this sampler, however, by taking many samples and observing the convergent behavior. How many rolls of the die does it take to converge to the expected roll frequency? What are the error bars? Sometimes these questions can be answered with theories, but they are often best answered by generating many samples.
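As a quick illustration of that reproducibility (our own check, not part of the original notebook), resetting the seed to the same value and reusing the nTrials, p, and nRounds defined above returns an identical draw:

np.random.seed(10)
first = np.random.multinomial(nTrials, p, size=nRounds)

np.random.seed(10)
second = np.random.multinomial(nTrials, p, size=nRounds)

# identical, because the generator state was reset to the same seed
assert (first == second).all()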
We can generate these samples in a notebook and produce a plot like Figure 1-3. This plot shows the number of counts for each die value rolled, normalized by the total number of counts. The error bars are the standard deviation of the counts accumulated at each round, over a total of nRounds rounds. Each round consists of nTrials trials, where an individual sample is drawn from the multinomial distribution.
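Figure 1-3 itself is produced in the accompanying notebook; a minimal matplotlib sketch in the same spirit, reusing the s, nTrials, and numFaces defined above (the styling choices here are ours, not the notebook's), might look like this:

import matplotlib.pyplot as plt

freq = s / float(nTrials)        # per-round frequency of each face
meanFreq = freq.mean(axis=0)     # average frequency over all rounds
stdFreq = freq.std(axis=0)       # round-to-round spread -> error bars

faces = np.arange(1, numFaces + 1)
plt.bar(faces, meanFreq, yerr=stdFreq, capsize=4)
plt.axhline(1/6., linestyle='--', color='gray')   # expected frequency for a fair die
plt.xlabel('die face')
plt.ylabel('normalized counts')
plt.show()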
The great thing about the notebook is that the plots and output data will stay cached in it for future reference; this means that we can refer to the static notebook without having to regenerate the data. For statistical sampling, this is particularly helpful, because we always rely on a large number of samples to improve our estimates.
Understanding the Logic to Be Used at Scale
We want to write our scalable function to improve our estimates. Let’s not be too hasty, however. We have a good understanding of the outputs of the NumPy built-in function, so let’s write our own function that returns values in the same way that the built-in function does. We can compare the outputs from the two functions to verify that our thinking is correct about how the sampler works. We don’t know exactly what this will look like in our scalable code (PySpark, Apache Spark’s Python API), but we do know that we will have a uniform random number generator at our disposal. So, let’s think about how we would write a multinomial sampler that leverages a uniform random number generator.
Referring to Figure 1-4, we can briefly describe the procedure. A random number is generated from a uniform random number generator. Each die face has some probability of being selected, so we compare our random number to the accumulated probability value computed for each die value to see which die face to select. The code for accomplishing this is as follows:
def multinomialLocal(nTrials, p, size):
    nRounds = size       # using a more descriptive variable
    numFaces = len(p)    # number of categories (die faces)
    xi = np.random.uniform(size=(nTrials, nRounds))

    # computing the cumulative probabilities (cdf)
    pcdf = np.zeros(numFaces)
    pcdf[0] = p[0]
    for i in range(1, numFaces):
        pcdf[i] = p[i] + pcdf[i - 1]

    s = np.zeros((nRounds, numFaces))
    for iTrial in range(nTrials):
        for jRound in range(nRounds):
            # select the first die face whose cumulative probability
            # exceeds the uniform random draw
            index = np.where(pcdf >= xi[iTrial, jRound])[0][0]
            s[jRound, index] += 1

    return s
While this is not too difficult to read, the intent of the function is much easier to grasp from the figure, which conveys the essential sampling algorithm far more clearly. The important line containing the codified version of the die face selection process is:
index = np.where(pcdf >= xi[iTrial, jRound])[0][0]
The remainder of the function adds up these trials and stores them in a row of values. The output for a single round is an array containing the count of the number of times each value (die face) is selected; this array should sum to the number of trials. As was the case with np.random.multinomial, the result is an array of length nRounds, with each round containing an array of length numFaces. We can now generate histogram plots and compare them to our NumPy function. We won't be able to compare specific numbers, but we will be able to verify that the statistics agree to within a sampling error (which will, of course, decrease for large sample sizes). In our notebook, we simply generate another histogram and verify visually that they are similar. In practice, there are many stricter methods for making this comparison more quantitative, which we will omit for the sake of brevity; one such check is sketched below.
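As one example of such a stricter check (a sketch of our own, not taken from the original notebook), SciPy's chi-square goodness-of-fit test can compare the pooled counts from each sampler against the counts expected for a fair die, reusing the nTrials, nRounds, p, and numFaces defined above:

from scipy import stats

np.random.seed(10)
sNumpy = np.random.multinomial(nTrials, p, size=nRounds)
sLocal = multinomialLocal(nTrials, p, size=nRounds)

# expected count per face for a fair die, pooled over all rounds
expected = np.full(numFaces, nTrials * nRounds / 6.)

for label, counts in [('numpy', sNumpy.sum(axis=0)),
                      ('local', sLocal.sum(axis=0))]:
    chi2, pval = stats.chisquare(counts, expected)
    # a large p-value is consistent with a fair six-sided die
    print(label, chi2, pval)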
Generating a Unit Test with a Smaller Sample
Now that we know that our function has the right statistical properties, we can generate a much smaller pseudorandom example. A pseudorandom sequence mimics a random sequence, but we can make it reproducible by setting the random seed to the same value each time the code is run. This short sequence can be generated quickly whenever the tests run and compared against a stored reference. We have already made a detailed study of the algorithm's properties in our notebook, so we can now simply generate a pseudorandom sequence of numbers and compare it to a known sequence to validate the function. The larger, more time-intensive study will reside in the notebook, with all of the supporting data and explanations (much in the way that scientific code will refer to journal articles that explain the code).
>> np.random.seed(10)
>> nTrialsUT = 2
>> nRoundsUT = 5
>> sLocal = multinomialLocal(nTrialsUT, p, size=nRoundsUT)
>> print(sLocal)
[[ 0.  0.  0.  1.  1.  0.]
 [ 0.  0.  0.  0.  1.  1.]
 [ 1.  1.  0.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  1.]
 [ 0.  0.  0.  0.  2.  0.]]
Writing the Scalable Code
Now we’ve got our multinomial sampler working with some of the logic more exposed to us. We can now refer to it when writing our function in PySpark. We will describe Apache Spark in detail in later chapters, but for now you can think of it as a framework for writing code at scale, which simply means that it can be run on a cluster with many machines. If we can correctly write this in PySpark, it can be scaled to arbitrarily large sample sizes with no change in the way the code is written. We can verify that it works correctly at desktop scale by comparing it to our other validated functions. Additional work may be needed to validate it at scale, but we will have made the most difficult leap already, which is from the desktop to the cluster.
import pyspark.mllib.random   # Spark's distributed random number generators

# Using the Spark random number generator and
# accumulating statistics for a single round
# (assumes a SparkContext named sc has already been created).
def countsForSingleRound(numFaces, nTrials, seed, pcdf):
    s = np.zeros(numFaces)

    # distributed uniform random draws, one per trial
    xi = pyspark.mllib.random.RandomRDDs.\
        uniformRDD(sc, nTrials, seed=seed)

    # map each uniform draw to a die face, then to a (face, 1) pair
    index = xi.map(lambda x: np.where(pcdf >= x)[0][0]) \
              .map(lambda x: (x, 1))
    indexCounts = \
        np.array(index.reduceByKey(lambda a, b: a + b).collect())

    # assigning counts to each location,
    # accounting for the possibility of zero counts in
    # any particular value
    for i in indexCounts:
        s[i[0]] = i[1]

    return s
We start by writing a function to do the counts for a single round, which leverages the uniform random number generator in Spark, pyspark.mllib.random.RandomRDDs.uniformRDD. From there, the essential sampling piece has a slightly different syntax than in NumPy, as shown here:
index = xi.map(lambda x: np.where(pcdf >= x)[0][0]) \
          .map(lambda x: (x, 1))
This code will generate an RDD (resilient distributed dataset) of key/value tuples. The first element is the index number (die face), and the second is a 1 that will be summed in a downstream counting step (this (key, 1) pattern is the same one used in word-count algorithms on distributed systems). Now we can wrap another function around this that samples over multiple rounds:
def multinomialSpark(nTrials, p, size):
    # setting the Spark seed from the NumPy seed state:
    sparkSeed = int(np.random.get_state()[1][0])
    nRounds = size
    numFaces = len(p)

    # computing the cumulative probability function
    pcdf = np.zeros(numFaces)
    pcdf[0] = p[0]
    for i in range(1, numFaces):
        pcdf[i] = p[i] + pcdf[i - 1]

    s = np.zeros((nRounds, numFaces))

    # assume that nRounds is of reasonable size
    # (nTrials can be very large).
    # This means that Spark data types won't be needed.
    for iRound in range(nRounds):
        # each round is assigned a deterministic, unique seed
        s[iRound, :] = countsForSingleRound(numFaces, nTrials,
                                            sparkSeed + iRound, pcdf)

    return s
Our histograms here should also look pretty much the same for sufficiently large sample sizes. Once we have verified this, we can generate a pseudorandom sequence for unit testing:
>> np.random.seed(10)
>> sSpark = multinomialSpark(nTrialsUT, p, size=nRoundsUT)
>> print(sSpark[0:5])
[[ 1.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  1.]
 [ 0.  2.  0.  0.  0.  0.]
 [ 0.  0.  0.  1.  1.  0.]
 [ 0.  1.  1.  0.  0.  0.]]
These outputs can now be used to define a unit test. Unit tests are used to verify that a function (or method, depending on how it is written) is producing the correct outputs. For large codebases, they are a fundamental component that allows the developer to make sure that newly added functionality does not break other pieces of code.
These tests should be added as new functions are incorporated. In many cases, you can even write the test beforehand and use it as a recommendation for writing the function by enforcing input and output types as well as the expected content. This approach to coding is referred to as test-driven development (TDD), and can be a very efficient way to assign coding tasks to a team.
At the very least, TDD can be a nice way to concretely express your idea to those who are considering it for production code. The unit tests for the two functions discussed are given as follows, with the reference outputs included as arrays. Notice that the random seed assignment is critical to the reproducibility of these tests.
import unittest

class TestMultinomialMethods(unittest.TestCase):
    # See
    # http://localhost:8888/notebooks/multinomialScratch.ipynb
    # for a detailed description
    nTrials = 2
    nRounds = 5

    def testMultinomialLocal(self):
        np.random.seed(10)
        p = [1/6.] * 6
        nTrials = 2
        nRounds = 5
        # reference data generated in notebook
        # (preferably a GitHub link)
        # http://localhost:8888/notebooks/multinomialScratch.ipynb
        # Numpy-Unit-Test-Data
        sLocalReference = np.array([[0., 1., 0., 0., 1., 0.],
                                    [1., 1., 0., 0., 0., 0.],
                                    [0., 0., 0., 1., 1., 0.],
                                    [0., 1., 0., 0., 1., 0.],
                                    [1., 0., 1., 0., 0., 0.]])
        sTest = multinomialLocal(nTrials, p, size=nRounds)
        np.testing.assert_array_equal(sTest[0:5], sLocalReference)

    def testMultinomialSpark(self):
        np.random.seed(10)
        p = [1/6.] * 6
        nTrials = 2
        nRounds = 5
        # reference data generated in notebook:
        # http://localhost:8888/notebooks/multinomialScratch.ipynb
        # Spark-Unit-Test-Data
        sSparkReference = np.array([[0., 0., 1., 0., 1., 0.],
                                    [0., 0., 1., 0., 1., 0.],
                                    [0., 0., 1., 0., 1., 0.],
                                    [0., 0., 1., 0., 1., 0.],
                                    [0., 0., 1., 0., 1., 0.]])
        sTest = multinomialSpark(nTrials, p, size=nRounds)
        np.testing.assert_array_equal(sTest[0:5], sSparkReference)
We can run the unit test as follows to see how it looks:
>> suite = unittest.TestLoader().\
...     loadTestsFromTestCase(TestMultinomialMethods)
>> unittest.TextTestRunner(verbosity=2).run(suite)
testMultinomialLocal (__main__.TestMultinomialMethods) ... ok
testMultinomialSpark (__main__.TestMultinomialMethods) ... ok
----------------------------------------------------------------------
Ran 2 tests in 0.846s

OK
Summary
The tests were a success, and so we can provide not only the function, but also a means for testing it. We now have a well-defined function that can be implemented and studied at scale. The development team can now review the notebook before it gets implemented in the codebase. From this point forward, we can focus on implementation issues rather than wonder if our algorithm is performing as expected.
In the next chapter, we will discuss the details of setting up your notebook environment. Once this is complete, you will be able to run all of the examples in the text and in the GitHub repository.