Chapter 1. Test-Driven Machine Learning

A great scientist is a dreamer and a skeptic. In modern history, scientists have made exceptional breakthroughs like discovering gravity, going to the moon, and producing the theory of relativity. All those scientists had something in common: they dreamt big. However, they didn’t accomplish their feats without testing and validating their work first.

Although we aren’t in the company of Einstein and Newton these days, we are in the age of big data. With the rise of the information age, it has become increasingly important to find ways to manipulate that data into something meaningful—which is precisely the goal of data science and machine learning.

Machine learning has been a subject of interest because of its ability to use information to solve complex problems like facial recognition or handwriting detection. Many times, machine learning algorithms do this by having tests baked in. Examples of these tests are formulating statistical hypotheses, establishing thresholds, and minimizing mean squared errors over time. Theoretically, machine learning algorithms have built a solid foundation. These algorithms have the ability to learn from past mistakes and minimize errors over time.

However, as humans, we don’t have the same rate of effectiveness. The algorithms are capable of minimizing errors, but sometimes we may not point them toward minimizing the right errors, or we may make errors in our own code. Therefore, we need tests for addressing human error, as well as a way to document our progress. The most popular way of writing these tests is called test-driven development (TDD). This method of writing tests first has been popularized as a best practice for programmers, though it is a best practice that is sometimes not exercised in a development environment.

There are two good reasons to use test-driven development. One reason is that while TDD takes 15–35% more time in active development mode, it can also reduce bugs by up to 90%. The second main reason to use TDD is for the benefit of documenting how the code is intended to work. As code becomes more complex, the need for a specification increases—especially as people are making bigger decisions based on what comes out of the analysis.

Harvard scholars Carmen Reinhart and Kenneth Rogoff wrote an economics paper stating that countries that took on debt of over 90% of their gross domestic product suffered sharp drops in economic growth. Paul Ryan cited this conclusion heavily during his vice presidential campaign. In 2013, three researchers from the University of Massachusetts Amherst found that the calculation was incorrect because it omitted a substantial number of countries from its analysis.

Some examples aren’t as drastic, but this case demonstrates the potential blow to one’s academic reputation due to a single error in the statistical analysis. One mistake can cascade into many more—and this is the work of Harvard researchers who have been through a rigorous process of peer review and have years of experience in research. It can happen to anybody. Using TDD would have helped to mitigate the risk of making such an error, and would have saved these researchers from the embarrassment.

History of Test-Driven Development

In 1999, Kent Beck popularized TDD through his work with extreme programming. TDD’s power comes from the ability to first define our intentions and then satisfy those intentions. The practice of TDD involves writing a failing test, writing the code that makes it pass, and then refactoring the original code. Some people call it “red-green-refactor” after the colors of many testing libraries. Red is writing a test that doesn’t work originally but documents what your goal is, while green involves making the code work so the test passes. Finally, you refactor the original code to work so that you are happy with its design.

Testing has always been a mainstay in the traditional development practice, but TDD emphasizes testing first instead of testing near the end of a development cycle. In a waterfall model, acceptance tests are used and involve many people—usually end users, not programmers—after the code is actually written. This approach seems good until coverage becomes a factor. Many times, quality assurance professionals test only what they want to test and don’t get to everything underneath the surface.

TDD and the Scientific Method

Part of the reason why TDD is so appealing is that it syncs well with people and their working style. The process of hypothesizing, testing, and theorizing makes it very similar to the scientific method.

Science involves trial and error. Scientists come up with a hypothesis, test that hypothesis, and then combine their hypotheses into a theory.


Hypothesize, test, and theorize could be called “red-green-refactor” instead.

Just as with the scientific method, writing tests first works well with machine learning code. Most machine learning practitioners apply some form of the scientific method, and TDD forces you to write cleaner and more stable code. Beyond its similarity to the scientific method, though, there are three other reasons why TDD is really just a subset of the scientific method: making a logical proposition of validity, sharing results through documentation, and working in feedback loops.

The beauty of test-driven development is that you can utilize it to experiment as well. Many times, we write tests first with the idea that we will eventually fix the error that is created by the initial test. But it doesn’t have to be that way: you can use tests to experiment with things that might not ever work. Using tests in this way is very useful for many problems that aren’t easily solvable.

TDD Makes a Logical Proposition of Validity

When scientists use the scientific method, they are trying to solve a problem and prove that it is valid. Solving a problem requires creative guessing, but without justification it is just a belief.

Knowledge, according to Plato, is justified true belief: we need both a true belief and a justification for it. To justify our beliefs, we need to construct a stable, logical proposition. In logic, there are two types of conditions to use for proposing whether something is true: necessary and sufficient conditions.

Necessary conditions are those without which our hypothesis fails. For example, a unanimous vote or a completed preflight checklist might be required before proceeding. The emphasis here is that all conditions must be satisfied to convince us that whatever we are testing is correct.

Sufficient conditions, unlike necessary conditions, mean that there is enough evidence for an argument. For instance, thunder is sufficient evidence that lightning has happened, because the two go together, but thunder isn’t necessary for lightning to happen. Many times sufficient conditions take the form of a statistical hypothesis. It might not be perfect, but it is sufficient to support what we are testing.

Together, necessary and sufficient conditions are what scientists use to make an argument for the validity of their solutions. Both the scientific method and TDD use these religiously to make a set of arguments come together in a cohesive way. However, while the scientific method uses hypothesis testing and axioms, TDD uses integration and unit tests (see Table 1-1).

Table 1-1. A comparison of TDD to the scientific method

                          Scientific method                 TDD

  Necessary conditions    Axioms                            Pure functional testing

  Sufficient conditions   Statistical hypothesis testing    Unit and integration testing

Example: Proof through axioms and functional tests

Fermat famously conjectured in 1637 that “no three positive integers a, b, and c can satisfy the equation a^n + b^n = c^n for any integer value of n greater than two.” On the surface, this appears to be a simple problem, and supposedly Fermat himself said he had a proof—except the proof was too big for the margin of the book he was working in.

Mathematicians toiled over this problem for 358 years. In 1995, Andrew Wiles solved it using Galois representations and elliptic curves. His 100-page proof was not elegant, but it was sound: each section took a previous result and applied it to the next step.

The 100 pages of proof were built on axioms and presumptions that had been proved before, much like the layers of a functional testing suite. In programming terms, all of the axioms and assertions that Andrew Wiles put into his proof could have been written as functional tests: coded axioms and assertions, each step feeding into the next section.

Unfortunately, this kind of rigor rarely exists in production code. Many times the tests we write are scattershot assertions about the code. In many cases, we are testing the thunder, not the lightning, to use our earlier example (i.e., our testing focuses on sufficient conditions, not necessary conditions).

Example: Proof through sufficient conditions, unit tests, and integration tests

Unlike pure mathematics, sufficient conditions are focused on having just enough evidence to support causality. An example is inflation. This mysterious force in economics has been studied since the 19th century. The problem with proving that inflation exists is that we cannot use axioms.

Instead, we rely on the sufficient evidence from our observations to prove that inflation exists. Based on our experience looking at economic data and separating out factors we know to be true, we have found that economies tend to grow over time. Sometimes they deflate as well. The existence of inflation can be proved purely on our previous observations, which are consistent.

Sufficient conditions like this have an analog to integration tests. Integration tests aim to test the overarching behavior of a piece of code. Instead of monitoring little changes, integration tests will watch the entire program and see whether the intended behavior is still there. Likewise, if the economy were a program we could assert that inflation or deflation exists.

TDD Involves Writing Your Assumptions Down on Paper or in Code

Academic institutions require professors to publish their research. While many complain that universities focus too much on publications, there’s a reason why: publications are the way research becomes timeless. If professors decided to do their research in solitude and made exceptional breakthroughs but didn’t publish, that research would be worthless.

Test-driven development is the same way: tests can be great for peer review as well as serving as a form of documentation. Many times, in fact, documentation isn’t necessary when TDD is used. Software is abstract and always changing, so if someone doesn’t document or test his code it will most likely be changed in the future. If there isn’t a test ensuring that the code operates a certain way, then when a new programmer comes to work on the software she will probably change it.

TDD and Scientific Method Work in Feedback Loops

Both the scientific method and TDD work in feedback loops. When someone makes a hypothesis and tests it, he finds out more information about the problem he’s investigating. The same is true with TDD; someone makes a test for what he wants and then as he goes through writing code he has more information as to how to proceed.

Overall, TDD is a type of scientific method. We make hypotheses, test them, and then revisit them. This is the same approach that TDD practitioners take with writing a test that fails first, finding the solution to it, and then refactoring that solution.

Example: Peer review

Peer review is common across many fields and formats, whether they be academic journals, books, or programming. The reason editors are so valuable is because they are a third party to a piece of writing and can give objective feedback. The counterpart in the scientific community is peer reviewing journal articles.

Test-driven development is different in that the third party is a program. When someone writes tests, the program codes the assumptions and requirements and is entirely objective. This feedback can be valuable for the programmer to test assumptions before someone else looks at the code. It also helps with reducing bugs and feature misses.

This doesn’t mitigate the inherent issues with machine learning or math models; rather, it just defines the process of tackling problems and finding a good enough solution to them.

Risks with Machine Learning

While the scientific method and TDD are a good start to the development process, there are still issues that we might come across. Someone can follow the scientific method and still have wrong results; TDD just helps us create better code and be more objective. The following sections will outline some of these more commonly encountered issues with machine learning:

  • Unstable data

  • Underfitting

  • Overfitting

  • Unpredictable future

Unstable Data

Machine learning algorithms do their best to avoid unstable data by minimizing outliers, but what if the errors were our own fault? If we are misrepresenting what is correct data, then we will end up skewing our results.

This is a real problem considering the amount of incorrect information we may have. For example, if an application programming interface (API) you are using changes from giving you 0 to 1 binary information to –1 to 1, then that could be detrimental to the output of the model. We might also have holes in a time series of data. With this instability, we need a way of testing for data issues to mitigate human error.
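As a concrete sketch of such a check (the helper name and the raise-on-failure behavior here are our own illustration, not from any particular library), a guard at the data seam can fail loudly when an upstream API silently changes its label encoding from 0/1 to –1/1:

```ruby
# Hypothetical guard for an upstream feed that should yield 0/1 labels.
# If the API silently switches to -1/1, we fail loudly instead of letting
# the bad encoding skew the model downstream.
def validate_binary_labels(labels)
  bad = labels.reject { |l| [0, 1].include?(l) }
  raise "unexpected label values: #{bad.uniq.inspect}" unless bad.empty?
  labels
end
```

A run over clean data passes the labels through untouched; a feed that starts returning –1 raises immediately, surfacing the upstream change at the seam rather than in the model’s output.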

Underfitting

Underfitting is when a model doesn’t take into account enough information to accurately model real life. For example, if we observed only two points on an exponential curve, we would probably assert that there is a linear relationship there (Figure 1-1). But there may not be a pattern, because there are only two points to reference.

Figure 1-1. In the range of –1 to 1 a line will fit an exponential curve well

Unfortunately, though, when you increase the range you won’t see nearly as clear results, and instead the error will drastically increase (Figure 1-2).

Figure 1-2. In the range of –20 to 20 a line will not fit an exponential curve at all

In statistics, there is a measure called power, which denotes the probability of avoiding a false negative (a type II error). As power goes up, false negatives go down. What influences this measure is the sample size: if our sample size is too small, we just don’t have enough information to come up with a good solution.

Overfitting

While too little of a sample isn’t ideal, there is also some risk of overfitting data. Using the same exponential curve example, let’s say we have 300,000 data points. Overfitting the model would mean building a function with 300,000 operators in it, effectively memorizing the data. This is possible, but it wouldn’t perform well on a new data point outside of that sample.

It seems that the best way to mitigate underfitting a model is to give it more information, but this actually can be a problem as well. More data can mean more noise and more problems. Using too much data and too complex of a model will yield something that works for that particular data set and nothing else.

Unpredictable Future

Machine learning is well suited for the unpredictable future, because most algorithms learn from new information. But as new information is found, it can also come in unstable forms, and new issues can arise that weren’t thought of before. We don’t know what we don’t know. When processing new information, it’s sometimes hard to tell whether our model is working.

What to Test for to Reduce Risks

Given the fact that we have problems such as unstable data, underfitted models, overfitted models, and uncertain future resiliency, what should we do? There are some general guidelines and techniques, known as heuristics, that we can write into tests to mitigate the risk of these issues arising.

Mitigate Unstable Data with Seam Testing

In his book Working Effectively with Legacy Code (Prentice Hall), Michael Feathers introduces the concept of testing seams when interacting with legacy code. Seams are simply the points of integration between parts of a code base. In legacy code, many times we are given a piece of code where we don’t know what it does internally but can predict what will happen when we feed it something. Machine learning algorithms aren’t legacy code, but they are similar. As with legacy code, machine learning algorithms should be treated like a black box.

Data will flow into a machine learning algorithm and flow out of the algorithm. We can test those two seams by unit testing our data inputs and outputs to make sure they are valid within our given tolerances.

Example: Seam testing a neural network

Let’s say that you would like to test a neural network. You know that the data that is yielded to a neural network needs to be between 0 and 1 and that in your case you want the data to sum to 1. When data sums to 1, that means it is modeling a percentage. For instance, if you have two widgets and three whirligigs, the array of data would be 2/5 widgets and 3/5 whirligigs. Because we want to make sure that we are feeding only information that is positive and adds up to 1, we’d write the following test in our test suite:

it 'needs to be between 0 and 1' do
  @weights = NeuralNetwork.weights
  @weights.each do |point|
    point.must_be :>=, 0
    point.must_be :<=, 1
  end
end

it 'has data that sums up to 1' do
  @weights = NeuralNetwork.weights
  @weights.reduce(&:+).must_equal 1
end

Seam testing serves as a good way to define interfaces between pieces of code. While this is a trivial example, note that the more complex the data gets, the more important these seam tests are. As new programmers touch the code, they might not know all the intricacies that you do.

Check Fit by Cross-Validating

Cross-validation is a method of splitting all of your data into two parts: training and validation (see Figure 1-3). The training data is used to build the machine learning model, whereas the validation data is used to validate that the model is doing what is expected. This increases our ability to find and determine the underlying errors in a model.


Training is special to the machine learning world. Because machine learning algorithms aim to map previous observations to outcomes, training is essential. These algorithms learn from data that has been collected, so without an initial set to train on, the algorithm would be useless.

Swapping training with validation helps increase the number of tests. You would do this by splitting the data in two; the first time, you’d use set 1 to train and set 2 to validate, and then you’d swap them for the second test. Depending on how much data you have, you could split the data into smaller sets and cross-validate that way. If you have enough data, you could split cross-validation into an indefinite number of sets.
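The swap described above can be sketched in a few lines of Ruby (the helper name is ours, not from any library):

```ruby
# Split the data in half and yield both [train, validate] orderings,
# so each half gets a turn as the validation set.
def two_fold_splits(data)
  half = data.size / 2
  first, second = data.take(half), data.drop(half)
  [[first, second], [second, first]]
end

two_fold_splits((1..6).to_a)
# => [[[1, 2, 3], [4, 5, 6]], [[4, 5, 6], [1, 2, 3]]]
```

With more data, `each_slice` would cut the set into smaller folds and rotate which fold validates, which is the usual k-fold generalization.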

In most cases, people decide to split validation and training data in half—one part to train the model and the other to validate that it works with real data. If, for instance, you are training a language model that tags many parts of speech using a Hidden Markov Model, you want to minimize the error of the model.

Figure 1-3. Our real goal is to minimize the cross-validated error or real error rate

Example: Cross-validating a model

From our trained model we might have a 5% error rate, but when we introduce data outside of the model, that error might skyrocket to something like 15%. That is why it’s important to use a data set that is separate; this is as essential to machine learning as double-entry accounting is to accounting. For example:

def compare(network, text_file)
  misses = 0
  hits = 0
  sentences = text_file.sentences
  sentences.each do |sentence|
    # network.classify stands in for however the model yields a label
    if network.classify(sentence) == sentence.classification
      hits += 1
    else
      misses += 1
    end
  end
  assert misses < (0.05 * (misses + hits))
end

def test_first_half
  compare(first_data_set, second_data_set)
end

def test_second_half
  compare(second_data_set, first_data_set)
end

This method of first splitting data into two sets eliminates common issues that might happen as a result of improper parameters on your machine learning model. It’s a great way of finding issues before they become a part of any code base.

Reduce Overfitting Risk by Testing the Speed of Training

Occam’s Razor emphasizes simplicity when modeling data: the simpler solution is the better one. This directly implies “don’t overfit your data.” Overfitted models generally just memorize the data given to them; a simpler solution, if one can be found, will capture the underlying patterns rather than parsing out the previous data.

A good proxy for complexity in a machine learning model is how long it takes to train it. If you are testing different approaches to solving a problem and one takes 3 hours to train while the other takes 30 minutes, the one that takes less time to train is generally better. The best approach would be to wrap a benchmark around the code to find out whether it’s getting faster or slower over time.

Many machine learning algorithms have max iterations built into them. In the case of neural networks, you might set a max epoch of 1,000 so that if the model isn’t trained within 1,000 iterations, it isn’t good enough. An epoch is just a measure of one iteration through all inputs going through the network.
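A minimal sketch of such a max-epoch guard (the training step here is a stand-in toy, not a real network) might look like:

```ruby
# Train until the error tolerance is met, but give up after max_epochs.
# The block stands in for one epoch: a full pass through all inputs,
# returning the current error.
def train_with_max_epochs(max_epochs:, tolerance:)
  error  = Float::INFINITY
  epochs = 0
  while error > tolerance && epochs < max_epochs
    error   = yield(epochs)
    epochs += 1
  end
  { error: error, epochs: epochs, converged: error <= tolerance }
end

# Toy run: the error halves every epoch, so convergence is quick.
result = train_with_max_epochs(max_epochs: 1_000, tolerance: 0.01) do |epoch|
  1.0 / (2**(epoch + 1))
end
```

If the block never drove the error below the tolerance, `converged` would come back false after 1,000 epochs—exactly the “not good enough” signal described above.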

Example: Benchmark testing

To take it a step further, you can also use a unit testing framework like MiniTest to add a computational-complexity or IPS (iterations per second) benchmark test to your test suite, so that performance doesn’t degrade over time. For example:

it 'should not run too much slower than last time' do
  bm = Benchmark.measure do
    # network.run stands in for whatever trains or evaluates the model
    network.run('sentence')
  end
  bm.real.must_be :<, (time_to_run_last_time * 1.2)
end

In this case, we don’t want the code to run more than 20% slower than it did last time.

Monitor for Future Shifts with Precision and Recall

Precision and recall are ways of monitoring the power of the machine learning implementation. Precision is a metric that monitors the percentage of yielded results that are true positives. For example, a precision of 4/7 would mean that 4 were correct out of the 7 yielded to the user. Recall is the ratio of true positives to true positives plus false negatives. Let’s say that we have 4 true positives and 5 false negatives; in that case, recall would be 4/9.

User input is needed to calculate precision and recall. This closes the learning loop and improves data over time due to information feeding back after being misclassified. Netflix, for instance, illustrates this by displaying a star rating that it predicts you’d give a certain movie based on your watch history. If you don’t agree with it and rate it differently or indicate you’re not interested, Netflix feeds that back into its model for future predictions.
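The two ratios from the numbers above reduce to a couple of one-line helpers (a sketch; the arguments are raw counts):

```ruby
# Precision: of everything yielded to the user, what fraction was right?
def precision(true_positives, false_positives)
  true_positives / (true_positives + false_positives).to_f
end

# Recall: of everything that should have been found, what fraction was?
def recall(true_positives, false_negatives)
  true_positives / (true_positives + false_negatives).to_f
end

precision(4, 3)  # 4 correct out of 7 yielded  => 4/7
recall(4, 5)     # 4 found, 5 missed           => 4/9
```

Tracking these two numbers over time is what surfaces a model that is quietly drifting as new data arrives.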

Conclusion

Machine learning is a science and requires an objective approach to problems. Just like the scientific method, test-driven development can aid in solving a problem. The reason that TDD and the scientific method are so similar is because of these three shared characteristics:

  • Both propose that the solution is logical and valid.

  • Both share results through documentation and work over time.

  • Both work in feedback loops.

But while the scientific method and test-driven development are similar, there are some issues specific to machine learning:

  • Unstable data

  • Underfitting

  • Overfitting

  • Unpredictable future

Fortunately, these challenges can be mitigated through the heuristics shown in Table 1-2.

Table 1-2. Heuristics to mitigate machine learning risks

  Problem/risk            Heuristic

  Unstable data           Seam testing

  Underfitting            Cross-validation

  Overfitting             Benchmark testing (Occam’s Razor)

  Unpredictable future    Precision/recall tracking over time

The best part is that you can write and think about all of these heuristics before writing actual code. Test-driven development, like the scientific method, is valuable as a way to approach machine learning problems.
