Chapter 1. Probably Approximately Correct Software
If you’ve ever flown on an airplane, you have participated in one of the safest forms of travel in the world. The odds of being killed in an airplane are 1 in 29.4 million, meaning that you could decide to become an airline pilot, and throughout a 40-year career, never once be in a crash. Those odds are staggering considering just how complex airplanes really are. But it wasn’t always that way.
The year 2014 was bad for aviation; there were 824 aviation-related deaths, including the Malaysia Air plane that went missing. In 1929 there were 257 casualties. This makes it seem like we’ve become worse at aviation until you realize that in the US alone there are over 10 million flights per year, whereas in 1929 there were substantially fewer—about 50,000 to 100,000. This means that the overall probability of being killed in a plane wreck from 1929 to 2014 has plummeted from 0.25% to 0.00824%.
Plane travel changed over the years and so has software development. While in 1929 software development as we know it didn’t exist, over the course of 85 years we have built and failed many software projects.
Recent examples include software projects like the launch of healthcare.gov, which was a fiscal disaster, costing around $634 million dollars. Even worse are software projects that have other disastrous bugs. In 2013 NASDAQ shut down due to a software glitch and was fined $10 million USD. The year 2014 saw the Heartbleed bug infection, which made many sites using SSL vulnerable. As a result, CloudFlare revoked more than 100,000 SSL certificates, which they have said will cost them millions.
Software and airplanes share one common thread: they’re both complex and when they fail, they fail catastrophically and publically. Airlines have been able to ensure safe travel and decrease the probability of airline disasters by over 96%. Unfortunately we cannot say the same about software, which grows ever more complex. Catastrophic bugs strike with regularity, wasting billions of dollars.
Why is it that airlines have become so safe and software so buggy?
Writing Software Right
Between 1929 and 2014 airplanes have become more complex, bigger, and faster. But with that growth also came more regulation from the FAA and international bodies as well as a culture of checklists among pilots.
While computer technology and hardware have rapidly changed, the software that runs it hasn’t. We still use mostly procedural and object-oriented code that doesn’t take full advantage of parallel computation. But programmers have made good strides toward coming up with guidelines for writing software and creating a culture of testing. These have led to the adoption of SOLID and TDD. SOLID is a set of principles that guide us to write better code, and TDD is either test-driven design or test-driven development. We will talk about these two mental models as they relate to writing the right software and talk about software-centric refactoring.
SOLID
SOLID is a framework that helps design better object-oriented code. In the same ways that the FAA defines what an airline or airplane should do, SOLID tells us how software should be created. Violations of FAA regulations occasionally happen and can range from disastrous to minute. The same is true with SOLID. These principles sometimes make a huge difference but most of the time are just guidelines. SOLID was introduced by Robert Martin as the Five Principles. The impetus was to write better code that is maintainable, understandable, and stable. Michael Feathers came up with the mnemonic device SOLID to remember them.
SOLID stands for:
-
Single Responsibility Principle (SRP)
-
Open/Closed Principle (OCP)
-
Liskov Substitution Principle (LSP)
-
Interface Segregation Principle (ISP)
-
Dependency Inversion Principle (DIP)
Single Responsibility Principle
The SRP has become one of the most prevalent parts of writing good object-oriented code. The reason is that single responsibility defines simple classes or objects. The same mentality can be applied to functional programming with pure functions. But the idea is all about simplicity. Have a piece of software do one thing and only one thing. A good example of an SRP violation is a multi-tool (Figure 1-1). They do just about everything but unfortunately are only useful in a pinch.
Open/Closed Principle
The OCP, sometimes also called encapsulation, is the principle that objects should be open for extending but not for modification. This can be shown in the case of a counter object that has an internal count associated with it. The object has the methods increment
and decrement
. This object should not allow anybody to change the internal count unless it follows the defined API, but it can be extended (e.g., to notify someone of a count change by an object like Notifier
).
Interface Segregation Principle
The ISP is the principle that having many client-specific interfaces is better than a general interface for all clients. This principle is about simplifying the interchange of data between entities. A good example would be separating garbage, compost, and recycling. Instead of having one big garbage can it has three, specific to the garbage type.
Dependency Inversion Principle
The DIP is a principle that guides us to depend on abstractions, not concretions. What this is saying is that we should build a layer or inheritance tree of objects. The example Robert Martin explains in his original paper1 is that we should have a KeyboardReader
inherit from a general Reader
object instead of being everything in one class. This also aligns well with what Arthur Riel said in Object Oriented Design Heuristics about avoiding god classes. While you could solder a wire directly from a guitar to an amplifier, it most likely would be inefficient and not sound very good.
Note
The SOLID framework has stood the test of time and has shown up in many books by Martin and Feathers, as well as appearing in Sandi Metz’s book Practical Object-Oriented Design in Ruby. This framework is meant to be a guideline but also to remind us of the simple things so that when we’re writing code we write the best we can. These guidelines help write architectually correct software.
Testing or TDD
In the early days of aviation, pilots didn’t use checklists to test whether their airplane was ready for takeoff. In the book The Right Stuff by Tom Wolfe, most of the original test pilots like Chuck Yeager would go by feel and their own ability to manage the complexities of the craft. This also led to a quarter of test pilots being killed in action.2
Today, things are different. Before taking off, pilots go through a set of checks. Some of these checks can seem arduous, like introducing yourself by name to the other crewmembers. But imagine if you find yourself in a tailspin and need to notify someone of a problem immediately. If you didn’t know their name it’d be hard to communicate.
The same is true for good software. Having a set of systematic checks, running regularly, to test whether our software is working properly or not is what makes software operate consistently.
In the early days of software, most tests were done after writing the original software (see also the waterfall model, used by NASA and other organizations to design software and test it for production). This worked well with the style of project management common then. Similar to how airplanes are still built, software used to be designed first, written according to specs, and then tested before delivery to the customer. But because technology has a short shelf life, this method of testing could take months or even years. This led to the Agile Manifesto as well as the culture of testing and TDD, spearheaded by Kent Beck, Ward Cunningham, and many others.
The idea of test-driven development is simple: write a test to record what you want to achieve, test to make sure the test fails first, write the code to fix the test, and then, after it passes, fix your code to fit in with the SOLID guidelines. While many people argue that this adds time to the development cycle, it drastically reduces bug deficiencies in code and improves its stability as it operates in production.3
Airplanes, with their low tolerance for failure, mostly operate the same way. Before a pilot flies the Boeing 787 they have spent X amount of hours in a flight simulator understanding and testing their knowledge of the plane. Before planes take off they are tested, and during the flight they are tested again. Modern software development is very much the same way. We test our knowledge by writing tests before deploying it, as well as when something is deployed (by monitoring).
But this still leaves one problem: the reality that since not everything stays the same, writing a test doesn’t make good code. David Heinemer Hanson, in his viral presentation about test-driven damage, has made some very good points about how following TDD and SOLID blindly will yield complicated code. Most of his points have to do with needless complication due to extracting out every piece of code into different classes, or writing code to be testable and not readable. But I would argue that this is where the last factor in writing software right comes in: refactoring.
Refactoring
Refactoring is one of the hardest programming practices to explain to nonprogrammers, who don’t get to see what is underneath the surface. When you fly on a plane you are seeing only 20% of what makes the plane fly. Underneath all of the pieces of aluminum and titanium are intricate electrical systems that power emergency lighting in case anything fails during flight, plumbing, trusses engineered to be light and also sturdy—too much to list here. In many ways explaining what goes into an airplane is like explaining to someone that there’s pipes under the sink below that beautiful faucet.
Refactoring takes the existing structure and makes it better. It’s taking a messy circuit breaker and cleaning it up so that when you look at it, you know exactly what is going on. While airplanes are rigidly designed, software is not. Things change rapidly in software. Many companies are continuously deploying software to a production environment. All of that feature development can sometimes cause a certain amount of technical debt.
Technical debt, also known as design debt or code debt, is a metaphor for poor system design that happens over time with software projects. The debilitating problem of technical debt is that it accrues interest and eventually blocks future feature development.
If you’ve been on a project long enough, you will know the feeling of having fast releases in the beginning only to come to a standstill toward the end. Technical debt in many cases arises through not writing tests or not following the SOLID principles.
Having technical debt isn’t a bad thing—sometimes projects need to be pushed out earlier so business can expand—but not paying down debt will eventually accrue enough interest to destroy a project. The way we get over this is by refactoring our code.
By refactoring, we move our code closer to the SOLID guidelines and a TDD codebase. It’s cleaning up the existing code and making it easy for new developers to come in and work on the code that exists like so:
-
Follow the SOLID guidelines
-
Single Responsibility Principle
-
Open/Closed Principle
-
Liskov Substitution Principle
-
Interface Segregation Principle
-
Dependency Inversion Principle
-
-
Implement TDD (test-driven development/design)
-
Refactor your code to avoid a buildup of technical debt
Writing the Right Software
Writing the right software is much trickier than writing software right. In his book Specification by Example, Gojko Adzic determines the best approach to writing software is to craft specifications first, then to work with consumers directly. Only after the specification is complete does one write the code to fit that spec. But this suffers from the problem of practice—sometimes the world isn’t what we think it is. Our initial model of what we think is true many times isn’t.
Webvan, for instance, failed miserably at building an online grocery business. They had almost $400 million in investment capital and rapidly built infrastructure to support what they thought would be a booming business. Unfortunately they were a flop because of the cost of shipping food and the overestimated market for online grocery buying. By many measures they were a success at writing software and building a business, but the market just wasn’t ready for them and they quickly went bankrupt. Today a lot of the infrastructure they built is used by Amazon.com for AmazonFresh.
In theory, theory and practice are the same. In practice they are not.
Albert Einstein
We are now at the point where theoretically we can write software correctly and it’ll work, but writing the right software is a much fuzzier problem. This is where machine learning really comes in.
Writing the Right Software with Machine Learning
In The Knowledge-Creating Company, Nonaka and Takeuchi outlined what made Japanese companies so successful in the 1980s. Instead of a top-down approach of solving the problem, they would learn over time. Their example of kneading bread and turning that into a breadmaker is a perfect example of iteration and is easily applied to software development.
But we can go further with machine learning.
What Exactly Is Machine Learning?
According to most definitions, machine learning is a collection of algorithms, techniques, and tricks of the trade that allow machines to learn from data—that is, something represented in numerical format (matrices, vectors, etc.).
To understand machine learning better, though, let’s look at how it came into existence. In the 1950s extensive research was done on playing checkers. A lot of these models focused on playing the game better and coming up with optimal strategies. You could probably come up with a simple enough program to play checkers today just by working backward from a win, mapping out a decision tree, and optimizing that way.
Yet this was a very narrow and deductive way of reasoning. Effectively the agent had to be programmed. In most of these early programs there was no context or irrational behavior programmed in.
About 30 years later, machine learning started to take off. Many of the same minds started working on problems involving spam filtering, classification, and general data analysis.
The important shift here is a move away from computerized deduction to computerized induction. Much as Sherlock Holmes did, deduction involves using complex logic models to come to a conclusion. By contrast, induction involves taking data as being true and trying to fit a model to that data. This shift has created many great advances in finding good-enough solutions to common problems.
The issue with inductive reasoning, though, is that you can only feed the algorithm data that you know about. Quantifying some things is exceptionally difficult. For instance, how could you quantify how cuddly a kitten looks in an image?
In the last 10 years we have been witnessing a renaissance around deep learning, which alleviates that problem. Instead of relying on data coded by humans, algorithms like autoencoders have been able to find data points we couldn’t quantify before.
This all sounds amazing, but with all this power comes an exceptionally high cost and responsibility.
The High Interest Credit Card Debt of Machine Learning
Recently, in a paper published by Google titled “Machine Learning: The High Interest Credit Card of Technical Debt”, Sculley et al. explained that machine learning projects suffer from the same technical debt issues outlined plus more (Table 1-1).
They noted that machine learning projects are inherently complex, have vague boundaries, rely heavily on data dependencies, suffer from system-level spaghetti code, and can radically change due to changes in the outside world. Their argument is that these are specifically related to machine learning projects and for the most part they are.
Instead of going through these issues one by one, I thought it would be more interesting to tie back to our original discussion of SOLID and TDD as well as refactoring and see how it relates to machine learning code.
Machine learning problem | Manifests as | SOLID violation |
---|---|---|
Entanglement |
Changing one factor changes everything |
SRP |
Hidden feedback loops |
Having built-in hidden features in model |
OCP |
Undeclared consumers/visibility debt |
ISP |
|
Unstable data dependencies |
Volatile data |
ISP |
Underutilized data dependencies |
Unused dimensions |
LSP |
Correction cascade |
* |
|
Glue code |
Writing code that does everything |
SRP |
Pipeline jungles |
Sending data through complex workflow |
DIP |
Experimental paths |
Dead paths that go nowhere |
DIP |
Configuration debt |
Using old configurations for new data |
* |
Fixed thresholds in a dynamic world |
Not being flexible to changes in correlations |
* |
Correlations change |
Modeling correlation over causation |
ML Specific |
SOLID Applied to Machine Learning
SOLID, as you remember, is just a guideline reminding us to follow certain goals when writing object-oriented code. Many machine learning algorithms are inherently not object oriented. They are functional, mathematical, and use lots of statistics, but that doesn’t have to be the case. Instead of thinking of things in purely functional terms, we can strive to use objects around each row vector and matrix of data.
SRP
In machine learning code, one of the biggest challenges for people to realize is that the code and the data are dependent on each other. Without the data the machine learning algorithm is worthless, and without the machine learning algorithm we wouldn’t know what to do with the data. So by definition they are tightly intertwined and coupled. This tightly coupled dependency is probably one of the biggest reasons that machine learning projects fail.
This dependency manifests as two problems in machine learning code: entanglement and glue code. Entanglement is sometimes called the principle of Changing Anything Changes Everything or CACE. The simplest example is probabilities. If you remove one probability from a distribution, then all the rest have to adjust. This is a violation of SRP.
Possible mitigation strategies include isolating models, analyzing dimensional dependencies,4 and regularization techniques.5 We will return to this problem when we review Bayesian models and probability models.
Glue code is the code that accumulates over time in a coding project. Its purpose is usually to glue two separate pieces together inelegantly. It also tends to be the type of code that tries to solve all problems instead of just one.
Whether machine learning researchers want to admit it or not, many times the actual machine learning algorithms themselves are quite simple. The surrounding code is what makes up the bulk of the project. Depending on what library you use, whether it be GraphLab, MATLAB, scikit-learn, or R, they all have their own implementation of vectors and matrices, which is what machine learning mostly comes down to.
OCP
Recall that the OCP is about opening classes for extension but not modification. One way this manifests in machine learning code is the problem of CACE. This can manifest in any software project but in machine learning projects it is often seen as hidden feedback loops.
A good example of a hidden feedback loop is predictive policing. Over the last few years, many researchers have shown that machine learning algorithms can be applied to determine where crimes will occur. Preliminary results have shown that these algorithms work exceptionally well. But unfortunately there is a dark side to them as well.
While these algorithms can show where crimes will happen, what will naturally occur is the police will start patrolling those areas more and finding more crimes there, and as a result will self-reinforce the algorithm. This could also be called confirmation bias, or the bias of confirming our preconceived notion, and also has the downside of enforcing systematic discrimination against certain demographics or neighborhoods.
While hidden feedback loops are hard to detect, they should be watched for with a keen eye and taken out.
LSP
Not a lot of people talk about the LSP anymore because many programmers are advocating for composition over inheritance these days. But in the machine learning world, the LSP is violated a lot. Many times we are given data sets that we don’t have all the answers for yet. Sometimes these data sets are thousands of dimensions wide.
Running algorithms against those data sets can actually violate the LSP. One common manifestation in machine learning code is underutilized data dependencies. Many times we are given data sets that include thousands of dimensions, which can sometimes yield pertinent information and sometimes not. Our models might take all dimensions yet use one infrequently. So for instance, in classifying mushrooms as either poisonous or edible, information like odor can be a big indicator while ring number isn’t. The ring number has low granularity and can only be zero, one, or two; thus it really doesn’t add much to our model of classifying mushrooms. So that information could be trimmed out of our model and wouldn’t greatly degrade performance.
You might be thinking why this is related to the LSP, and the reason is if we can use only the smallest set of datapoints (or features), we have built the best model possible. This also aligns well with Ockham’s Razor, which states that the simplest solution is the best one.
ISP
The ISP is the notion that a client-specific interface is better than a general purpose one. In machine learning projects this can often be hard to enforce because of the tight coupling of data to the code. In machine learning code, the ISP is usually violated by two types of problems: visibility debt and unstable data.
Take for instance the case where a company has a reporting database that is used to collect information about sales, shipping data, and other pieces of crucial information. This is all managed through some sort of project that gets the data into this database. The customer that this database defines is a machine learning project that takes previous sales data to predict the sales for the future. Then one day during cleanup, someone renames a table that used to be called something very confusing to something much more useful. All hell breaks loose and people are wondering what happened.
What ended up happening is that the machine learning project wasn’t the only consumer of the data; six Access databases were attached to it, too. The fact that there were that many undeclared consumers is in itself a piece of debt for a machine learning project.
This type of debt is called visibility debt and while it mostly doesn’t affect a project’s stability, sometimes, as features are built, at some point it will hold everything back.
Data is dependent on the code used to make inductions from it, so building a stable project requires having stable data. Many times this just isn’t the case. Take for instance the price of a stock; in the morning it might be valuable but hours later become worthless.
This ends up violating the ISP because we are looking at the general data stream instead of one specific to the client, which can make portfolio trading algorithms very difficult to build. One common trick is to build some sort of exponential weighting scheme around data; another more important one is to version data streams. This versioned scheme serves as a viable way to limit the volatility of a model’s predictions.
DIP
The Dependency Inversion Principle is about limiting our buildups of data and making code more flexible for future changes. In a machine learning project we see concretions happen in two specific ways: pipeline jungles and experimental paths.
Pipeline jungles are common in data-driven projects and are almost a form of glue code. This is the amalgamation of data being prepared and moved around. In some cases this code is tying everything together so the model can work with the prepared data. Unfortunately, though, over time these jungles start to grow complicated and unusable.
Machine learning code requires both software and data. They are intertwined and inseparable. Sometimes, then, we have to test things during production. Sometimes tests on our machines give us false hope and we need to experiment with a line of code. Those experimental paths add up over time and end up polluting our workspace. The best way of reducing the associated debt is to introduce tombstoning, which is an old technique from C.
Tombstones are a method of marking something as ready to be deleted. If the method is called in production it will log an event to a logfile that can be used to sweep the codebase later.
For those of you who have studied garbage collection you most likely have heard of this method as mark and sweep. Basically you mark an object as ready to be deleted and later sweep marked objects out.
Machine Learning Code Is Complex but Not Impossible
At times, machine learning code can be difficult to write and understand, but it is far from impossible. Remember the flight analogy we began with, and use the SOLID guidelines as your “preflight” checklist for writing successful machine learning code—while complex, it doesn’t have to be complicated.
In the same vein, you can compare machine learning code to flying a spaceship—it’s certainly been done before, but it’s still bleeding edge. With the SOLID checklist model, we can launch our code effectively using TDD and refactoring. In essence, writing successful machine learning code comes down to being disciplined enough to follow the principles of design we’ve laid out in this chapter, and writing tests to support your code-based hypotheses. Another critical element in writing effective code is being flexible and adapting to the changes it will encounter in the real world.
TDD: Scientific Method 2.0
Every true scientist is a dreamer and a skeptic. Daring to put a person on the moon was audacious, but through systematic research and development we have accomplished that and much more. The same is true with machine learning code. Some of the applications are fascinating but also hard to pull off.
The secret to doing so is to use the checklist of SOLID for machine learning and the tools of TDD and refactoring to get us there.
TDD is more of a style of problem solving, not a mandate from above. What testing gives us is a feedback loop that we can use to work through tough problems. As scientists would assert that they need to first hypothesize, test, and theorize, we can assert that as a TDD practitioner, the process of red (the tests fail), green (the tests pass), refactor is just as viable.
This book will delve heavily into applying not only TDD but also SOLID principles to machine learning, with the goal being to refactor our way to building a stable, scalable, and easy-to-use model.
The Plan for the Book
This book will cover a lot of ground with machine learning, but by the end you should have a better grasp of how to write machine learning code as well as how to deploy to a production environment and operate at scale. Machine learning is a fascinating field that can achieve much, but without discipline, checklists, and guidelines, many machine learning projects are doomed to fail.
Throughout the book we will tie back to the original principles in this chapter by talking about SOLID principles, testing our code (using various means), and refactoring as a way to continually learn from and improve the performance of our code.
Every chapter will explain the Python packages we will use and describe a general testing plan. While machine learning code isn’t testable in a one-to-one case, it ends up being something for which we can write tests to help our knowledge of the problem.
1 Robert Martin, “The Dependency Inversion Principle,” http://bit.ly/the-DIP.
2 Atul Gawande, The Checklist Manifesto (New York: Metropolitan Books), p. 161.
3 Nachiappan Nagappan et al., “Realizing Quality Improvement through Test Driven Development: Results and Experience of Four Industrial Teams,” Empirical Software Engineering 13, no. 3 (2008): 289–302, http://bit.ly/Nagappanetal.
4 H. B. McMahan et al., “Ad Click Prediction: A View from the Trenches.” In The 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2013, Chicago, IL, August 11–14, 2013.
5 A. Lavoie et al., “History Dependent Domain Adaptation.” In Domain Adaptation Workshop at NIPS ’11, 2011.
Get Thoughtful Machine Learning with Python now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.