Can I start with a confession? I love algorithms. Even more than any specific algorithm, I love the idea of algorithms: the notion that interrelationships in the real world can be expressed and clarified through formulas comforts me. It offers assurance that the chaotic world is, in fact, governed by foundational principles, so much so that, given all the data and limitless resources for building models, we could eventually uncover the workings of the universe and beyond.
Does that sound weird, even scary? Well, it should! It has recently been suggested that the word mathematism simply means "the ideological interpretation of mathematical conclusions." In my experience, that kind of absolutist thinking runs contrary to finding real-world solutions. In the mathematism paradigm, math becomes a sort of categorical, absolute truth. Sometimes, mathematicians perceive anyone who can't follow their reasoning as lost mathematical "heathens." Perhaps even worse, blind faith in mathematics can blunt a person's critical eye. That blind faith is precisely what every kind of science needs to avoid.
The intersection of innovation and interpretability
Data scientists need to avoid any smug comfort that their findings are absolute truths rather than just another (albeit important) piece of the strategic consideration puzzle. They need to be humble with conclusions and transparent in explanations. Ninety percent of university students who take a statistics course are forced to do so and hate the experience (I made that number up; challenge me if you think it's wrong). That cannot simply be blamed on the students. If the math "heathens" are to follow our lead, we need to make our analyses clear and concise. At the same time, we need to constantly and thoroughly question our own results.
That's precisely why I found Patrick Hall's recent post on balancing accuracy with interpretability such an excellent introduction to the topic of modern machine learning. We need to pursue algorithmic innovation while continually making the interpretation of our analyses more digestible for an ever-growing variety of consumers. Those are two essential objectives in modern analytics, and we cannot afford to neglect either. Let me start with the algorithms.
Algorithmic innovation is all about pushing the limits of what cutting-edge nonlinear algorithms can do with sparse data. Predicting rare events is all too often the holy grail of predictive modeling, and, unfortunately, easily interpretable linear models often fail at this task. Let me give an example.
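A small synthetic sketch makes the rare-event difficulty concrete (the 1% event rate and the data below are hypothetical, chosen purely for illustration): when events are rare, a trivial "model" that never predicts the event still posts a very high accuracy while catching none of the events we actually care about, which is why naive accuracy, and the simple models tuned to it, can mislead.

```python
import numpy as np

# Synthetic labels with a ~1% rare-event rate (hypothetical example).
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)

# A "model" that always predicts "no event."
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
# Recall on the rare class: fraction of actual events we detected.
recall = y_pred[y_true == 1].mean() if (y_true == 1).any() else 0.0

print(f"accuracy: {accuracy:.3f}")  # high, despite being useless
print(f"recall:   {recall:.3f}")    # no rare event detected at all
```

The gap between those two numbers is exactly why rare-event problems push modelers toward richer, harder-to-interpret algorithms and metrics beyond plain accuracy.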
Let's say an ambitious data scientist wants to use Learning Vector Quantization (LVQ) to develop an automatic classification system for pharmaceutical research texts residing in the PubMed database. Even if the results are fantastic, achieving 80% accuracy in placing those documents into relevant categories, non-believers will still focus on the 20% of mistakes. As tempting as it may be, the data scientist cannot afford to simply throw up their arms in frustration at the lack of comprehension among the "heathens." An intelligent filter with 80% accuracy that gets researchers to the right document more quickly is simply too big a process improvement to throw away.
It is crucial for the analyst to make their models transparent. Toward making the model in the previous case clear, the analyst would want to answer the following questions:
- What are the key concepts being derived in the textual data preparation? Explain this through the use of key weighted words in those derived concepts.
- How do those derived concepts interact at a simple level to gradually increase classification accuracy? Show some scatterplots indicating correlations of certain key concepts.
- How does the model find complex interactions to score? LVQ derives a “winner takes all” model, which can be first interpreted and then explained for non-experts.
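To make the "winner takes all" idea in that last step tangible, here is a minimal LVQ1 sketch on toy two-dimensional data. Everything here is hypothetical: the function names, parameters, and data stand in for the derived concept vectors that the text preparation would actually produce. Each sample is scored by its nearest prototype, and during training prototypes are nudged toward correctly classified samples and pushed away from misclassified ones.

```python
import numpy as np

def train_lvq(X, y, n_protos_per_class=1, lr=0.1, epochs=30, seed=0):
    """LVQ1: learn one or more prototype vectors per class."""
    rng = np.random.default_rng(seed)
    protos, proto_labels = [], []
    for c in np.unique(y):
        # Initialize prototypes from random samples of each class.
        idx = rng.choice(np.where(y == c)[0], n_protos_per_class, replace=False)
        protos.append(X[idx])
        proto_labels.extend([c] * n_protos_per_class)
    protos = np.vstack(protos).astype(float)
    proto_labels = np.array(proto_labels)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            d = np.linalg.norm(protos - X[i], axis=1)
            w = np.argmin(d)  # the "winner takes all" prototype
            step = lr * (X[i] - protos[w])
            # Attract the winner if its class matches, repel otherwise.
            protos[w] += step if proto_labels[w] == y[i] else -step
        lr *= 0.95  # decay the learning rate each epoch
    return protos, proto_labels

def predict(protos, proto_labels, X):
    """Assign each sample the class of its nearest prototype."""
    d = np.linalg.norm(X[:, None, :] - protos[None, :, :], axis=2)
    return proto_labels[np.argmin(d, axis=1)]

# Toy example: two well-separated "concept" clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(2, 0.3, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
protos, labels = train_lvq(X, y)
acc = (predict(protos, labels, X) == y).mean()
print(f"training accuracy: {acc:.2f}")
```

Because the final model is just a handful of labeled prototype vectors, it can be shown directly to non-experts: "a document lands in the category of the prototype it most resembles," which is exactly the kind of explanation the questions above call for.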
Even after this step-by-step process has been duly explained, the “heathens” might not be converted, but they may be more open to accepting the process improvement that the algorithm enables.
The need for creating, and recreating, digestible results is precisely why the notebook has become the interface of choice for data scientists. Data scientists are no strangers to code; they love the flexibility it brings. But a successful data science pipeline contains a complete storyline, not just score code. By including a range of explanatory graphs throughout the analysis, and complementing those graphs with text, pictures, and other means of explanation, a data science notebook not only gives the data scientist a complete script for explaining their analysis, but also gives any moderately skilled user full clarity to reproduce the analysis and follow the conclusions (eventually revealing where errors may have been made as well).
But why stop there? I would contend that the quest for mainstream machine learning goes beyond simpler explanation. To better understand the results of complex algorithms, the "heathens" need to develop a more intuitive grasp of the underlying analytics. That intuition can be developed in only one way: practice! Enter the world of approachable analytics.
Approachable analytics consists of intuitive, convenient interfaces that guide non-expert users to interactively explore data through simple graphs, basic statistics, and even prototypes of advanced predictive modeling techniques. Not only does this help business analysts develop an intuitive understanding of what machine learning does; these new "citizen data scientists" can also share their results collaboratively with teammates. Empowering business analysts to become citizen data scientists can even lead to a kind of lazy learning, with the less adept analysts confirming the results of the expert data scientists.
Let's use another example here. I once worked on a project with a major beer brewer that wanted to combine lab, production, and controller data to detect early warnings for batches with a high chance of going bad. After a few weeks of analysis, we found clear links between outside temperature, humidity in the hops, and personnel that seemed to create situations where loss was more likely. The problem with the discovery process was that I was not a beer (brewing) expert. I went down a lot of dead-end roads in the data that a brewing expert would not have wasted time on. If I could have had an army of brewers (or even just one) interactively exploring the data for me, I could have gotten to the "money" data much more quickly. Innovation in business units really means getting the full range of expertise with data, algorithms, and business domain knowledge involved in the data discovery process.
It’s the data, stupid
Finally, unquestioning faith in math is rivaled in peril by another absolutism: dataism. Assuming the data is always right is another dogmatism, one to which even experienced data scientists can fall prey. Setting up robust data governance practices is key to having confidence that, when machine learning is used, the underlying data is consistent and complete. See my short webcast on the topic for more insights.
Machine learning inevitably starts with data, algorithms, and know-how. However, for organizations to get the most return on machine learning, analytics need to be approachable. By realizing this, we'll successfully conquer our own tendencies toward mathematism.
To get started on machine learning best practices, have a look at some tables Patrick Hall made publicly available on GitHub here.
To read more about the increasing use of machine learning in business, download our free report “The Evolution of Analytics: Opportunities and Challenges for Machine Learning in Business.”
This post is a collaboration between O'Reilly and SAS. See our statement of editorial independence.