Chapter 1. Introduction

Telling Better Stories with Data

Not enough gets said about abandoning crap.

Ira Glass

We’ve all seen them. The intimidating PowerPoint presentations with the army of bullet points marching down the screen. Often the lecturer will even apologize for the busy slide and then continue to present, reading every word on the slide exactly as printed. You start to wonder if you left the oven on last night. We all like stories. A well-constructed narrative in the form of a movie, book, television show, or podcast wraps around us like a blanket and draws our attention. The bullet-ridden PowerPoint…not so much. With the deluge of data that has come with the advent of the internet and IoT, we are tempted to splash some findings in a presentation, wipe our hands, and say “that is that.” However, as data professionals we can’t just rain data findings down on our audience. The prevailing advice is that you must tell a story with data—make sure it’s a compelling story that people want to hear. Don’t deny yourself the joy that storytelling can bring.

To tell a compelling story, you must identify it. What is being asked of my data? What insights are my users looking for? A company that specializes in providing services and equipment might ask, “What equipment needs servicing the most? The least? Is there a correlation between equipment type and parts replacement?” At that same company someone in the finance department might ask, “How can we more accurately predict cash-on-hand?” In sales the question might be, “What kind of customer churn do I have?”

After you’ve identified your story, you’ll need to find your audience. There are many ways to break them down, but generally your audience includes executives, business professionals, and technical professionals. While they might manage or direct many business processes, executives often know little about the daily functioning of such processes. The detail is irrelevant (or possibly confusing) to them—they want to know the story in big bold letters. Business professionals are the daily administrators of a business process, such as super users and business analysts. They know the process in detail and can understand raw tabular data. Technical professionals are the smallest segment of your audience; they usually comprise colleagues in data analytics and data science teams. This group requires less business and process background and more technical details such as the root-mean-squared error of the regression or the architecture of the neural network.

Once you’ve got your story and audience set, you’ll need to move forward with the most difficult and tenuous part of the journey: finding the data. Without the data to support your story, your journey will quickly come to an end. Let’s say you wanted to tell the story of how sunspots correlate to sales of hats and mittens in the northern hemisphere. Surprisingly, sunspot data is easy to obtain, so that part is covered. However, you only have details on sales of hats, not mittens; that data simply isn’t available. A cautious step is needed here. Do you alter your story to fit the data, or do you cut bait and find another story? Reversing the process can be done, but it’s a slippery slope. As a general rule, do not change your hypothesis to match your data.

Before you fully trust that data, you’ll need to vet it and start asking a lot of questions:

Is the source reputable? Did you scrape the data from a table on a website? What sources did that website use for the data, and how was it obtained? Sources such as ProPublica, the US Census Bureau, and Gapminder are trustworthy, but others might warrant a dash of caution.

Do you have too much data? Are there easily recognizable, worthless features? Look for features that are obviously precisely correlated. In the sunspot data mentioned earlier, perhaps you have a UTC timestamp feature and two other features for date and time. Either the date and time features or the timestamp should be thrown out. You can quickly look at correlations using techniques we will discuss later to help you identify when two features are too closely correlated for both of them to be useful.

Is the data complete? Use some preliminary data tools to make sure your data is not missing too much information. We’ll discuss this process in more detail later.
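Both of the last two checks can be done in a few lines of pandas. Here is a minimal sketch using a made-up sunspot dataset (all column names are illustrative) in which the same timestamp has been split into two derived features:

```python
import pandas as pd

# Hypothetical sunspot dataset with two features derived from the
# same timestamp (names are illustrative, not from a real source).
df = pd.DataFrame({
    "timestamp": pd.date_range("2020-01-01", periods=6, freq="D"),
    "sunspots": [55, 60, 58, None, 62, 64],
    "hat_sales": [120, 130, 125, 140, 150, None],
})
df["day_of_year"] = df["timestamp"].dt.dayofyear
df["days_elapsed"] = (df["timestamp"] - df["timestamp"].min()).dt.days

# Redundant features show up as pairwise correlations at (or near) 1.0;
# one feature from each such pair can be dropped.
corr = df[["sunspots", "hat_sales", "day_of_year", "days_elapsed"]].corr()
print(corr.round(2))

# Completeness check: the share of missing values per feature.
print(df.isna().mean())
```

Here `day_of_year` and `days_elapsed` correlate at exactly 1.0, flagging one of them as expendable, while `isna().mean()` reveals how much of each feature is missing.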

With the story in place, the audience identified, and the data vetted, what’s next? You’re now ready for the art and fun of the story—identifying what tools to use to either support or reject your null hypothesis. Saying you’re simply using “data science” as a tool is too vague to be useful. You have advanced reporting, machine learning, and deep learning in your arsenal. Often, just the organization of the data into an easy-to-use dashboard tells the whole story. Nothing more needs to be done. As deflating as that has been in our careers, it has happened more times than any other scenario. We start the journey thinking that we have a case for a recurrent neural network with either a gated recurrent unit or a long short-term memory module. And the excitement builds while we’re gathering the data. Then we realize a support vector machine or a simple regression would do just as well. Later, with not a little disappointment, we realize that a dashboard for users to explore the data is more than enough to tell the story. Not everything requires deep or even machine learning. Although it can often be entertaining, shoehorning your story into these paradigms often does not tell the story any better.

Finally, take a little time to learn a bit about the art of storytelling. Even our dry data science stories deserve some love and attention. Ira Glass is a fantastic storyteller. He has a series of four short videos on the art of storytelling. Watch them and sprinkle some of his sage advice into your story.

A Quick Look: Data Science for SAP Professionals

SAP professionals are busy every day supporting the business and users, constantly looking for process improvements. They gather requirements, configure or code in the SAP system, and, more often than not, live in the SAP GUI. They have intimate knowledge of the data within SAP as well as the business processes and can summon an army of transaction codes like incantations. When asked for a report with analytics, they really have two options: code the report in SAP or push the data to a data warehouse where someone else will generate the report. Both of these processes are typically long, resource-intensive endeavors that lead to frustration for the end user and the SAP professional. For one particular client, the biggest complaint from the SAP users was that by the time they actually got a requested report, it was no longer relevant.

Reading this book will help you—the SAP professional—build a bridge between the worlds of the business professional and the data scientist. Within these pages you will find ideas for getting out of the typical reporting and/or analytics methodology that has hitherto been so restrictive. As we discussed earlier, one of the first ways to do that is to simply ask better questions.

Here’s a typical SAP scenario: Cindy works in Accounts Receivable. She needs a 30-60-90 day overdue report listing past-due customers and putting them into buckets according to whether they are 30, 60, or 90 days past due. Sharon in Finance gets the request and knows that she can have a standard ALV (ABAP List Viewer) report created, or can extract the data and push it to a business warehouse (BW) where a report will be generated using MicroStrategy or whatever tools they have.
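The bucketing at the heart of Cindy’s request is itself a small computation. A minimal pandas sketch, using made-up column names rather than real SAP field names, might look like this:

```python
import pandas as pd

# Illustrative open-item data; in practice this would come from an
# SAP AR extract (these column names are invented, not SAP fields).
items = pd.DataFrame({
    "customer": ["ACME", "Globex", "Initech", "ACME"],
    "days_past_due": [12, 45, 95, 61],
    "amount": [1000.0, 2500.0, 400.0, 3200.0],
})

# Classic 30-60-90 aging buckets.
items["bucket"] = pd.cut(
    items["days_past_due"],
    bins=[0, 30, 60, 90, float("inf")],
    labels=["1-30", "31-60", "61-90", "90+"],
)

# Total past-due amount per aging bucket.
report = items.groupby("bucket", observed=False)["amount"].sum()
print(report)
```

Delivering exactly this table would satisfy the request; the point of what follows is that a data science mindset doesn’t stop here.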

What if we shifted Sharon’s perspective to that of a data scientist? Sharon gets the report request. She knows she can deliver just what was requested, but then she thinks, “What more can be done?” She opens up a notepad and jots down some ideas.

Are there repeat offenders in late payments?

Are there any interesting correlations in the data? We know the customer name, customer payment history, customer purchases, and dollar amount.

Can we predict when a person will be paying late? How late?

Can we use this data to help rate our customers? When inventory is low, a lower-rated customer might not get an order if a higher-rated customer makes the same request.

What types of visualizations would be helpful?

Sharon sketches out an interactive dashboard report that she thinks would be very useful for her users. Armed with these ideas and sketches, Sharon asks the department data scientist (or SAP developer) about the possibilities.

There is a distinct difference between the two approaches. The first is the typical SAP response; it limits the creative and intellectual capacity of the business analysts. The second leverages their creativity. Sharon won’t just provide the requested information. When she sees the data in SAP and asks better questions, she’ll be instrumental in substantial process improvements.

This is just one example. Think of the possibilities across all the requests a typical SAP team gets; hence this book!

Another way to shift the thinking of the SAP team to be more dynamic and data centric is to use better tools. This is the responsibility of the SAP developer. Most SAP developers live in the world of SAP’s application programming language, ABAP (Advanced Business Application Programming), and when asked to provide reports or process improvements they turn instantly to the SAP GUI or Eclipse. This is where they’re expected to spend time and deliver value.


ABAP originally stood for Allgemeiner Berichts-Aufbereitungs-Prozessor (general report preparation processor). It’s a server-side language specially designed to extend the core functionality of SAP. You can create programs that display reports, run business transactions, or ingest outside-system data and integrate it into SAP. A great deal of SAP ERP transactions run solely on ABAP code.

ABAP developers often specialize in one or more of the business functions that SAP provides. Since ABAP programs often directly enhance standard SAP features, ABAP developers become very familiar with how enterprises design their processes. It’s very common for people familiar with ABAP to perform both technical programming roles and business analyst roles.


SAP developers, we implore you: view SAP as a data source. The presentation layer and logic layer of reports should be abstracted away from the database layer (see Figure 1-1). It is worth noting that SAP data is highly structured with strict business rules. One of the most obvious advantages of this approach is that the logic layer has access to other sources of data, such as public data. Within an SAP system, if a request were made to view the correlations between sales of galoshes and weather patterns, the weather data from NOAA would have to be brought into either BI or SAP itself. With a tiered model, however, the data can be accessed by the logic tier and presented in the presentation layer. Often the data is available through an API, which allows for access without storage. This model also allows the logic tier to tie into tools like Azure Machine Learning Studio to perform machine or deep learning on the SAP data.

Figure 1-1. A simple, layered approach to databases, logic, and presentation of data science findings
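To make the galoshes-and-weather idea concrete, here is a sketch of what the logic tier does: join an SAP extract with external data and compute the correlation. Both DataFrames are stand-ins; in practice the first would come out of SAP and the second from a public API such as NOAA’s (endpoint details omitted here).

```python
import pandas as pd

# Logic-tier sketch: SAP is treated purely as a data source.
# Stand-in for a daily sales extract pulled from SAP.
sap_sales = pd.DataFrame({
    "date": pd.to_datetime(["2021-10-01", "2021-10-02", "2021-10-03"]),
    "galoshes_sold": [14, 40, 9],
})

# Stand-in for daily precipitation fetched from a weather API.
weather = pd.DataFrame({
    "date": pd.to_datetime(["2021-10-01", "2021-10-02", "2021-10-03"]),
    "precip_mm": [2.0, 18.5, 0.0],
})

# The join and the analysis both live in the logic tier; neither
# dataset needs to be loaded into SAP or BI for this to work.
combined = sap_sales.merge(weather, on="date", how="inner")
print(combined[["galoshes_sold", "precip_mm"]].corr())
```

The presentation tier would then render this correlation (or a chart of it) without ever touching the SAP database directly.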

SAP lacks the thousands of libraries available in Python or the thousands of packages available in R.1 It also lacks the ability to easily create dynamic, interactive dashboards and visualizations. Don’t get us wrong: SAP does have tools for advanced analytics, dashboards, and visualizations. It’s just that they cost a lot of money, effort, and time. Some places have lead times measured in months or quarters before reports can be created, and sometimes the window for a valid business question is measured in hours. With the tools in this book, we intend to close that gap. If you’re an SAP developer, we strongly advise you to learn programming languages like Python and R so that you can use them to do your analytics on SAP data. First, they are not limited to the SAP ecosystem; second, they are free.

Outside of SAP, there are numerous other tools to help SAP developers present their SAP data. You can use RMarkdown in R, Shiny in R, Jupyter Notebooks in Python, PowerBI, Tableau, Plotly...the list goes on. In this book we will provide presentation examples using PowerBI, RMarkdown, and Jupyter Notebooks.

A Quick Look: SAP Basics for Data Scientists

The lack of awareness around SAP is often surprising considering its size and ubiquity. Here’s an amazing fact: 77% of the world’s transaction revenue is involved—in one way or another—with an SAP system. If you spend money, you have more than likely interacted with SAP. And 92% of the Forbes Global 2000 largest companies are SAP customers.

But how in the world does SAP software touch all that? What does it do? While in recent years SAP has acquired a number of SaaS (Software as a Service) companies to broaden its portfolio and make shareholders richer, it began with its core focus on ERP: enterprise resource planning.

SAP started in Germany in 1972 under the sexy moniker Systemanalyse und Programmentwicklung (system analysis and program development). Running under DOS on IBM servers, the first functionality was a back-office financial accounting package. Modules soon followed for purchasing, inventory management, and invoice verification. You can see the theme emerging: doing the common stuff that businesses need to do.

That list of functionality may seem rather dull at first, especially to us cool hipster data scientists with Python modules and TensorWhatsits who know how to make a computer tell us that a picture has a dog (but not an airplane) in it. It’s not magic like searching Google or using Siri on your iPhone. But SAP added a twist to those first few boring modules: integration. Inventory management directly affected purchasing, which directly affected financials, which directly affected...well, everything. That single SAP ERP system contained all of these modules. Now, instead of having to purchase and run separate financial/inventory/invoicing systems, companies saved loads of money. When one system gave them all the answers to business questions, customers started buying in droves. That was the value and the win of ERP. By the time Gartner coined the term ERP in the 1990s, SAP was doing over a billion Deutsche marks in yearly sales.


Since such a high percentage of large companies around the world use SAP for so many business-critical functions, is it any wonder that so much business can be conducted inside it?

Getting Data Out of SAP

Like most large business applications, SAP ERP uses a relational database to house transactional and master data. It’s designed such that customers can choose from many relational database management systems (RDBMS) to function as the SAP application database. Microsoft SQL Server, IBM DB2, Oracle, and SAP’s MaxDB are all supported. In the last few years, SAP has rapidly introduced another proprietary database technology, HANA, as an RDBMS solution with in-memory technology. While future versions of SAP’s core ERP product will one day require HANA, most SAP installations today still use one of the other technologies as their database.


In this book, we will introduce several ways of getting data out of your SAP system, none of which will require you to know exactly which DB your SAP system runs on. But if you’re a true nerd, you’ll find out anyway.

The relational databases that power the SAP instances at your company are huge and full of transactional and master data. They fully describe the shape of the vital business information stored and processed by SAP. The databases at the heart of your SAP systems are the source of truth for the discoveries you can make.

And unless it’s your absolute last resort, you should never directly connect to them.

All right, we’re being a little facetious here. You will find valid times to directly query data from the SAP databases with SQL statements. But the sheer size and incredible complexity of the data model make it so that fully understanding the structure of a simple sales order can involve over 40 tables and 1000+ fields. Even SAP black belts have difficulty remembering all the various tables and fields they need to use, so imagine how inefficient it would be for a data scientist who is new to SAP to unpack all the various bits of requisite data.

BAPIs: Using the NetWeaver RFC Library

Data nerds who don’t know SAP that well should start by examining the available Business Application Programming Interfaces (BAPIs) in the SAP system. BAPIs are remote-callable functions provided by SAP that expose the data in various business information documents. Instead of figuring out which of the 40+ sales order tables apply to your particular data question, you can look at the structure of various sales order BAPIs and determine if they fill that gap. The trouble of reverse engineering the data model is gone.
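As a concrete taste, here is a minimal sketch of calling a sales order BAPI from Python with the open source pyrfc library. The connection parameters and the summarization helper are our own illustration; the BAPI name and its CUSTOMER_NUMBER/SALES_ORGANIZATION parameters follow the standard BAPI_SALESORDER_GETLIST interface, but check the BAPI Explorer (transaction BAPI) on your own system for what’s available to you.

```python
def fetch_sales_orders(customer, sales_org, conn_params):
    """Call BAPI_SALESORDER_GETLIST over RFC and return its result rows.

    Requires the pyrfc package (built on the SAP NetWeaver RFC SDK) and
    credentials for a live SAP system. conn_params is a dict of settings
    such as ashost, sysnr, client, user, and passwd.
    """
    from pyrfc import Connection  # imported lazily; needs the RFC SDK installed

    conn = Connection(**conn_params)
    try:
        result = conn.call(
            "BAPI_SALESORDER_GETLIST",
            CUSTOMER_NUMBER=customer,
            SALES_ORGANIZATION=sales_org,
        )
        # BAPI table parameters come back as lists of dicts.
        return result["SALES_ORDERS"]
    finally:
        conn.close()


def count_by_material(rows):
    """Summarize BAPI table rows (a list of dicts) by material number."""
    counts = {}
    for row in rows:
        material = row.get("MATERIAL", "?")
        counts[material] = counts.get(material, 0) + 1
    return counts
```

Notice that nothing here requires knowing which of the 40+ underlying sales order tables is involved: the BAPI interface is the contract.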

BAPIs also help by covering over system limitations from earlier versions. During the early period of SAP’s core product development, the various modules restricted the number of characters that could denote a table or field. With SAP’s remarkable stability over the years, those table and field names have stuck around. Without living inside SAP, how could you possibly know that “LIKP” and “VBELN” have anything to do with delivery data? BAPIs are a later addition, so they have grown up with interfaces that better describe their shape and function.


SAP NetWeaver Gateway represents one of SAP’s many ways of breaking into the modern web era. It’s an SAP module—in some cases running enough of its own stuff to be worth a separate system—that allows SAP developers to quickly and easily establish HTTP connections to SAP backend business data. You’ll see examples of using SAP NetWeaver Gateway in Chapter 6.

The foundational layer of transport is known as OData. OData represents many tech companies coming together to put forward a standard way of communicating over the web via RESTful APIs. It provides a common format for data going over the web using either XML or JSON; ways for clients to indicate the basic create/read/update/delete operations on server data; and an XML-based metadata document through which servers specify to clients exactly the fields, structure, and options for interacting with the data they provide.
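From the client side, an OData request is just a carefully constructed URL. A minimal sketch of building one in Python follows; the host, service, and entity-set names are hypothetical, though a real Gateway service does live under a path of the form /sap/opu/odata/sap/&lt;SERVICE_NAME&gt;/ on your SAP system.

```python
from urllib.parse import urlencode

# Hypothetical Gateway service and entity set.
base = "https://sap.example.com/sap/opu/odata/sap/ZSALES_SRV/SalesOrderSet"

params = {
    "$filter": "CustomerID eq '1000'",   # server-side read filter
    "$select": "SalesOrderID,NetValue",  # only the fields we need
    "$format": "json",                   # JSON instead of the XML default
}

# urlencode percent-encodes the $-prefixed OData query options.
url = base + "?" + urlencode(params)
print(url)
```

A client would then issue an authenticated HTTP GET against this URL (for example with the requests library), and Gateway would answer with JSON shaped exactly as the service’s metadata document describes.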

Using OData through SAP NetWeaver Gateway requires programming in SAP’s native backend language, ABAP. Some of our SAP-native readers may be well versed in this language and can produce Gateway OData APIs. Other readers will likely be unfamiliar, but should take solace: if your company runs SAP in any meaningful way, your company will have people who know ABAP. These people will either know how to create OData services, or will be able to quickly learn since it’s not difficult.

Choose OData when you can’t find a BAPI that meets your data needs. It’s a great middle ground that provides SAP administrators with the flexibility to meter and monitor its usage. It also gives developers the ability to put together data in any way they choose. Another benefit of OData is that it doesn’t require a NetWeaver connector as the BAPI method does: any device that can make HTTP requests safely inside the corporate network can make OData requests.

Other ways to get data

If you can’t find the right BAPI and you can’t find the resources to make an OData service, there are always a few other routes you can take. We’ll cover those more briefly, since they aren’t things we typically recommend.2

Web services

SAP allows you to create web services based on its Internet Communication Manager (ICM) layer. These web services allow you to work even more flexibly than OData, but they still require ABAP knowledge. The space between OData with Gateway and a totally custom SAP web service is small—consider carefully whether your data question can’t be answered with OData.

Direct database access

Everyone says you shouldn’t, but we’ve all also encountered one or two times when it was the only thing that would work. If you need to go this route, a key task will be ensuring that the data you extract matches up with what SAP provides on the screen to end users. Many times there are hidden input/output conversions and layers of data modeling that don’t become apparent when just browsing through a data model.

Seriously. Picking directly from an SAP database is like driving a Formula One car with brake problems. You’ll get where you need to go really fast, but you’ll probably smash into a wall or two on the way.

Screen dumps to Excel

Sometimes an end user will know exactly which screen has the right data for them. Many times this screen will have a mechanism for exporting data to Excel.
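Once the export lands on disk, it still has to be parsed. SAP GUI exports frequently arrive with European number formatting (1.234,56), which trips up naive loaders. Here is a hedged sketch of cleaning such a file with pandas; the inline sample content stands in for a real export saved as semicolon-delimited text.

```python
import io

import pandas as pd

# Stand-in for an SAP screen dump saved as semicolon-delimited text,
# with European thousands separators and decimal commas.
raw = io.StringIO(
    "Customer;Amount\n"
    "ACME;1.234,56\n"
    "Globex;987,10\n"
)

# thousands="." and decimal="," tell pandas how to parse the numbers.
df = pd.read_csv(raw, sep=";", thousands=".", decimal=",")
print(df)
```

For a real file you would pass its path instead of the StringIO buffer; the parsing options are the same.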

Roles and Responsibilities

Data science combines a range of skill sets. These often include statistics, programming, machine learning, analysis, architecture, and engineering. Many blogs and posts online discuss the differences between data science roles. There are innumerable job titles and delineations. One camp defines roles into data analysts, data engineers, data architects, data scientists, and data generalists. Other groups have their own delineations.

Readers should understand something very important. Unless you are at a very large company with a data science team, you will be lucky to have one person on your team with some of these skills. These job delineations exist in theory for all, but in practice for only a small percentage. Be prepared to wear many hats. If you apply some of these forays into data science at your company, be prepared to do the work yourself. Don’t have a SQL database and want to extract and store some SAP data? We’ll introduce this. Want to automate a workflow for extraction? Here you go. Everything from the SAP data to the presentation layer will be covered.

Our intention is clear: we want to create citizen data scientists who understand what it takes to make data science work at their organizations. You may not have any resources to help you, and you may get resistance when you ask for some of these things. Often, you must prove your theory before someone helps. We understand that the roles and responsibilities are not well defined. We hope to give you an overview of the landscape. If you’re reading this book, you’ve already rolled up your sleeves and are ready to do everything from building SQL databases to presenting machine learning results in PowerBI.


A huge part of getting value is communicating it. We went over how to tell great stories with the data you find in SAP: identify your story, find the audience, discover the data, and apply rigorous tooling to that discovered data. Sometimes all it takes to communicate the story is one simple graph. Other times it may require detailed lists of results. But no matter what visual method conveys your findings, be prepared to tell a story with it.

SAP professionals looking to tell stories about their data should look at tools such as programming languages like Python and R, and visualization tools like Tableau and Power BI. Look at Chapter 2 to dive deeper.

Data scientists looking to discover what’s in SAP should look at ways of getting that data out. BAPIs provide a function-based approach to retrieving data, OData sets up repeatable and predictable HTTP services, and you can always dump screen data to Excel or directly query the SAP database as a last resort. Look at Chapter 3 to find out more.

We want you to get the most out of the SAP data that’s ripe for the picking in your enterprise, and the best way to get value out of raw data is by applying data science principles. This book will show you how to marry the world of SAP with the world of data science.

1 For a taste of how expansive the R package landscape is, see this blog post for perspective on package list growth and search strategies for finding the right ones.

2 However, this book couldn’t be called “Practical” if we didn’t acknowledge that the worst hacks and ill-advised duct-tape solutions make up at least 50% of any real-world environment.
