Chapter 1. Introduction
Telling Better Stories with Data
Not enough gets said about abandoning crap.
Ira Glass
We've all seen them: the intimidating PowerPoint presentations with an army of bullet points marching down the screen. Often the lecturer will even apologize for the busy slide and then continue to present, reading every word on the slide exactly as printed. You start to wonder if you left the oven on last night. We all like stories. A well-constructed narrative in the form of a movie, book, television show, or podcast wraps around us like a blanket and draws our attention. The bullet-ridden PowerPoint...not so much. With the deluge of data that has come with the advent of the internet and IoT, we are tempted to splash some findings in a presentation, wipe our hands, and say "that is that." However, as data professionals we can't just rain data findings down on our audience. The prevailing advice is that you must tell a story with data, so make sure it's a compelling story that people want to hear. Don't deny yourself the joy that storytelling can bring.
To tell a compelling story, you must first identify it. What is being asked of my data? What insights are my users looking for? A company that specializes in providing services and equipment might ask, "What equipment needs servicing the most? The least? Is there a correlation between equipment type and parts replacement?" At that same company someone in the finance department might ask, "How can we more accurately predict cash-on-hand?" In sales the question might be, "What kind of customer churn do I have?"
After you've identified your story, you'll need to find your audience. There are many ways to break them down, but generally your audience includes executives, business professionals, and technical professionals. While they might manage or direct many business processes, executives often know little about the daily functioning of such processes. The detail is irrelevant (or possibly confusing) to them; they want to know the story in big bold letters. Business professionals, such as super users and business analysts, are the daily administrators of a business process. They know the process in detail and can understand raw tabular data. Technical professionals are the smallest segment of your audience; they are usually colleagues in data analytics and data science teams. This group requires less business and process background and more technical details such as the root-mean-squared error of the regression or the architecture of the neural network.
Once you've got your story and audience set, you'll need to move forward with the most difficult and tenuous part of the journey: finding the data. Without the data to support your story, your journey will quickly come to an end. Let's say you wanted to tell the story of how sunspots correlate with sales of hats and mittens in the northern hemisphere. Surprisingly, sunspot data is easy to obtain. You've got that. However, you only have details on sales of hats, not mittens. You can't find that data. A cautious step is needed here. Do you alter your story to fit the data or do you cut bait and find another story? Reversing the process can be done but it's a slippery slope. As a general rule, do not change your hypothesis to match your data.
Before you fully trust that data, you'll need to vet it and start asking a lot of questions:
Is the source reputable? Did you scrape the data from a table on a website? What sources did that website use for the data, and how was it obtained? Sources such as Data.gov, ProPublica, the US Census Bureau, and Gapminder are trustworthy, but others might need a dash of caution.
Do you have too much data? Are there easily recognizable, worthless features? Look for features that are obviously redundant or perfectly correlated. In the sunspot data mentioned earlier, perhaps you have a UTC timestamp feature and two other features for date and time. Either the timestamp or the date and time pair should be dropped. You can quickly look at correlations using techniques we will discuss later to help you identify when two features are too closely correlated for both of them to be useful.
Is the data complete? Use some preliminary data tools to make sure your data is not missing too much information. We'll discuss this process in more detail later.
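The redundancy and completeness questions above can be checked with very little code. Here is a minimal sketch using only the Python standard library; the sample rows, the column names (`epoch`, `day`, `sunspots`), and the helper names are our own assumptions for illustration, not part of any real sunspot dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def missing_share(rows, column):
    """Fraction of rows where the given column is None."""
    return sum(1 for row in rows if row[column] is None) / len(rows)


# Hypothetical sample: a UTC epoch timestamp, a day counter, and a sunspot count.
rows = [
    {"epoch": 1_546_300_800, "day": 1, "sunspots": 12},
    {"epoch": 1_546_387_200, "day": 2, "sunspots": None},
    {"epoch": 1_546_473_600, "day": 3, "sunspots": 9},
    {"epoch": 1_546_560_000, "day": 4, "sunspots": 15},
]

# The timestamp and the day counter carry the same information.
redundancy = pearson([r["epoch"] for r in rows], [r["day"] for r in rows])
gaps = missing_share(rows, "sunspots")

print(f"epoch vs. day correlation: {redundancy:.3f}")
print(f"share of missing sunspot counts: {gaps:.0%}")
```

A correlation at or very near 1 between two features suggests that one of them can go; how large a share of missing values is tolerable depends on the story you are trying to tell.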
With the story in place, the audience identified, and the data vetted, what's next? You're now ready for the art and fun of the story: identifying what tools to use to either reject or fail to reject your null hypothesis. To say you're using "data science" as a tool is a slippery slope. You have advanced reporting, machine learning, and deep learning in your arsenal. Often, just the organization of the data into an easy-to-use dashboard tells the whole story. Nothing more needs to be done. As deflating as that has been in our careers, it has happened more times than any other scenario. We start the journey thinking that we have a case for a recurrent neural network with either a gated recurrent unit or a long short-term memory module. And the excitement builds while we're gathering the data. Then we realize a support vector machine or a simple regression would do just as well. Later, with not a little disappointment, we realize that a dashboard for users to explore the data is more than enough to tell the story. Not everything requires deep or even machine learning. Although it can often be entertaining, shoehorning your story into these paradigms often does not tell the story any better.
Finally, take a little time to learn a bit about the art of storytelling. Even our dry data science stories deserve some love and attention. Ira Glass is a fantastic storyteller. He has a series of four short videos on the art of storytelling. Watch them and sprinkle some of his sage advice into your story.
A Quick Look: Data Science for SAP Professionals
SAP professionals are busy every day supporting the business and users, constantly looking for process improvements. They gather requirements, configure or code in the SAP system, and, more often than not, live in the SAP GUI. They have intimate knowledge of the data within SAP as well as the business processes and can summon an army of transaction codes like incantations. When asked for a report with analytics, they really have two options: code the report in SAP or push the data to a data warehouse where someone else will generate the report. Both of these processes are typically long, resource-intensive endeavors that lead to frustration for the end user and the SAP professional. For one particular client, the biggest complaint from the SAP users was that by the time they actually got a requested report, it was no longer relevant.
Reading this book will help you, the SAP professional, build a bridge between the worlds of the business professional and the data scientist. Within these pages you will find ideas for getting out of the typical reporting and/or analytics methodology that has hitherto been so restrictive. As we discussed earlier, one of the first ways to do that is to simply ask better questions.
Here's a typical SAP scenario: Cindy works in Accounts Receivable. She needs a 30-60-90 day overdue report listing past-due customers and putting them into buckets according to whether they are 30 days, 60 days, or 90 days past due. Sharon in Finance gets the request and knows that she can have a standard ALV (ABAP List Viewer) report created or can extract the data and push it to a business warehouse (BW), where a report will be generated using MicroStrategy or whatever tools the team has.
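To make the 30-60-90 report concrete, here is a minimal sketch of the bucketing logic in Python. The invoice records, field names, and bucket boundaries (at least 30, at least 60, at least 90 days late) are our assumptions for illustration; a real report would read open items from SAP and follow the aging rules your Accounts Receivable team actually uses.

```python
from collections import defaultdict
from datetime import date


def overdue_bucket(due_date, as_of):
    """Return "30", "60", or "90" for an open invoice, or None if under 30 days late."""
    days_late = (as_of - due_date).days
    if days_late >= 90:
        return "90"
    if days_late >= 60:
        return "60"
    if days_late >= 30:
        return "30"
    return None


def aging_report(invoices, as_of):
    """Group customers by overdue bucket, skipping invoices not yet 30 days late."""
    report = defaultdict(list)
    for invoice in invoices:
        bucket = overdue_bucket(invoice["due"], as_of)
        if bucket is not None:
            report[bucket].append(invoice["customer"])
    return dict(report)


# Hypothetical open invoices.
invoices = [
    {"customer": "ACME", "due": date(2019, 1, 15), "amount": 1200.00},
    {"customer": "GLOBEX", "due": date(2019, 2, 20), "amount": 450.00},
    {"customer": "INITECH", "due": date(2019, 3, 25), "amount": 980.00},
    {"customer": "UMBRELLA", "due": date(2019, 4, 20), "amount": 75.00},
]

report = aging_report(invoices, as_of=date(2019, 4, 30))
for bucket in ("30", "60", "90"):
    print(bucket, report.get(bucket, []))
```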
What if we shifted Sharon's perspective to that of a data scientist? Sharon gets the report request. She knows she can deliver just what was requested, but then she thinks, "What more can be done?" She opens up a notepad and jots down some ideas.
Are there repeat offenders in late payments?
Are there any interesting correlations in the data? We know the customer name, customer payment history, customer purchases, and dollar amount.
Can we predict when a person will be paying late? How late?
Can we use this data to help rate our customers? When inventory is low, a lower-rated customer's order might be passed over in favor of a higher-rated customer making the same request.
What types of visualizations would be helpful?
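The first of Sharon's questions, about repeat offenders, needs nothing more than counting. The sketch below assumes a hypothetical payment history of (customer, days late) pairs, where zero or a negative number means the payment was on time; the names and the threshold of two late payments are ours.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical payment history: (customer, days late). Zero or negative = on time.
payments = [
    ("ACME", 12), ("ACME", 0), ("ACME", 35),
    ("GLOBEX", 0), ("GLOBEX", -3),
    ("INITECH", 5),
]


def late_payment_profile(payments):
    """Summarize late payments per customer: how many, and how late on average."""
    late = defaultdict(list)
    for customer, days_late in payments:
        if days_late > 0:
            late[customer].append(days_late)
    return {
        customer: {"late_count": len(days), "avg_days_late": mean(days)}
        for customer, days in late.items()
    }


def repeat_offenders(profile, min_late=2):
    """Customers with at least min_late late payments, in alphabetical order."""
    return sorted(c for c, p in profile.items() if p["late_count"] >= min_late)


profile = late_payment_profile(payments)
print(repeat_offenders(profile))
print(profile["ACME"])
```

A summary like this is also a reasonable starting point for the customer rating idea: the late count and average lateness are two candidate inputs to a score.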
Sharon sketches out an interactive dashboard report that she thinks would be very useful for her users. Armed with these ideas and sketches, Sharon asks the department data scientist (or SAP developer) about the possibilities.
There is a distinct difference in approaches here. The first is a typical SAP response, and limits the creative and intellectual capacity of the business analysts. The second leverages their creativity. Sharon won't just provide the requested information. When she sees the data in SAP and asks better questions, she'll be instrumental in substantial process improvements.
This is just one example. Think of the possibilities with all the requests a typical SAP team gets, and hence this book!
Another way to shift the thinking of the SAP team to be more dynamic and data centric is to use better tools. This is the responsibility of the SAP developer. Most SAP developers live in the world of SAP's application programming language, ABAP (Advanced Business Application Programming), and when asked to provide reports or process improvements turn instantly to the SAP GUI or Eclipse. This is where they're expected to spend time and deliver value.
Tip
ABAP originally stood for Allgemeiner Berichts-Aufbereitungs-Prozessor. It's a server-side language specially designed to extend the core functionality of SAP. You can create programs that display reports, run business transactions, or ingest outside system data and integrate it into SAP. A great many SAP ERP transactions run solely on ABAP code.
ABAP developers often specialize in one or more of the business functions that SAP provides. Since ABAP programs often directly enhance standard SAP features, ABAP developers become very familiar with how enterprises design their processes. It's very common for people familiar with ABAP to perform both technical programming roles and business analyst roles.
Tip
SAP developers, we implore you: view SAP as a data source. The presentation layer and logic layer of reports should be abstracted away from the database layer (see Figure 1-1). It is worth noting that SAP data is highly structured with strict business rules. One of the most obvious advantages of this approach is that the logic layer has access to other sources of data, such as public data. Within an SAP system, if a request was made to view the correlations between sales of galoshes and weather patterns, the weather data from NOAA would have to be brought into either BI or SAP itself. However, by using a tiered model the data can be accessed by the logic tier and presented in the presentation layer. Often the data is available through an API, which allows for access without storage. This model also allows the logic tier to tie into tools like Azure Machine Learning Studio to perform machine or deep learning on the SAP data.
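The galoshes example can be sketched as three small functions, one per tier. This is only an illustration of the separation, under our own assumptions: both sources are hard-coded stand-ins (a real data tier would read from SAP and a real weather source would call an external API), and every name and value here is hypothetical.

```python
def sap_sales_source():
    """Stand-in for the SAP data tier: galoshes sold per day (hypothetical rows)."""
    return {"2019-04-01": 14, "2019-04-02": 3, "2019-04-03": 21}


def weather_source():
    """Stand-in for an external weather API: rainfall in mm per day (hypothetical)."""
    return {"2019-04-01": 12.5, "2019-04-02": 0.0, "2019-04-03": 30.1, "2019-04-04": 4.0}


def logic_tier(sales, weather):
    """Join the two sources on date, keeping only days present in both."""
    return [
        {"date": day, "galoshes_sold": sales[day], "rain_mm": weather[day]}
        for day in sorted(sales.keys() & weather.keys())
    ]


def presentation_tier(rows):
    """Render the combined rows as a plain-text table."""
    lines = [f"{'date':<12}{'sold':>6}{'rain_mm':>9}"]
    lines += [
        f"{r['date']:<12}{r['galoshes_sold']:>6}{r['rain_mm']:>9.1f}" for r in rows
    ]
    return "\n".join(lines)


combined = logic_tier(sap_sales_source(), weather_source())
print(presentation_tier(combined))
```

Because the weather data is joined in the logic tier, it never has to be stored in SAP or the warehouse, and either source can be swapped out without touching the presentation code.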
SAP lacks the thousands of libraries in Python or the thousands of packages in R.1 It also lacks the ability to easily create dynamic/interactive dashboards and visualizations. Don't get us wrong: SAP does have tools to do advanced analytics, dashboards, and visualizations. It's just that they cost a lot of money, effort, and time. Some places have lead times measured in months or quarters before reports can be created, and sometimes the window for a valid business question is measured in hours. With the tools in this book, we intend to close that gap. If you're an SAP developer, we would strongly advise you to learn programming languages like Python and R so that you can use them to do your analytics on SAP data. Firstly, they are not limited to the SAP ecosystem and secondly, they are free.
Outside of SAP, there are numerous other tools to help SAP developers present their SAP data. You can use R Markdown in R, Shiny in R, Jupyter Notebooks in Python, Power BI, Tableau, Plotly...the list goes on. In this book we will provide presentation examples using Power BI, R Markdown, and Jupyter Notebooks.
A Quick Look: SAP Basics for Data Scientists
The lack of awareness around SAP is often surprising considering its size and ubiquity. Here's an amazing fact: 77% of the world's transaction revenue is involved, in one way or another, with an SAP system. If you spend money, you have more than likely interacted with SAP. And 92% of the Forbes Global 2000 largest companies are SAP customers.
But how in the world does SAP software touch all that? What does it do? While in recent years SAP has acquired a number of SaaS (Software as a Service) companies to broaden its portfolio and make shareholders richer, it began with its core focus on ERP: enterprise resource planning.
SAP started in Germany in 1972 under the sexy moniker Systemanalyse und Programmentwicklung. Running under DOS on IBM servers, the first functionality was a back-office financial accounting package. Modules soon followed for purchasing, inventory management, and invoice verification. You can see the theme emerging: doing the common stuff that businesses need to do.
That list of functionality may seem rather dull at first, especially to us cool hipster data scientists with Python modules and TensorWhatsits who know how to make a computer tell us that a picture has a dog (but not an airplane) in it. It's not magic like searching Google or using Siri on your iPhone. But SAP added a twist to those first few boring modules: integration. Inventory management directly affected purchasing, which directly affected financials, which directly affected...well, everything. That single SAP ERP system contained all of these modules. Now, instead of having to purchase and run separate financial/inventory/invoicing systems, companies saved loads of money. When one system gave them all the answers to business questions, customers started buying in droves. That was the value and the win of ERP. By the time Gartner coined the term ERP in the 1990s, SAP was doing over a billion Deutsche marks in yearly sales.
Note
Since such a high percentage of large companies around the world use SAP for so many business-critical functions, is it any wonder that so much business can be conducted inside it?
Getting Data Out of SAP
Like most large business applications, SAP ERP uses a relational database to house transactional and master data. It's designed such that customers can choose from many relational database management systems (RDBMS) to function as the SAP application database. Microsoft SQL Server, IBM DB2, Oracle, and SAP's MaxDB are all supported. In the last few years, SAP has rapidly introduced another proprietary database technology, HANA, as an RDBMS solution with in-memory technology. While future versions of SAP's core ERP product will one day require HANA, most SAP installations today still use one of the other technologies as their database.
Tip
In this book, we will introduce several ways of getting data out of your SAP system, none of which will require you to know exactly which DB your SAP system runs on. But if you're a true nerd, you'll find out anyway.
The relational databases that power the SAP instances at your company are huge and full of transactional and master data. They fully describe the shape of the vital business information stored and processed by SAP. The databases at the heart of your SAP systems are the source of truth for the discoveries you can make.
And unless it's your absolute last resort, you should never directly connect to them.
All right, we're being a little facetious here. You will find valid times to directly query data from the SAP databases with SQL statements. But the data model is so large and complex that fully understanding the structure of a simple sales order can involve over 40 tables and 1,000+ fields. Even SAP black belts have difficulty remembering all the various tables and fields they need to use, so imagine how inefficient it would be for a data scientist who is new to SAP to unpack all the various bits of requisite data.
BAPIs: Using the NetWeaver RFC Library
Data nerds who don't know SAP that well should start by examining the available Business Application Programming Interfaces (BAPIs) in the SAP system. BAPIs are remote-callable functions provided by SAP that expose the data in various business information documents. Instead of figuring out which of the 40+ sales order tables apply to your particular data question, you can look at the structure of various sales order BAPIs and determine if they fill that gap. The trouble of reverse engineering the data model is gone.
BAPIs also help by covering over system limitations from earlier versions. During the early period of SAP's core product development, the various modules restricted the number of characters that could denote a table or field. With SAP's remarkable stability over the years, those table and field names have stuck around. Without living inside SAP, how could you possibly know that "LIKP" and "VBELN" have anything to do with delivery data? BAPIs are a later addition, so they have grown up with interfaces that better describe their shape and function.
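As a taste of what calling a BAPI from Python can look like, here is a hedged sketch built around PyRFC, SAP's open source Python binding for the NetWeaver RFC library. It assumes the standard BAPI `BAPI_SALESORDER_GETLIST` with its `CUSTOMER_NUMBER` and `SALES_ORGANIZATION` parameters and `SALES_ORDERS` result table; check the exact interface in your own system before relying on it. The `FakeConnection` class, the helper names, and all sample values are hypothetical and exist only so the sketch runs without an SAP system.

```python
def open_connection(**params):
    """Open a real RFC connection. Requires PyRFC and the SAP NetWeaver RFC SDK."""
    from pyrfc import Connection  # imported lazily so the sketch loads without PyRFC
    return Connection(**params)


def list_sales_orders(conn, customer, sales_org):
    """Call BAPI_SALESORDER_GETLIST and return the rows of its SALES_ORDERS table."""
    result = conn.call(
        "BAPI_SALESORDER_GETLIST",
        # Numeric customer numbers are stored zero-padded to 10 characters.
        CUSTOMER_NUMBER=customer.zfill(10),
        SALES_ORGANIZATION=sales_org,
    )
    return result.get("SALES_ORDERS", [])


class FakeConnection:
    """Offline stand-in (hypothetical) that records calls and returns canned rows."""

    def __init__(self):
        self.calls = []

    def call(self, function_name, **parameters):
        self.calls.append((function_name, parameters))
        return {"SALES_ORDERS": [{"SD_DOC": "0000012345", "MATERIAL": "GALOSHES-01"}]}


# Against a real system you would use something like (all values are placeholders):
# conn = open_connection(ashost="sap.example.com", sysnr="00", client="100",
#                        user="MY_USER", passwd="MY_PASSWORD")
conn = FakeConnection()
orders = list_sales_orders(conn, customer="4711", sales_org="1000")
print(orders)
```

Passing the connection in as an argument keeps the BAPI call testable offline, which is handy when access to an SAP sandbox is hard to come by.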
OData
SAP NetWeaver Gateway represents one of SAP's many ways of breaking into the modern web era. It's an SAP module (in some cases running enough of its own stuff to be worth a separate system) that allows SAP developers to quickly and easily establish HTTP connections to SAP backend business data. We predict that you'll see examples of using SAP NetWeaver Gateway in Chapter 6.
The foundational layer of transport is known as OData. OData represents many tech companies coming together to put forward a standard way of communicating over the web via RESTful APIs. It provides a common format for data going over the web using either XML or JSON, ways for clients to indicate the basic create/read/update/delete operations for server data, and an XML-based metadata document in which servers tell clients exactly which fields, structures, and options are available for interacting with the data they provide.
Using OData through SAP NetWeaver Gateway requires programming in SAP's native backend language, ABAP. Some of our SAP-native readers may be well versed in this language and can produce Gateway OData APIs. Other readers will likely be unfamiliar, but should take solace: if your company runs SAP in any meaningful way, your company will have people who know ABAP. These people will either know how to create OData services, or will be able to quickly learn since it's not difficult.
Choose OData when you can't find a BAPI that meets your data needs. It's a great middle ground that provides SAP administrators with the flexibility to meter and monitor its usage. It also gives developers the ability to put together data in any way they choose. Another benefit of using OData is that it doesn't require a NetWeaver connector like the BAPI method: any device that can make HTTP requests safely inside the corporate network will be able to make OData requests.
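Because OData is plain HTTP, a request is just a URL with a few standard query options, and a response is just JSON. The sketch below builds such a URL and unpacks a response using only the Python standard library. The host, the service name `ZSALES_SRV`, the entity set, and the field names are hypothetical; the sample payload follows the OData V2 JSON layout (results nested under `d`), with a fallback for the V4 layout (results under `value`).

```python
import json
from urllib.parse import quote, urlencode


def build_odata_url(base_url, entity_set, filter_expr=None, select=None, top=None):
    """Build a read URL using the standard $format, $filter, $select, $top options."""
    options = {"$format": "json"}
    if filter_expr:
        options["$filter"] = filter_expr
    if select:
        options["$select"] = ",".join(select)
    if top is not None:
        options["$top"] = str(top)
    # Keep "$", "," and "'" readable; percent-encode everything else (spaces as %20).
    query = urlencode(options, quote_via=quote, safe="$,'")
    return f"{base_url.rstrip('/')}/{entity_set}?{query}"


def parse_odata_results(payload):
    """Return the list of entities from an OData V2 or V4 JSON response body."""
    body = json.loads(payload)
    if "d" in body:                 # OData V2 layout
        return body["d"]["results"]
    return body.get("value", [])    # OData V4 layout


url = build_odata_url(
    "https://sap.example.com/sap/opu/odata/sap/ZSALES_SRV/",  # hypothetical service
    "SalesOrders",
    filter_expr="CustomerId eq '4711'",
    select=["OrderId", "NetValue"],
    top=5,
)
print(url)

# Canned V2-style response standing in for what the server would send back.
sample = '{"d": {"results": [{"OrderId": "12345", "NetValue": "120.50"}]}}'
print(parse_odata_results(sample))
```

Any HTTP client can then send the request; authentication and network access depend on how your Gateway system is set up.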
Other ways to get data
If you can't find the right BAPI and you can't find the resources to make an OData service, there are always a few other routes you can take. We'll cover those more briefly, since they aren't things we typically recommend.2
Web services
SAP allows you to create web services based on its Internet Communication Manager (ICM) layer. These web services allow you to work even more flexibly than OData, but they still require ABAP knowledge. The space between OData with Gateway and a totally custom SAP web service is small, so consider carefully whether your data question really can't be answered with OData.
Direct database access
Everyone says you shouldn't, but we've all also encountered one or two times when it was the only thing that would work. If you need to go this route, a key task will be ensuring that the data you extract matches up with what SAP provides on the screen to end users. Many times there are hidden input/output conversions and layers of data modeling that don't become apparent when just browsing through a data model.
Seriously. Picking directly from an SAP database is like driving a Formula One car with brake problems. You'll get where you need to go really fast, but you'll probably smash into a wall or two on the way.
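Two of the hidden conversions mentioned above show up almost immediately when you read raw tables: numeric keys such as customer numbers are stored with leading zeros, and dates are stored as eight-character `YYYYMMDD` strings, with `00000000` meaning "no date." The helpers below are a simplified sketch of undoing those two conversions. The function names are ours, and they do not reproduce every rule of SAP's own conversion routines.

```python
from datetime import date


def strip_leading_zeros(value):
    """Show a zero-padded numeric key the way users see it on screen.

    Non-numeric keys and all-zero values are returned unchanged in this sketch.
    """
    if not value.isdigit():
        return value
    stripped = value.lstrip("0")
    return stripped if stripped else value


def parse_sap_date(raw):
    """Convert a YYYYMMDD string to a date; the initial value 00000000 becomes None."""
    if raw == "00000000":
        return None
    return date(int(raw[0:4]), int(raw[4:6]), int(raw[6:8]))


# A hypothetical raw row as it might come back from a direct database query.
raw_row = {"customer": "0000004711", "created_on": "20190430", "blocked_on": "00000000"}

clean_row = {
    "customer": strip_leading_zeros(raw_row["customer"]),
    "created_on": parse_sap_date(raw_row["created_on"]),
    "blocked_on": parse_sap_date(raw_row["blocked_on"]),
}
print(clean_row)
```

Comparing a handful of cleaned rows against what the corresponding SAP screen shows is a cheap way to catch conversions you have missed.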
Roles and Responsibilities
Data science combines a range of skill sets. These often include statistics, programming, machine learning, analysis, architecture, and engineering. Many blogs and posts online discuss the differences between data science roles. There are innumerable job titles and delineations. One camp divides roles into data analysts, data engineers, data architects, data scientists, and data generalists. Other groups have their own delineations.
Readers should understand something very important. Unless you are at a very large company with a data science team, you will be lucky to have one person on your team with some of these skills. These job delineations exist in theory for all, but in practice for only a small percentage. Be prepared to wear many hats. If you apply some of these forays into data science at your company, be prepared to do the work yourself. Don't have a SQL database and want to extract and store some SAP data? We'll introduce this. Want to automate a workflow for extraction? Here you go. Everything from the SAP data to the presentation layer will be covered.
Our intention is clear: we want to create citizen data scientists who understand what it takes to make data science work at their organizations. You may not have any resources to help you, and you may get resistance when you ask for some of these things. Often, you must prove your theory before someone helps. We understand that the roles and responsibilities are not well defined. We hope to give you an overview of the landscape. If you're reading this book, you've already rolled up your sleeves and are ready to do everything from building SQL databases to presenting machine learning results in Power BI.
Summary
A huge part of getting value is communicating it. We went over how to tell great stories with the data you find in SAP: identify your story, find the audience, discover the data, and apply rigorous tooling to that discovered data. Sometimes all it takes to communicate the story is one simple graph. Other times it may require detailed lists of results. But no matter what visual method conveys your findings, be prepared to tell a story with it.
SAP professionals looking to tell stories about their data should look at tools such as programming languages like Python and R, and visualization tools like Tableau and Power BI. Look at Chapter 2 to dive deeper.
Data scientists looking to discover what's in SAP should look at ways of getting that data out. BAPIs provide a function-based approach to retrieving data, OData sets up repeatable and predictable HTTP services, and you can always dump screen data to Excel or directly query the SAP database as a last resort. Look at Chapter 3 to find out more.
We want you to get the most out of the SAP data that's ripe for the picking in your enterprise, and the best way to get value out of raw data is by applying data science principles. This book will show you how to marry the world of SAP with the world of data science.
1 For a taste of how expansive the R package landscape is, see this blog post for perspective on package list growth and search strategies for finding the right ones.
2 However, this book couldn't be called "Practical" if we didn't acknowledge that the worst hacks and ill-advised duct-tape solutions make up at least 50% of any real-world environment.