Chapter 1. Why Use External Data in Your Analytics?

According to Analytics Steps, most companies nowadays are taking advantage of data to improve their competitive position and market response rate. Newsfeeds are awash with stories detailing how retailers, banks, and social media platforms are leveraging data. You can’t seem to buy a coffee, post on social media, or listen to your favorite song without a company asking for your personal details. Global giant Spotify, for instance, delivers music to listeners around the world and openly uses internal data for the following purposes:

  • Marketing, promotion, and advertising campaigns

  • Feature development and evaluation

  • Business planning, reporting, and forecasting

  • Fraud detection and prevention

However, Spotify also uses third-party or external data to deliver more relevant advertising to its listeners. Spotify’s ad partners help it deliver tailored ads that match your interests or moods, such as which cars a car lover might want to know more about.

Today, innovative organizations—those already using advanced analytics software powered by big data—are like Spotify. According to TechTarget, they’re combining data from a variety of internal and external sources to enhance customer service, boost sales, make marketing more efficient, enhance products and services, and infuse more real intelligence into their operations.

Instead of simply deriving static reports from data that has been moved in and out of data warehouses, clever companies are using advanced analytics tools that can simultaneously collect, mix, and match diverse data from disparate data sources in order to improve products and brand loyalty, generate better conversions, identify trends earlier, and pinpoint additional ways to improve overall customer satisfaction.

According to Jennifer Belissent’s Forrester blog, the organizations that can create better infrastructure to collect, store, analyze, and leverage external data—and successfully integrate it into their operations with their internal data—can outperform other companies by unlocking improvements in growth, productivity, and risk management.

This report expands on these points and answers the following questions about the rise of external data:

  • How is new technology making external data easier to use with analytics?

  • How does an external data platform fit into your data architecture?

  • How can you start leveraging external data today?

Fusing Internal Data with the Right External Data

Footfall traffic is a good example of external data at work. Footfall is how retailers describe the number of customers who enter their stores. Cuebiq explains footfall attribution, a related concept, as a method used to correlate digital marketing campaign impressions and actual store visits.

According to Knorex, footfall attribution is essentially an ingenious mix of mobile campaign impression results and the data collected from actual store visits. Instead of just relying on mobile marketing techniques, cafes, restaurants, supermarkets, and various retailers can use footfall attribution to gain valuable insight into competitive analysis, temporal analysis, and customer analysis. Most importantly, they can measure what matters—their precise sales growth.

While footfall traffic is an extremely effective approach for helping retailers and restaurants monitor and grow sales, it is only the beginning of the possibilities with external data. For example, the Wall Street Journal reported that retail giant Canadian Tire delivered a 19.1% increase in year-over-year retail sales through Q3 2020 by mixing and matching foot traffic, weather, traffic patterns, and shifts in demand for bicycles and outdoor furniture.

Hershey, meanwhile, was better able to pivot its supply chain during the 2020 lockdowns by understanding the different types of chocolate that were being consumed at home. According to the Wall Street Journal, this directly led to a 5.5% increase in same-store sales.

Tom Davenport, author of The AI Advantage, underscored the ever-expanding importance of fusing internal data with the right external data when he wrote for MIT Sloan Management Review:

Trying to model low-probability, highly disruptive events will require an increase in the amount of external data used to better account for how the world is changing. The right external data could provide an earlier warning signal than what can be provided by internal data.
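The kind of fusion described in this section can be sketched in a few lines of pandas. The figures and column names below are hypothetical, purely for illustration: internal daily sales joined with external weather data on a shared date key, then enriched with a simple derived signal.

```python
import pandas as pd

# Internal data: daily store sales (hypothetical figures)
sales = pd.DataFrame({
    "date": pd.to_datetime(["2020-07-01", "2020-07-02", "2020-07-03"]),
    "store_id": [1, 1, 1],
    "units_sold": [120, 95, 150],
})

# External data: daily weather for the store's region (hypothetical figures)
weather = pd.DataFrame({
    "date": pd.to_datetime(["2020-07-01", "2020-07-02", "2020-07-03"]),
    "avg_temp_c": [24.0, 18.5, 27.0],
    "precip_mm": [0.0, 12.4, 0.0],
})

# Fuse the two sources on the shared date key
enriched = sales.merge(weather, on="date", how="left")

# A simple derived signal: flag rainy days to help explain demand dips
enriched["rainy"] = enriched["precip_mm"] > 0
print(enriched[["date", "units_sold", "avg_temp_c", "rainy"]])
```

In practice the hard part is not the join itself but sourcing, cleaning, and aligning the external feed (time zones, granularity, geography) so that a key like `date` actually matches across systems.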

Data Hunters Know That Quality Always Outfoxes Quantity

As decision makers come to understand the value that seeing the big data picture brings to their organizations, they are working harder than ever to find the data sources that will give them an edge. Deloitte Insights notes that 92% of data analytics professionals said their firms needed to increase the use of external data sources, while 54% said their company plans to increase spending on it. As noted in Belissent’s blog, the former chief data officer (CDO) of Flagstar Bank acknowledged, “With our own data, we can only look internally. We need to see industry benchmarks, regional trends, what waves we can ride on; we derive competitive advantage by getting data from outside and enhancing our own data.”

This new quest for the Holy Grail of data—to get a leg up with an inside scoop or unique source of information or local knowledge—has led Forrester to the following conclusions:

The demand for external data is increasing in parallel with firms’ abilities to source it.
Companies are already well aware of the need to better leverage internal data such as transaction data, customer interactions, and other process and performance metrics. Yet, there is more need now to supplement their internal data with external data such as weather, traffic, social media listening, partner data, and economic data from third-party sources.
The supply of data is keeping pace with demand.
As more companies are looking for external data, data supply and data sources have accelerated. Traditional data providers are issuing new data offerings as “originators scrape websites for pricing and product information.”
The number of firms commercializing data is mushrooming.
Data marketplaces have been popping up all over the place. Data commercialization is also on the rise, with companies developing new data-fueled products and services and data brokers assisting clients to spot new sources and even host events for niche data to meet potential buyers.

The rise in demand for external data has fostered a new employment role and changed the face of mergers and acquisitions. According to InformationWeek, data hunters—also known as data acquisition specialists and data scouts—are now in high demand, and, according to Nextgov, companies are creating new positions solely to find the best external data. As Econsultancy reported, Swiss pharma giant Roche’s acquisition of the health technology company Flatiron Health, which helps collect data that could be used for cancer research, also shows how far companies are willing to go to get the best third-party data.

How Does External Data Improve Analytics?

In this section, we explore how external data enhances the way organizations analyze and interpret data beyond their own apps and databases.

Learning About the Power of External Data from Your Next Dinner Delivery

If you’re working from home and starving for lunch, you’re more likely to stay inside and order takeout if it’s raining outside. Any one of the thousands of food delivery drivers across America knows that bad weather leads to delivery spikes. But now, drivers for Deliveroo have at their fingertips the external data to back up this claim, and they can also instantaneously share changes in the weather or traffic with fellow drivers.

For online delivery platforms such as Deliveroo, leveraging external data is becoming paramount in an industry with stiff competition and razor-thin profit margins. Meanwhile, DoorDash, a leading online delivery platform, is looking to upgrade its data architecture to access geospatial data to better understand the economic impact of a store’s location, according to Snowflake, “to analyze different configurations and extend the geometry to influence supply and demand.”

Another online food ordering company, Grubhub, studies external data to understand the loyalty of its customers. Grubhub believes its online diners are becoming “more promiscuous,” that is, the company is concerned that its newer diners are increasingly coming to Grubhub after already having made orders on a competing online platform.

Grubhub’s external data indicates that its existing diners are increasingly ordering from multiple platforms and find that this so-called “platform-sharing” is most common among its newest diners and markets. The trend, however, is also spreading to Grubhub’s core diner base.

For instance, according to Bloomberg Second Measure, 61% of Grubhub customers did not use another meal delivery service in the second quarter of 2019, but that number fell to 46% two years later. DoorDash, conversely, had 58% of its customers using it exclusively in the second quarter of 2021. By harnessing external data, Grubhub now realizes that the “easy wins in the market” are quickly evaporating.

Data Is Food for AI, So Don’t Feed It Junk

Computer scientist and technology entrepreneur Andrew Ng recently went on record to emphasize a shift toward a data-centric approach to machine learning and AI. He explains the importance of companies using the correct data over simply using more data. His message is straightforward and worth keeping in mind the next time you order from a meal delivery service: data is food for AI, so don’t feed it junk. Ng makes several sobering observations:

  • He notes that 80% of the effort on a machine learning/AI project is spent on preparing the data, and only 20% on modeling. Despite this fact, 99% of AI research focuses on model-centric approaches to improving results.

  • The most vital task of MLOps is to ensure consistently high-quality data in all phases of the machine learning project lifecycle.

  • Cleaning up labels (making them more consistent) is a more efficient way of improving accuracy than collecting more data, especially for small datasets (<10,000 observations).

Ng reiterates that big data should focus on improving data rather than model accuracy: “Now that the models have advanced to a certain point, we have got to make the data work as well.”
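Ng's point about cleaning labels can be made concrete with a tiny sketch. The labels below are hypothetical annotator outputs for the same two defect classes; collapsing their casing and whitespace variants into canonical labels is often cheaper, and more effective on small datasets, than collecting more rows.

```python
# Hypothetical raw annotations: six string variants, but only two real classes
raw_labels = ["scratch", "Scratch", " scratch ", "SCRATCH", "dent", "Dent"]

def normalize_label(label: str) -> str:
    """Collapse casing and whitespace variants into one canonical label."""
    return label.strip().lower()

clean_labels = [normalize_label(l) for l in raw_labels]
print(len(set(raw_labels)))       # six distinct raw variants
print(sorted(set(clean_labels)))  # two consistent classes
```

A model trained on the raw labels would treat each variant as a separate class; the one-line normalization restores the true label space without touching the model at all.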

In an ACM Transactions on Computer-Human Interaction article, Google researchers support Ng’s claim:

Paradoxically, for AI researchers and developers, data is often the least incentivized aspect, viewed as ‘operational’ relative to the lionized work of building novel models and algorithms. Intuitively, AI developers understand that data quality matters, often spending inordinate amounts of time on data tasks. In practice, most organizations fail to create or meet any data quality standards, from under-valuing data work vis-a-vis model development.

Nothing threw big data’s model-centric approach into question more than the COVID-19 pandemic. A recent McKinsey article states:

In a few short months, consumer purchasing habits, activities, and digital behavior changed dramatically, making pre-existing consumer research, forecasts, and predictive models obsolete. Moreover, as organizations scrambled to understand these changing patterns, they discovered diminishing value in their internal data. Meanwhile, a wealth of external data could—and still can—help organizations plan and respond at a granular level.

According to Forbes, advanced external data may include brand loyalty on social media, real-time product information (price, discount, stock status, etc.) in ecommerce marketplaces or competitors’ websites, and suppliers’ information tracking. Kabbage is an example of a fintech company taking advantage of external data. Kabbage determines eligibility for issuing loans—and the terms under which a business would pay it back—by tapping a vast variety of sources, from traditional accounting statements to social media signals. The fintech company then loads this data into its proprietary machine learning algorithms.

The report from Deloitte Insights notes multiple other examples of analytics programs generating value from external data, such as helping businesses personalize marketing offers, enhancing HR decisions, acquiring new revenue streams by launching new products or services, enhancing risk visibility and mitigation, and anticipating shifts in demand more precisely for their products and services.

Angie King, principal at End-to-End Analytics, recently told the MIT Sloan Management Review: “The benefit of using external data is so great that there are businesses built around gathering this data, consolidating it, cleaning it, and packaging it up for use by other companies.” King also noted that using external data can improve a company’s predictive analytics and machine learning models: “Without having external data capturing these events, the predictive models wouldn’t be able to infer the reason for the resulting spikes or drops in sales.”

Whether it’s an agro corporation using geolocation and weather data to help a farmer, a bank accessing social media to determine credit worthiness, or a logistics manager using a news feed to determine potential supply chain disruptions, companies should start identifying the best new technology that is making external data easier to use to augment their analytics and machine learning models.

Differentiating Data Providers from Data Marketplaces

While collecting the correct data is paramount, timely, relevant, and high-quality data remains elusive. A Forrester Consulting study found that 99% of firms surveyed faced issues with customer data, while 96% indicated that timeliness and accuracy issues with acquiring customer data were big problems. Determining what data is needed before purchasing it can be difficult, and pinpointing up-to-date, high-grade data is equally challenging. This problem has led to the emergence of data marketplaces.

Data marketplaces are online platforms that facilitate the buying and selling of datasets from several different sources. Data marketplaces are usually cloud services where individuals or businesses upload data to the cloud and provide self-service data access while guaranteeing security, consistency, and high quality of data for both parties.

Data marketplaces also facilitate data monetization. An AI software platform that wants to train and sell its AI-based models, for example, could purchase data from a marketplace. Data marketplaces include personal, B2B, and sensor/IoT (Internet of Things) data, and offer the following types of data:

  • Business intelligence (BI)

  • Market research

  • Geospatial

  • Demographic

  • Firmographic

  • Public

While data marketplaces seek to build an ecosystem of data providers and data consumers by providing data access, purchasing a dataset is no guarantee of a specific business outcome. Buying datasets can also come at a very steep price, as licenses can cost as much as hundreds of thousands of dollars.

Here are some other limitations of data marketplaces:

  • Integrating external data can be a costly challenge requiring separate tools, platforms, or data science teams.

  • Purchasing, formatting, handling, managing, and integrating external data doesn’t guarantee ROI.

  • Locating the most robust, applicable data isn’t always straightforward.

  • Converting external data to match the format of the organization’s internal data can be a serious setback.

  • Adhering to security and compliance regulations such as General Data Protection Regulation (GDPR) and California Consumer Privacy Act (CCPA) can become a major headache.

Understanding which data is best for your business needs and then unlocking and integrating it with your own internal data is where an external data platform steps in. An external data platform can help with every step of the data acquisition process—from data discovery to data prep, integration, model training, compliance, deployment, and model retraining. An end-to-end external data platform can provide access to all of the relevant external data sources in one platform, allowing you to understand which data signals you need and how they will drive ROI.

What Are Data Signals?

Data signals are pieces of data that help you understand and contextualize the experiences or situations of your audience, clients, partners, or patients. Common data signal categories include:

Company data
Basic data like industry and North American Industry Classification System (NAICS)/Standard Industrial Classification (SIC) codes, economic indicators like revenue trends, technologies, reviews, and more
People data
Contact information, social networks, interests and hobbies, purchase habits, and more
Geospatial data
Demographic data, footfall traffic and trends, and other indicators
Temporal data
National and local events, weather indicators and history, etc.
Product data
Pricing data, ratings, and retailers, etc.

KPMG notes how the three following data signals have specifically been used to help healthcare during the COVID-19 crisis:

  • Static (e.g., the geographical distance between a home and a hospital)

  • Slow-moving (e.g., the ratio of health-care professionals to people in an area)

  • Fast-moving (e.g., first-time unemployment filings)

These signals showed which locations were predisposed to a larger and longer-sustained impact from the pandemic and, by extension, more lasting macroeconomic effects. Whether people in a specific area rely on mass transit, for example, or how many residents of that area have past diagnoses of certain relevant diseases, helps predict the severity of the local impact. KPMG is also adding a vital fourth category: pop-up data, such as the most recent number of confirmed COVID-19 cases reported in a specific city or county.

KPMG also notes that, by leveraging all four varieties of signals, insurers and lenders can now explore and mitigate risk at an extremely local level, helping business leaders predict which potential customers carry the greatest risk of mortgage default or insurance claim submission.

Marketers, meanwhile, can implement external audience data known as experience signals that come from different systems, channels, and in-house technology. Signal types include digital clickstream data, ecommerce information, POS (point-of-sale) data, call center interactions, CRM data, service interactions, IoT data, HR data, sentiment captured from videos, sales and marketing tools, and even survey data.

“Spray and Pray” Marketing Methods Are Being Replaced with More Personalized Interactions That Use Relevant External Data

Marketing methods such as surveys, however, may be going the way of the horse and buggy. A recent Deloitte report reiterates that even though customers are increasingly frustrated by such generic offers, most marketers continue their “spray and pray” mass marketing techniques and show little sign of changing. Today’s customers, however, only want interactions that are relevant, personalized, and based on a consumer’s situation and preferences.

The casino industry, for example, is now revisiting how third-party data can supplement traditional first-party data and gaming metrics. Casinos are now looking at three key factors:

  • The growing importance of nongaming spend

  • The rapid growth of digital (accelerated by the pandemic)

  • The rapidly rising consumer expectations around marketing personalization

Casino marketers can now dive deeper into what their guests and prospects are consuming online and which intent signals they’re exhibiting, such as other sites visited, common online transactions, interests and hobbies, and countless other individual variables. With this knowledge, casino marketers can create an experience and message that matches real consumer needs.

Casinos also now understand that they’re competing with travel brands across a consumer’s share of travel. With the integration of third-party data, casino marketers are able to see what other brands their customers are engaging with and use that knowledge to help them understand their broader competitive set.

What Are Data Features?

Data features are specific variables that make up a dataset. The most common features, or measurable pieces of data, include name, age, nationality, race, height, weight, and sex. A feature’s data type may be a percentage, a category, a number, a date, and so on.

Determining the correct features depends on which business problem you would like to solve and what your business goals are. Even within the same industry, different businesses require different features. However, displaying too much information can divert focus away from the essential metrics, and overloading an analytics model with unnecessary features can decrease its accuracy and efficiency. This is where feature engineering comes into play, ensuring that only the attributes relevant to the business problem are selected and fed into the analytics model.

Choosing the correct features will greatly enhance the efficacy of your machine learning model. At the same time, intelligent feature engineering optimizes models by selecting only the relevant variables, thereby reducing the effort to retrain a model if new features are added later on.
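One simple form of the feature selection described above can be sketched with a correlation filter: keep only the features whose correlation with the target clears a threshold. The dataset below is synthetic and the column names (ad_spend, footfall, sales) are hypothetical; a real project would use richer selection methods, but the principle is the same.

```python
import numpy as np
import pandas as pd

# Synthetic data: sales depend on ad spend and footfall, not on noise
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "ad_spend": rng.normal(100, 20, n),
    "footfall": rng.normal(500, 50, n),
    "noise": rng.normal(0, 1, n),  # an irrelevant feature
})
df["sales"] = 3 * df["ad_spend"] + 0.5 * df["footfall"] + rng.normal(0, 10, n)

# Keep features whose absolute correlation with the target exceeds 0.3
corr = df.corr()["sales"].drop("sales").abs()
selected = corr[corr > 0.3].index.tolist()
print(selected)  # the irrelevant 'noise' column is filtered out
```

Dropping the irrelevant column before training keeps the model focused on the signals that actually drive the outcome, which is exactly the efficiency gain the paragraph above describes.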
