Chapter 4. Deriving Value from the Data Lake

Self-service consumption is essential for a successful data lake. Different types of users consume the data, and they are looking for different things—but each wants to access the data in a self-service manner, without the help of IT.

The Executive

An executive is usually a person in senior management looking for high-level analyses that can help them make important business decisions. For example, an executive could be looking for predictive analytics of product sales based on history and analytical models built by data scientists. In an integrated data lake management platform, data would be ingested from various sources (some streaming, some batch) and then processed in batches to produce insights, with the final data visualized in a tool such as Tableau or Excel. Another common example is an executive who needs a 360-degree view of a customer, including metrics from every level of the organization (pre-sales, sales, and customer support) in a single report.

The Data Scientist

Data scientists are typically looking at the datasets and trying to build models on top of them, performing exploratory ad hoc analyses to prove or come up with a thesis about what they see. Data scientists who want to build and test their models will find a data lake useful because it gives them access to all of the data, not just a sample. Additionally, they can build scripts in Python and run them on a cluster to get a response within hours rather than days.

The Business Analyst

Business analysts usually try to correlate some of the datasets and create an aggregated view to slice and dice using a business intelligence or visualization tool. With a traditional data warehouse, business analysts had to come up with reporting requirements and wait for IT to build a report or export the data on their behalf. Now, business analysts can ask “what if” questions from data lakes on their own. For example, an analyst might ask how much effect weather patterns had on sales based on historical data and information from public datasets combined with in-house datasets in the data lake. Without involving IT, the analyst could consult the catalog to see what datasets have been cleaned and standardized and run queries against that data.

The Downstream System

A fourth type of consumer is a downstream system, such as an application or a platform, which receives the raw or refined data. Leading companies are building new applications and products on top of their data lake, so these applications are also consumers of the data. They might also use RESTful APIs or some other API mechanism on an ongoing basis. For example, if the downstream application is a database, the data lake can ingest and transform the data and then send the final aggregated data to the downstream system for storage.

Self-Service

The purpose of a data lake is to provide value to the business by serving users. From a user perspective, here are the most important questions to ask about the data:

What is in the data lake (the catalog)?
What is the quality of the data?
What is the profile of the data?
What is the metadata of the data?
How can users do enrichments, clean-ups, enhancements, and aggregations without going to IT (how to use the data lake in a self-service way)?
How can users annotate and tag the data?

Answering these questions requires that proper architecture, governance, and security rules are put in place and adhered to so that the appropriate people gain access to the relevant data in a timely manner. There also needs to be strict governance in the onboarding of datasets, naming conventions must be established and enforced, and security policies need to be in place to ensure role-based access control.

For our purposes, self-service means that nontechnical business users can access and analyze data without involving IT. In a self-service model, users should be able to see the metadata and profiles and understand what the attributes of each dataset mean. The metadata must provide enough information for users to create new data formats out of existing data formats by using enrichments and analytics.

Also, in a self-service model, the catalog will be the foundation for users to register all of the different datasets in the data lake. This means that users can go to the data lake and search to find the datasets they need. They should also be able to search on any kind of attribute; for example, on a time window such as January 1st to February 1st, or based on a subject area, such as marketing versus finance. Users should also be able to find datasets based on attributes; for example, they could enter, “Show me all of the datasets that have a field called discount or percentage.”
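The kinds of catalog searches described above can be sketched in a few lines of Python. This is only an illustration, not the interface of any particular catalog product: the CatalogEntry class, the search function, and the sample entries are all hypothetical, and a real catalog would sit on top of a metadata store rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CatalogEntry:
    """One registered dataset; every field name here is hypothetical."""
    name: str
    subject_area: str
    first_day: date   # earliest day covered by the data
    last_day: date    # latest day covered by the data
    fields: tuple     # attribute (column) names in the dataset


def search(catalog, subject_area=None, window=None, any_field=None):
    """Return the entries that match every criterion that was supplied."""
    results = []
    for entry in catalog:
        if subject_area is not None and entry.subject_area != subject_area:
            continue
        if window is not None:
            start, end = window
            # Keep the entry only if its coverage overlaps the time window.
            if entry.last_day < start or entry.first_day > end:
                continue
        if any_field is not None and not set(any_field) & set(entry.fields):
            continue
        results.append(entry)
    return results


# Invented sample entries, for illustration only.
catalog = [
    CatalogEntry("web_orders", "marketing", date(2024, 1, 1), date(2024, 3, 31),
                 ("order_id", "discount", "total")),
    CatalogEntry("ledger", "finance", date(2023, 1, 1), date(2023, 12, 31),
                 ("account", "amount")),
    CatalogEntry("promotions", "marketing", date(2024, 2, 15), date(2024, 2, 28),
                 ("promo_id", "percentage")),
]

# "Show me all of the datasets that have a field called discount or percentage."
with_discount = search(catalog, any_field=("discount", "percentage"))

# Marketing datasets that cover any part of January 1st to February 1st.
january_marketing = search(
    catalog,
    subject_area="marketing",
    window=(date(2024, 1, 1), date(2024, 2, 1)),
)

print([e.name for e in with_discount])      # ['web_orders', 'promotions']
print([e.name for e in january_marketing])  # ['web_orders']
```

Each criterion is optional, so the same function serves a search by subject area, by time window, by field name, or by any combination of the three.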

It is in the self-service capability that best practices for the various types of metadata come into play. Business users are interested in the business metadata, such as the source systems, the frequency with which the data comes in, and the descriptions of the datasets or attributes. Users are also interested in knowing the technical metadata: the structure, format, and schema of the data.

When it comes to operational data, users want to see information about lineage, including when the data was ingested into the data lake, and whether it was raw at the time of ingestion. If the data was not raw when ingested, users should be able to see how it was created and what other datasets were used to create it. Also important to operational data is the quality of the data. Users should be able to define certain rules about data quality, and use them to perform checks on the datasets.
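As a rough sketch of user-defined quality rules, the following Python treats a rule as a description paired with a per-row check. The helper names (not_null, in_range, run_checks), the column names, and the sample rows are assumptions made for illustration; a data lake management platform would normally express such rules through its own interface.

```python
def not_null(column):
    """Rule: the column must be present and non-empty in every row."""
    def check(row):
        return row.get(column) not in (None, "")
    return f"{column} is not null", check


def in_range(column, low, high):
    """Rule: the column must hold a number between low and high."""
    def check(row):
        value = row.get(column)
        return isinstance(value, (int, float)) and low <= value <= high
    return f"{column} in [{low}, {high}]", check


def run_checks(rows, rules):
    """Return {rule description: number of rows that fail the rule}."""
    return {
        description: sum(1 for row in rows if not check(row))
        for description, check in rules
    }


# Invented sample rows, for illustration only.
rows = [
    {"customer_id": "C1", "discount": 10},
    {"customer_id": "", "discount": 15},     # fails the not-null rule
    {"customer_id": "C3", "discount": 140},  # fails the range rule
    {"customer_id": "C4", "discount": None}, # fails the range rule
]

rules = [not_null("customer_id"), in_range("discount", 0, 100)]
failures = run_checks(rows, rules)
print(failures)
```

Reporting failure counts per rule, rather than a single pass or fail flag, lets a user judge whether a dataset is still good enough for a given analysis.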

Users might also want to see the ingestion history. If a user is looking at streaming data, for example, they might search for days where no data came in, as a way of ensuring that those days are not included in the representative datasets for campaign analytics. Overall, access to lineage information, the ability to perform quality checks, and ingestion history give business users a good sense of the data, making it possible for them to quickly begin analytics.
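The search for days where no data came in can be sketched as follows, assuming the ingestion history is available as a list of dates on which at least one batch or stream segment arrived. The function name and the sample history are hypothetical.

```python
from datetime import date, timedelta


def days_without_data(ingested_days, start, end):
    """List every day from start to end (inclusive) with no ingestion."""
    received = set(ingested_days)  # duplicates collapse; lookups are fast
    missing = []
    day = start
    while day <= end:
        if day not in received:
            missing.append(day)
        day += timedelta(days=1)
    return missing


# Invented ingestion history, for illustration only.
history = [
    date(2024, 3, 1),
    date(2024, 3, 2),
    date(2024, 3, 4),
    date(2024, 3, 4),  # two arrivals on the same day
    date(2024, 3, 7),
]

gaps = days_without_data(history, date(2024, 3, 1), date(2024, 3, 7))
print([d.isoformat() for d in gaps])  # ['2024-03-03', '2024-03-05', '2024-03-06']
```

The resulting list of gap days can then be used to exclude those days from a representative dataset before analysis begins.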

Controlling Access

Many IT organizations are simply overwhelmed by the sheer volume of datasets—small, medium, and large—that are related but not integrated when they are stored in data lakes. However, when done right, data lakes allow organizations to gain insights and discover relationships between datasets.

Security is critical when providing various users (whether C-level executives, business analysts, or data scientists) with the tools they need. Setting and enforcing the security policies consistently is essential for successful use of a data lake. In-memory technologies should support different access patterns for each user group, depending on their needs. For example, a report generated for a C-level executive might be very sensitive and should not be available to others who don’t have the same access privileges. Data scientists might need more flexibility, with less governance; for this group, you might create a sandbox for exploratory work. By the same token, users in a company’s marketing department should not have access to the same data as users in the finance department. With security policies in place, users have access only to the datasets assigned to their privilege levels.
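A minimal sketch of role-based access control along these lines might look like the following. The role names, dataset names, and the ROLE_GRANTS mapping are invented for illustration; production data lakes would typically delegate this to a dedicated policy or authorization service rather than a dictionary in application code.

```python
# Hypothetical mapping of role -> datasets that role may read.
ROLE_GRANTS = {
    "executive": {"executive_reports", "sales_summary"},
    "marketing": {"campaign_results", "sales_summary"},
    "finance": {"general_ledger", "sales_summary"},
    "data_scientist": {"sandbox"},  # exploratory work happens in a sandbox
}


def can_read(roles, dataset):
    """True if at least one of the user's roles grants the dataset."""
    return any(dataset in ROLE_GRANTS.get(role, set()) for role in roles)


def visible_datasets(roles):
    """All datasets a user with these roles is allowed to see, sorted."""
    visible = set()
    for role in roles:
        visible |= ROLE_GRANTS.get(role, set())
    return sorted(visible)


print(can_read(["marketing"], "general_ledger"))   # False
print(can_read(["finance"], "general_ledger"))     # True
print(visible_datasets(["marketing", "data_scientist"]))
```

Unknown roles grant nothing, so access is denied by default, which is the safer failure mode for sensitive material such as executive reports.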

You can also use security features to enable users to interact with the data and contribute to data preparation and enrichment. For example, as users find data in the data lake through the catalog, they can be allowed to clean up the data and enrich the fields in a dataset in a self-service manner.

Access controls can also enable a collaborative approach for accessing and consuming the data. For example, if one user finds a dataset that is important to a project, and there are three other team members on that same project, the user can create a shared workspace with that data so that the team can collaborate on enrichments.

Crowdsourcing

A bottom-up approach to data governance enables you to rank the usefulness of datasets by crowdsourcing. When users rate which datasets are the most valuable, the word can spread to other users, who can then make productive use of that data.

To do this, you need a rating and ranking mechanism as part of your integrated data lake management platform. The obvious place for this bottom-up, watermark-based governance model would be the catalog. Thus, the catalog must have rating functions.

But it’s not enough to show what others think of a dataset. An integrated data lake management and governance solution should show users the rankings of the datasets from all users, but it should also offer a personalized data rating, so that each individual can see what they have personally found useful whenever they go to the catalog.
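The two views described here, a ranking across all users and a personal view for one user, can be sketched with a simple list of ratings. The tuple layout, the five-star scale, and the sample ratings are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean

# Invented ratings as (user, dataset, stars from 1 to 5).
ratings = [
    ("ana", "web_orders", 5),
    ("ben", "web_orders", 4),
    ("ana", "ledger", 2),
    ("cho", "ledger", 3),
    ("ben", "promotions", 5),
]


def overall_ranking(ratings):
    """Datasets ordered by average rating across all users, best first."""
    by_dataset = defaultdict(list)
    for _user, dataset, stars in ratings:
        by_dataset[dataset].append(stars)
    averages = {dataset: mean(stars) for dataset, stars in by_dataset.items()}
    # Sort by descending average, then by name so ties are deterministic.
    return sorted(averages.items(), key=lambda item: (-item[1], item[0]))


def personal_ratings(ratings, user):
    """Only the ratings that this user has given."""
    return {dataset: stars for u, dataset, stars in ratings if u == user}


ranking = overall_ranking(ratings)
print([name for name, _avg in ranking])  # ['promotions', 'web_orders', 'ledger']
print(personal_ratings(ratings, "ana"))  # {'web_orders': 5, 'ledger': 2}
```

A plain average is the simplest possible ranking; a real catalog might also weight by the number of ratings so that a single five-star vote does not outrank a widely used dataset.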

Users also need tools to create new data models out of existing datasets. For example, users should be able to take a customer dataset and a transaction dataset and create a “most valuable customer” dataset by grouping customers by transactions and determining when customers are generating the most revenue. Being able to do these types of enrichments and transformations is important from an end-to-end perspective.
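The “most valuable customer” enrichment can be sketched in plain Python as a join followed by a grouping. The record layouts, the function name, and the sample data are hypothetical, and on data lake volumes this logic would run in a distributed engine rather than in memory.

```python
from collections import defaultdict

# Invented sample data, for illustration only.
customers = [
    {"customer_id": "C1", "name": "Avery"},
    {"customer_id": "C2", "name": "Blake"},
    {"customer_id": "C3", "name": "Casey"},
    {"customer_id": "C4", "name": "Devon"},  # no transactions yet
]
transactions = [
    {"customer_id": "C1", "month": "2024-01", "amount": 120.0},
    {"customer_id": "C1", "month": "2024-02", "amount": 80.0},
    {"customer_id": "C2", "month": "2024-01", "amount": 50.0},
    {"customer_id": "C3", "month": "2024-02", "amount": 300.0},
    {"customer_id": "C3", "month": "2024-03", "amount": 100.0},
]


def most_valuable_customers(customers, transactions, top_n=2):
    """Join customers to transactions and keep the top_n by total revenue."""
    total = defaultdict(float)
    by_month = defaultdict(lambda: defaultdict(float))
    for t in transactions:
        total[t["customer_id"]] += t["amount"]
        by_month[t["customer_id"]][t["month"]] += t["amount"]

    rows = []
    for c in customers:
        cid = c["customer_id"]
        if cid not in total:
            continue  # customers without transactions are left out
        months = by_month[cid]
        rows.append({
            "customer_id": cid,
            "name": c["name"],
            "total_revenue": total[cid],
            # The month in which this customer generated the most revenue.
            "peak_month": max(months, key=months.get),
        })
    rows.sort(key=lambda row: row["total_revenue"], reverse=True)
    return rows[:top_n]


mvc = most_valuable_customers(customers, transactions)
for row in mvc:
    print(row)
```

The peak_month column captures the "when" part of the question, while total_revenue drives the ranking itself.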

Data Lakes in Different Industries

The data lake provides value in many different areas. Following are some examples of industries that benefit from using a data lake to store, transform, and access information.

Health and Life Sciences

Data lakes allow health and life sciences organizations and companies to store and access widely disparate records of both structured and unstructured data in their native formats for later analysis. This avoids the need to force a single categorization of each data type, as would be the case in a traditional data warehouse. Not incidentally, preserving the native format also helps maintain data provenance and fidelity, enabling different analyses to be performed using different contexts. With data lakes, sophisticated data analysis projects are now possible because the data lakes enable distributed big data processing using broadly accepted, open software standards and massively parallel commodity hardware.

Providers

Many large healthcare providers maintain millions of records for millions of patients, including semi-structured reports such as radiology images, unstructured doctors’ notes, and data captured in spreadsheets and other common computer applications. Also, new models of collaborative care require constant ingestion of new data, integration of massive amounts of data, and updates in near real time to patient records. Data also is being used for predictive analytics for population health management and to help hospitals anticipate and reduce preventable readmissions.

Payers

Many major health insurers support the accountable care organization (ACO) model, which reimburses providers with pay-for-performance, outcome-based-reimbursement incentives. Payers need outcomes data to calculate provider outcomes scores and set reimbursement levels. Also, data management is essential to determine baseline performance and meet Centers for Medicare and Medicaid Services (CMS) requirements for data security, privacy, and HIPAA Safe Harbor guidelines. Additionally, payers are taking advantage of data analytics to predict and minimize claims fraud.

Pharmaceutical industry

R&D for drug development involves enormous volumes of data and many data types, including clinical details, images, labs, and sensor data. Because drug development takes years, any streamlining of processes can pay big dividends. In addition to cost-effective data storage and management, some pharmaceutical companies are using managed data lakes to increase the efficiency of clinical trials, such as speeding up patient recruitment and reducing costs with risk-based monitoring approaches.

Personalized medicine

We’re heading in the direction where we’ll use data about our DNA, microbiome, nutrition, sleep patterns, and more to customize more effective treatments for disease. A data lake allows for the collection of hundreds of gigabytes of data per person, generated by wearable sensors and other monitoring devices. Integrating this data and developing predictive models requires advanced analytics approaches, making data lakes and self-service data preparation key.

Financial Services

In the financial services industry, managed data lakes can be used to comply with regulatory reporting requirements, detect fraud, more accurately predict financial trends, and improve and personalize the customer experience.

By consolidating multiple enterprise data warehouses into one data lake, financial institutions can move reconciliation, settlement, and regulatory reporting, such as Dodd-Frank, to a single platform. This dramatically reduces the heavy lifting of integration because data is stored in a standard yet flexible format that can accommodate unstructured data.

Retail banking also has important use cases for data lakes. In this field, large institutions need to process thousands of applications for new checking and savings accounts on a daily basis. Bankers that accept these applications consult third-party risk scoring services before opening an account, yet it is common for bank risk analysts to manually override negative recommendations for applicants with poor banking histories. Although these overrides can happen for good reasons (say there are extenuating circumstances for a particular person’s application), high-risk accounts tend to be overdrawn and cost banks millions of dollars in losses due to mismanagement or fraud.

By moving to a data lake, banks can store and analyze multiple data streams and help regional managers control account risk in distributed branches. They are able to find out which risk analysts make account decisions that go against risk information from third parties. Creation of a centralized data catalog of the data in the data lake also supports increased access by nontechnical staff such as attorneys, who can quickly perform self-service data analytics. The net result is better control of fraud. Over time, the accumulation of data in the data lake allows the bank to build algorithms that automatically detect subtle but high-risk patterns that bank risk analysts might have previously failed to identify.
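As an illustration of the override analysis described here, the following sketch computes, for each risk analyst, the share of negative third-party recommendations that the analyst overrode. The record layout, the "decline" and "open" labels, and the sample decisions are invented; real account-opening data would carry many more fields.

```python
from collections import Counter

# Invented account decisions, for illustration only.
decisions = [
    {"analyst": "lee", "score_recommendation": "decline", "decision": "open"},
    {"analyst": "lee", "score_recommendation": "decline", "decision": "open"},
    {"analyst": "lee", "score_recommendation": "decline", "decision": "decline"},
    {"analyst": "lee", "score_recommendation": "approve", "decision": "open"},
    {"analyst": "kim", "score_recommendation": "decline", "decision": "decline"},
    {"analyst": "kim", "score_recommendation": "decline", "decision": "decline"},
]


def override_rates(decisions):
    """Per analyst: fraction of 'decline' recommendations that were opened anyway."""
    declines = Counter()
    overrides = Counter()
    for d in decisions:
        if d["score_recommendation"] != "decline":
            continue  # only negative recommendations can be overridden
        declines[d["analyst"]] += 1
        if d["decision"] == "open":
            overrides[d["analyst"]] += 1
    return {analyst: overrides[analyst] / declines[analyst] for analyst in declines}


rates = override_rates(decisions)
for analyst, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{analyst}: {rate:.0%} of negative recommendations overridden")
```

A high rate is not proof of a problem, because overrides can happen for good reasons, but it tells a regional manager where to look first.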

Telecommunications

The telecommunications sector has some unique challenges as revenues continue to decline due to increased competition, commoditization of products and services, and increased use of the internet in place of more lucrative voice and messaging services. These trends have made data analytics extremely important to telecommunications companies for delivering better services, discovering competitive advantages, adding new revenue streams, and finding efficiencies.

Telecommunications is extremely rich when it comes to subscriber usage data, including which services customers use and where and when they use them. A managed data lake enables telco operators to take advantage of their data more effectively; for example, to create new revenue streams. One interesting use case is to monetize the data and sell insights to companies for marketing or other purposes.

Also, customer service can be a strong differentiator in the telecommunications sector. A managed data lake is an excellent solution to support analytics for improving customer experience and delivering more targeted offers such as tiered pricing or customized data packages. Another valuable use case is using a data lake and data analytics to more efficiently guide deployment of new networks, reducing capital investment and operational costs.

Retail

Retailers are challenged to integrate data from many sources, including ecommerce, enterprise resource planning (ERP) and customer relationship management (CRM) systems, social media, customer support, transactional data, market research, emails, supply chain data, call records, and more to create a complete, 360-degree customer view. A more complete customer profile can help retailers to improve customer service, enhance marketing and loyalty programs, and develop new products.

Loyalty programs that track customer information and transactions and use that data to create more targeted and personalized rewards and experiences can entice customers not only to shop again, but to spend more or shop more often. A managed data lake can serve as a single repository for all customer data, and support the advanced analytics used to profile customers and optimize a loyalty program.

Personalized offers and recommendations are basic customer expectations today. A managed data lake and self-service data preparation platform for analytics enable retailers to collect nearly real-time or streaming data and use it to deliver personalized customer experiences in stores and online. For example, by capturing web session data (session histories of all users on a page), retailers can provide timely offers based on a customer’s web browsing and shopping history.

Another valuable use case for a managed data lake in retail is product development. Big data analytics and data science can help companies expand the adoption of successful products and services by identifying opportunities in underserved geographies or predicting what customers want.

Architecting Data Lakes, 2nd Edition by Ben Sharma