Operationalizing the Data Lake

by Holden Ackerman, Jon King

July 2019

Beginner to intermediate

171 pages

English

O'Reilly Media, Inc.

Acknowledgments
Foreword
Introduction
    Overview: Big Data’s Big Journey to the Cloud
    My Journey to a Data Lake
    A Quick History Lesson on Big Data
    The Second Phase of Big Data Development
    Weather Update: Clouds Ahead
    Bringing Big Data and Cloud Together
    Commercial Cloud Distributions: The Formative Years
    Big Data and AI Move Decisively to the Cloud, but Operationalizing Initiatives Lag
    We Believe in the Cloud for Big Data and AI
1. The Data Lake: A Central Repository
    What Is a Data Lake?
    Data Lakes and the Five Vs of Big Data
    Data Lake Consumers and Operators
    Operators
    Consumers (Both Internal and External)
    Challenges in Operationalizing Data Lakes
2. The Importance of Building a Self-Service Culture
    The End Goal: Becoming a Data-Driven Organization
    Foster a Culture of Data-Driven Decision Making
    Build an Organizational Structure That Supports a Self-Service Culture
    Putting a Self-Service Technological Infrastructure in Place
    Challenges of Building a Self-Service Infrastructure
    Lack of Specialized Expertise
    Disparity and Distribution of Data
    Organizational Resistance
    Reluctance to Commit to Open Source
3. Getting Started Building Your Data Lake
    The Benefits of Moving a Data Lake to the Cloud
    Key Benefit: The Ability to Separate Compute and Storage
    When Moving from an Enterprise Data Warehouse to a Data Lake
    Cloud Data Warehouse Distributed SQL
    How Companies Adopt Data Lakes: The Maturity Model
    Stage 1: Aspiration—Thinking About Moving Away from the Data Warehouse
    Stage 2: Experimentation—Moving from a Data Warehouse to a Data Lake
    Stage 3: Expansion—Moving the Data Lake to the Cloud
    Stage 4: Inversion
    Stage 5: Nirvana
4. Setting the Foundation for Your Data Lake
    Setting Up the Storage for the Data Lake
    Immutable Raw Storage Bucket
    Optimized Storage Bucket
    Scratch Database
    The Sources of Data
    Getting Data into the Data Lake
    Automating Metadata Capture
    Data Types
    Structured Data
    Semi-Structured Data
    Unstructured Data
    Storage Management in the Cloud
    Data Governance
5. Governing Your Data Lake
    Data Governance
    Privacy and Security in the Cloud
    Security Governance
    Financial Governance
    A Deeper Dive into Why the Cloud Makes Solid Financial Sense
    How to Mitigate Cloud Costs: Autoscaling
    Spot Instances
    Measuring Financial Impact
    Qubole’s Approach to Autoscaling
6. Tools for Making the Data Lake Platform
    The Six-Step Model for Operationalizing a Cloud-Native Data Lake
    Step 1: Ingest Data
    Step 2: Store, Monitor, and Manage Your Data
    Step 3: Prepare and Train Data
    The Importance of Data Confidence
    Tools for Data Preparation
    Step 4: Model and Serve Data
    Tools for Deploying Machine Learning in the Cloud
    Open Source Machine Learning Tools
    Managed Machine Learning Services
    Cloud Machine Learning Services
    Step 5: Extract Intelligence
    Tools for Extracting Intelligence
    Getting Data Out of Your Data Lake
    Presto for Ad Hoc Analytics
    Step 6: Productionize and Automate
    Tools for Moving to Production and Automating
    Open Source Workflow Schedulers
    ETL Managed Services
7. Securing Your Data Lake
    Consideration 1: Understand the Three “Distinct Parties” Involved in Cloud Security
    Consideration 2: Expect a Lot of Noise from Your Security Tools
    Consideration 3: Protect Critical Data
    Consideration 4: Use Big Data to Enhance Security

8. Considerations for the Data Engineer
    Top Considerations for Data Engineers Using a Data Lake in the Cloud
    Protect Your Users
    Ensure That Data Governance Is in Place
    Designate Areas for Raw and Optimal Data Storage
    Considerations for Data Engineers in the Cloud
    Summary
9. Considerations for the Data Scientist
    Data Scientists Versus Machine Learning Engineers: What’s the Difference?
    Data Scientist Use Cases
    How a Data Scientist Begins a Project
    Top Considerations for Data Scientists Using a Data Lake in the Cloud
10. Considerations for the Data Analyst
    A Typical Experience for a Data Analyst
    Top Considerations for Data Analysts Using a Data Lake in the Cloud
11. Case Study: Ibotta Builds a Cost-Efficient, Self-Service Data Lake
12. Conclusion
    Best Practices for Operationalizing the Data Lake
    General Best Practices

Overview

Big data and advanced analytics have increasingly moved to the cloud as organizations pursue actionable insights and data-driven products using the growing amounts of information they collect. But few companies have truly operationalized their data so that it is usable across the entire organization. With this pragmatic ebook, engineers, architects, and data managers will learn how to build and extract value from a data lake in the cloud, and how to use the compute power and scalability of a cloud-native data platform to put their company’s vast data trove into action.

Holden Ackerman and Jon King of Qubole take you through the basics of building a data lake operation, from people to technology, employing multiple technologies and frameworks in a cloud-native data platform. You’ll dive into the tools and processes you need for the entire lifecycle of a data lake, from data preparation, storage, and management to distributed computing and analytics. You’ll also explore the unique role that each member of your data team needs to play as you migrate to your cloud-native data platform.

Leverage your data effectively through a single source of truth
Understand the importance of building a self-service culture for your data lake
Define the structure you need to build a data lake in the cloud
Implement financial governance and data security policies for your data lake through a cloud-native data platform
Identify the tools you need to manage your data infrastructure
Delineate the scope, usage rights, and best tools for each team working with a data lake—analysts, data scientists, data engineers, and security professionals, among others

Publisher Resources

ISBN: 9781492049517
