The Enterprise Big Data Lake

Name: The Enterprise Big Data Lake
Author: Alex Gorelik
ISBN: 9781491931554

March 2019

Beginner to intermediate

221 pages

6h 35m

English

O'Reilly Media, Inc.

Preface
  Who Should Read This Book?
  Conventions Used in This Book
  O’Reilly Online Learning
  How to Contact Us
  Acknowledgments
1. Introduction to Data Lakes
  Data Lake Maturity
  Data Puddles
  Data Ponds
  Creating a Successful Data Lake
  The Right Platform
  The Right Data
  The Right Interface
  The Data Swamp
  Roadmap to Data Lake Success
  Standing Up a Data Lake
  Organizing the Data Lake
  Setting Up the Data Lake for Self-Service
  Data Lake Architectures
  Data Lakes in the Public Cloud
  Logical Data Lakes
  Conclusion
2. Historical Perspective
  The Drive for Self-Service Data—The Birth of Databases
  The Analytics Imperative—The Birth of Data Warehousing
  The Data Warehouse Ecosystem
  Storing and Querying the Data
  Loading the Data—Data Integration Tools
  Organizing and Managing the Data
  Consuming the Data
  Conclusion
3. Introduction to Big Data and Data Science
  Hadoop Leads the Historic Shift to Big Data
  The Hadoop File System
  How Processing and Storage Interact in a MapReduce Job
  Schema on Read
  Hadoop Projects
  Data Science
  What Should Your Analytics Organization Focus On?
  Machine Learning
  Explainability
  Change Management
  Conclusion
4. Starting a Data Lake
  The What and Why of Hadoop
  Preventing Proliferation of Data Puddles
  Taking Advantage of Big Data
  Leading with Data Science
  Strategy 1: Offload Existing Functionality
  Strategy 2: Data Lakes for New Projects
  Strategy 3: Establish a Central Point of Governance
  Which Way Is Right for You?
  Conclusion
5. From Data Ponds/Big Data Warehouses to Data Lakes
  Essential Functions of a Data Warehouse
  Dimensional Modeling for Analytics
  Integrating Data from Disparate Sources
  Preserving History Using Slowly Changing Dimensions
  Limitations of the Data Warehouse as a Historical Repository
  Moving to a Data Pond
  Keeping History in a Data Pond
  Implementing Slowly Changing Dimensions in a Data Pond
  Growing Data Ponds into a Data Lake—Loading Data That’s Not in the Data Warehouse
  Raw Data
  External Data
  Internet of Things (IoT) and Other Streaming Data
  Real-Time Data Lakes
  The Lambda Architecture
  Data Transformations
  Target Systems
  Data Warehouses
  Operational Data Stores
  Real-Time Applications and Data Products
  Conclusion
6. Optimizing for Self-Service
  The Beginnings of Self-Service
  Business Analysts
  Finding and Understanding Data—Documenting the Enterprise
  Establishing Trust
  Provisioning
  Preparing Data for Analysis
  Data Wrangling in the Data Lake
  Situating Data Preparation in Hadoop
  Common Use Cases for Data Preparation
  Analyzing and Visualizing
  The New World of Self-Service Business Intelligence
  The New Analytic Workflow
  Gatekeepers to Shopkeepers
  Governing Self-Service
  Conclusion
7. Architecting the Data Lake
  Organizing the Data Lake
  Landing or Raw Zone
  Gold Zone
  Work Zone
  Sensitive Zone
  Multiple Data Lakes
  Advantages of Keeping Data Lakes Separate
  Advantages of Merging the Data Lakes
  Cloud Data Lakes
  Virtual Data Lakes
  Data Federation
  Big Data Virtualization
  Eliminating Redundancy
  Conclusion
8. Cataloging the Data Lake
  Organizing the Data
  Technical Metadata
  Business Metadata
  Tagging
  Automated Cataloging
  Logical Data Management
  Sensitive Data Management and Access Control
  Data Quality
  Relating Disparate Data
  Establishing Lineage
  Data Provisioning
  Tools for Building a Catalog
  Tool Comparison
  The Data Ocean
  Conclusion
9. Governing Data Access
  Authorization or Access Control
  Tag-Based Data Access Policies
  Deidentifying Sensitive Data
  Data Sovereignty and Regulatory Compliance
  Self-Service Access Management
  Provisioning Data
  Conclusion

10. Industry-Specific Perspectives
  Big Data in Financial Services
  Consumers, Digitization, and Data Are Changing Finance as We Know It
  Saving the Bank
  New Opportunities Offered by New Data
  Key Processes in Making Use of the Data Lake
  Value Added by Data Lakes in Financial Services
  Data Lakes in the Insurance Industry
  Smart Cities
  Big Data in Medicine
Index

Overview

The data lake is a daring new approach for harnessing the power of big data technology and providing convenient self-service capabilities. But is it right for your company? This book is based on discussions with practitioners and executives from more than a hundred organizations, ranging from data-driven companies such as Google, LinkedIn, and Facebook, to governments and traditional corporate enterprises. You’ll learn what a data lake is, why enterprises need one, and how to build one successfully with the best practices in this book.

Alex Gorelik, CTO and founder of Waterline Data, explains why old systems and processes can no longer support data needs in the enterprise. Then, in a collection of essays about data lake implementation, you’ll examine data lake initiatives, analytic projects, experiences, and best practices from data experts working in various industries.

Get a succinct introduction to data warehousing, big data, and data science
Learn various paths enterprises take to build a data lake
Explore how to build a self-service model and best practices for providing analysts access to the data
Use different methods for architecting your data lake
Discover ways to implement a data lake from experts in different industries

Publisher Resources

ISBN: 9781491931547