Chapter 4. Starting a Data Lake
As discussed in the previous chapter, the promise of the data lake is to store the enterprise’s data in a way that maximizes its availability and accessibility for analytics and data science. But what’s the best way to get started? This chapter discusses various paths enterprises take to build a data lake.
Apache Hadoop is an open source project that's frequently used for this purpose. While there are many alternatives, especially in the cloud, Hadoop-based data lakes illustrate the advantages of the approach well, so we are going to use Hadoop as an example. We'll begin by reviewing what it is and some of its key advantages for supporting a data lake.
The What and Why of Hadoop
Hadoop is a massively parallel storage and execution platform that automates many of the difficult aspects of building a highly scalable and available cluster. It includes a distributed filesystem, HDFS (although some Hadoop distributions, such as those from MapR and IBM, substitute their own filesystems for HDFS). HDFS automatically replicates data across the cluster to provide both parallelism and availability. For example, with the default replication factor of three, HDFS stores each block on three different nodes. This way, when a job needs a block of data, the scheduler has a choice of three different nodes and can decide which is best based on what other jobs are running on each node, what other data is located there, and so forth. Furthermore, ...
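To make the replication factor concrete, here is a minimal sketch using Hadoop's Java client API to inspect and change the replication factor of a single file. The file path is hypothetical, and the sketch assumes a Hadoop client classpath and a reachable cluster; the calls themselves (getFileStatus, getReplication, setReplication) are standard parts of the FileSystem API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml/hdfs-site.xml from the client classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical dataset path used for illustration only.
        Path file = new Path("/data/lake/events.parquet");

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Current replication: " + status.getReplication());

        // Request a higher replication factor for a frequently read file,
        // giving the scheduler more candidate nodes to run tasks on.
        // The NameNode re-replicates the blocks asynchronously.
        fs.setReplication(file, (short) 5);

        fs.close();
    }
}

The same change can be made from the command line with hdfs dfs -setrep, and the cluster-wide default can be set via the dfs.replication property in hdfs-site.xml.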