The Enterprise Big Data Lake

Book Description

The data lake is a daring new approach for harnessing the power of big data technology and providing convenient self-service capabilities. But is it right for your company? This book is based on discussions with practitioners and executives from more than a hundred organizations, ranging from data-driven companies such as Google, LinkedIn, and Facebook, to governments and traditional corporate enterprises. You’ll learn what a data lake is, why enterprises need one, and how to build one successfully using the best practices described in this book.

Alex Gorelik, CTO and founder of Waterline Data, explains why old systems and processes can no longer support data needs in the enterprise. Then, in a collection of essays about data lake implementation, you’ll examine data lake initiatives, analytic projects, experiences, and best practices from data experts working in various industries.

  • Get a succinct introduction to data warehousing, big data, and data science
  • Learn various paths enterprises take to build a data lake
  • Explore how to build a self-service model and best practices for providing analysts access to the data
  • Use different methods for architecting your data lake
  • Discover ways to implement a data lake from experts in different industries

Table of Contents

  1. Preface
    1. Who Should Read This Book?
    2. Conventions Used in This Book
    3. O’Reilly Online Learning
    4. How to Contact Us
    5. Acknowledgments
  2. 1. Introduction to Data Lakes
    1. Data Lake Maturity
      1. Data Puddles
      2. Data Ponds
    2. Creating a Successful Data Lake
      1. The Right Platform
      2. The Right Data
      3. The Right Interface
      4. The Data Swamp
    3. Roadmap to Data Lake Success
      1. Standing Up a Data Lake
      2. Organizing the Data Lake
      3. Setting Up the Data Lake for Self-Service
    4. Data Lake Architectures
      1. Data Lakes in the Public Cloud
      2. Logical Data Lakes
    5. Conclusion
  3. 2. Historical Perspective
    1. The Drive for Self-Service Data—The Birth of Databases
    2. The Analytics Imperative—The Birth of Data Warehousing
    3. The Data Warehouse Ecosystem
      1. Storing and Querying the Data
      2. Loading the Data—Data Integration Tools
      3. Organizing and Managing the Data
      4. Consuming the Data
    4. Conclusion
  4. 3. Introduction to Big Data and Data Science
    1. Hadoop Leads the Historic Shift to Big Data
      1. The Hadoop File System
      2. How Processing and Storage Interact in a MapReduce Job
      3. Schema on Read
      4. Hadoop Projects
    2. Data Science
    3. What Should Your Analytics Organization Focus On?
    4. Machine Learning
      1. Explainability
      2. Change Management
    5. Conclusion
  5. 4. Starting a Data Lake
    1. The What and Why of Hadoop
    2. Preventing Proliferation of Data Puddles
    3. Taking Advantage of Big Data
      1. Leading with Data Science
      2. Strategy 1: Offload Existing Functionality
      3. Strategy 2: Data Lakes for New Projects
      4. Strategy 3: Establish a Central Point of Governance
      5. Which Way Is Right for You?
    4. Conclusion
  6. 5. From Data Ponds/Big Data Warehouses to Data Lakes
    1. Essential Functions of a Data Warehouse
      1. Dimensional Modeling for Analytics
      2. Integrating Data from Disparate Sources
      3. Preserving History Using Slowly Changing Dimensions
      4. Limitations of the Data Warehouse as a Historical Repository
    2. Moving to a Data Pond
      1. Keeping History in a Data Pond
      2. Implementing Slowly Changing Dimensions in a Data Pond
    3. Growing Data Ponds into a Data Lake—Loading Data That’s Not in the Data Warehouse
      1. Raw Data
      2. External Data
      3. Internet of Things (IoT) and Other Streaming Data
    4. Real-Time Data Lakes
    5. The Lambda Architecture
    6. Data Transformations
    7. Target Systems
      1. Data Warehouses
      2. Operational Data Stores
      3. Real-Time Applications and Data Products
    8. Conclusion
  7. 6. Optimizing for Self-Service
    1. The Beginnings of Self-Service
    2. Business Analysts
      1. Finding and Understanding Data—Documenting the Enterprise
      2. Establishing Trust
      3. Provisioning
      4. Preparing Data for Analysis
    3. Data Wrangling in the Data Lake
      1. Situating Data Preparation in Hadoop
      2. Common Use Cases for Data Preparation
    4. Analyzing and Visualizing
    5. The New World of Self-Service Business Intelligence
      1. The New Analytic Workflow
      2. Gatekeepers to Shopkeepers
      3. Governing Self-Service
    6. Conclusion
  8. 7. Architecting the Data Lake
    1. Organizing the Data Lake
      1. Landing or Raw Zone
      2. Gold Zone
      3. Work Zone
      4. Sensitive Zone
    2. Multiple Data Lakes
      1. Advantages of Keeping Data Lakes Separate
      2. Advantages of Merging the Data Lakes
    3. Cloud Data Lakes
    4. Virtual Data Lakes
      1. Data Federation
      2. Big Data Virtualization
      3. Eliminating Redundancy
    5. Conclusion
  9. 8. Cataloging the Data Lake
    1. Organizing the Data
      1. Technical Metadata
      2. Business Metadata
    2. Tagging
      1. Automated Cataloging
    3. Logical Data Management
      1. Sensitive Data Management and Access Control
      2. Data Quality
    4. Relating Disparate Data
    5. Establishing Lineage
    6. Data Provisioning
    7. Tools for Building a Catalog
      1. Tool Comparison
    8. The Data Ocean
    9. Conclusion
  10. 9. Governing Data Access
    1. Authorization or Access Control
    2. Tag-Based Data Access Policies
    3. Deidentifying Sensitive Data
      1. Data Sovereignty and Regulatory Compliance
    4. Self-Service Access Management
      1. Provisioning Data
    5. Conclusion
  11. 10. Industry-Specific Perspectives
    1. Big Data in Financial Services
      1. Consumers, Digitization, and Data Are Changing Finance as We Know It
      2. Saving the Bank
      3. New Opportunities Offered by New Data
      4. Key Processes in Making Use of the Data Lake
    2. Value Added by Data Lakes in Financial Services
    3. Data Lakes in the Insurance Industry
    4. Smart Cities
    5. Big Data in Medicine
  12. Index