Operationalizing the Data Lake

Book Description

Big data and advanced analytics have increasingly moved to the cloud as organizations pursue actionable insights and data-driven products using the growing amounts of information they collect. But few companies have truly operationalized their data so that it’s usable across the entire organization. With this pragmatic ebook, engineers, architects, and data managers will learn how to build a data lake in the cloud, extract value from it, and use the compute power and scalability of a cloud-native data platform to put their company’s vast data trove into action.

Holden Ackerman and Jon King of Qubole take you through the basics of building a data lake operation, from people to technology, employing multiple technologies and frameworks in a cloud-native data platform. You’ll dive into the tools and processes you need for the entire lifecycle of a data lake, from data preparation, storage, and management to distributed computing and analytics. You’ll also explore the unique role that each member of your data team needs to play as you migrate to your cloud-native data platform.

  • Leverage your data effectively through a single source of truth
  • Understand the importance of building a self-service culture for your data lake
  • Define the structure you need to build a data lake in the cloud
  • Implement financial governance and data security policies for your data lake through a cloud-native data platform
  • Identify the tools you need to manage your data infrastructure
  • Delineate the scope, usage rights, and best tools for each team working with a data lake—analysts, data scientists, data engineers, and security professionals, among others

Table of Contents

  1. Acknowledgments
  2. Foreword
  3. Introduction
    1. Overview: Big Data’s Big Journey to the Cloud
    2. My Journey to a Data Lake
    3. A Quick History Lesson on Big Data
    4. The Second Phase of Big Data Development
    5. Weather Update: Clouds Ahead
    6. Bringing Big Data and Cloud Together
    7. Commercial Cloud Distributions: The Formative Years
    8. Big Data and AI Move Decisively to the Cloud, but Operationalizing Initiatives Lag
    9. We Believe in the Cloud for Big Data and AI
  4. 1. The Data Lake: A Central Repository
    1. What Is a Data Lake?
    2. Data Lakes and the Five Vs of Big Data
    3. Data Lake Consumers and Operators
      1. Operators
      2. Consumers (Both Internal and External)
    4. Challenges in Operationalizing Data Lakes
  5. 2. The Importance of Building a Self-Service Culture
    1. The End Goal: Becoming a Data-Driven Organization
      1. Foster a Culture of Data-Driven Decision Making
      2. Build an Organizational Structure That Supports a Self-Service Culture
      3. Putting a Self-Service Technological Infrastructure in Place
    2. Challenges of Building a Self-Service Infrastructure
      1. Lack of Specialized Expertise
      2. Disparity and Distribution of Data
      3. Organizational Resistance
      4. Reluctance to Commit to Open Source
  6. 3. Getting Started Building Your Data Lake
    1. The Benefits of Moving a Data Lake to the Cloud
      1. Key Benefit: The Ability to Separate Compute and Storage
    2. When Moving from an Enterprise Data Warehouse to a Data Lake
      1. Cloud Data Warehouse
      2. Distributed SQL
    3. How Companies Adopt Data Lakes: The Maturity Model
      1. Stage 1: Aspiration—Thinking About Moving Away from the Data Warehouse
      2. Stage 2: Experimentation—Moving from a Data Warehouse to a Data Lake
      3. Stage 3: Expansion—Moving the Data Lake to the Cloud
      4. Stage 4: Inversion
      5. Stage 5: Nirvana
  7. 4. Setting the Foundation for Your Data Lake
    1. Setting Up the Storage for the Data Lake
      1. Immutable Raw Storage Bucket
      2. Optimized Storage Bucket
      3. Scratch Database
    2. The Sources of Data
    3. Getting Data into the Data Lake
    4. Automating Metadata Capture
    5. Data Types
      1. Structured Data
      2. Semi-Structured Data
      3. Unstructured Data
    6. Storage Management in the Cloud
    7. Data Governance
  8. 5. Governing Your Data Lake
    1. Data Governance
    2. Privacy and Security in the Cloud
      1. Security Governance
    3. Financial Governance
      1. A Deeper Dive into Why the Cloud Makes Solid Financial Sense
      2. How to Mitigate Cloud Costs: Autoscaling
      3. Spot Instances
    4. Measuring Financial Impact
      1. Qubole’s Approach to Autoscaling
  9. 6. Tools for Making the Data Lake Platform
    1. The Six-Step Model for Operationalizing a Cloud-Native Data Lake
      1. Step 1: Ingest Data
      2. Step 2: Store, Monitor, and Manage Your Data
      3. Step 3: Prepare and Train Data
    2. The Importance of Data Confidence
      1. Tools for Data Preparation
      2. Step 4: Model and Serve Data
    3. Tools for Deploying Machine Learning in the Cloud
      1. Open Source Machine Learning Tools
      2. Managed Machine Learning Services
      3. Cloud Machine Learning Services
        1. Step 5: Extract Intelligence
        2. Tools for Extracting Intelligence
      4. Getting Data Out of Your Data Lake
        1. Presto for Ad Hoc Analytics
      5. Step 6: Productionize and Automate
    4. Tools for Moving to Production and Automating
      1. Open Source Workflow Schedulers
      2. ETL Managed Services
  10. 7. Securing Your Data Lake
    1. Consideration 1: Understand the Three “Distinct Parties” Involved in Cloud Security
    2. Consideration 2: Expect a Lot of Noise from Your Security Tools
    3. Consideration 3: Protect Critical Data
    4. Consideration 4: Use Big Data to Enhance Security
  11. 8. Considerations for the Data Engineer
    1. Top Considerations for Data Engineers Using a Data Lake in the Cloud
      1. Protect Your Users
      2. Ensure That Data Governance Is in Place
      3. Designate Areas for Raw and Optimal Data Storage
    2. Considerations for Data Engineers in the Cloud
    3. Summary
  12. 9. Considerations for the Data Scientist
    1. Data Scientists Versus Machine Learning Engineers: What’s the Difference?
      1. Data Scientist Use Cases
      2. How a Data Scientist Begins a Project
    2. Top Considerations for Data Scientists Using a Data Lake in the Cloud
  13. 10. Considerations for the Data Analyst
    1. A Typical Experience for a Data Analyst
      1. Top Considerations for Data Analysts Using a Data Lake in the Cloud
  14. 11. Case Study: Ibotta Builds a Cost-Efficient, Self-Service Data Lake
  15. 12. Conclusion
    1. Best Practices for Operationalizing the Data Lake
    2. General Best Practices