Architecting Data Lakes

Book description

Many organizations use Hadoop-driven data lakes as an adjunct staging area for their enterprise data warehouses (EDW). But for those companies ready to take the plunge, a data lake is far more useful as a one-stop shop for extracting insights from their vast collection of data. With this ebook, you’ll learn best practices for building, maintaining, and deriving value from a Hadoop data lake in production environments.

Authors Alice LaPlante and Ben Sharma explain how a data lake will enable your organization to manage an increasing volume of datasets—from blog postings and product reviews to streaming data—and to discover important relationships between them. Whether you want to control administrative costs in healthcare or reduce risk in financial services, this ebook addresses the architectural considerations and required capabilities you need to build your own data lake.

With this ebook, you’ll learn:

  • The key attributes of a data lake, including its ability to store information in native formats for later processing
  • Why implementing data management and governance in your data lake is crucial
  • How to address the challenges of building and managing a data lake
  • Self-service options that enable different users to access the data lake without help from IT
  • Emerging trends that will shape the future of data lakes

Table of contents

  1. Overview
    1. What Is a Data Lake?
      1. Drawbacks of the Traditional EDW
      2. Key Attributes of a Data Lake
      3. The Business Case for Data Lakes
    2. Data Management and Governance in the Data Lake
      1. Address the Challenge Later
      2. Adapt Existing Legacy Tools
      3. Write Custom Scripts
      4. Deploy a Data Lake Management Platform
    3. How to Deploy a Data Lake Management Platform
  2. How Data Lakes Work
    1. Four Basic Functions of a Data Lake
      1. Data Ingestion
      2. Data Storage and Retention
      3. Data Processing
      4. Data Access
    2. Management and Monitoring
      1. A Combined Approach
      2. Metadata
  3. Challenges and Complications
    1. Challenges of Building a Data Lake
      1. Rate of Change
      2. Acquiring Skilled Personnel
      3. Technological Complexity
    2. Challenges of Managing the Data Lake
      1. Ingestion
      2. Lack of Visibility
      3. Privacy and Compliance
    3. Deriving Value from the Data Lake
      1. Reusability
  4. Curating the Data Lake
    1. Data Governance
      1. Integrating a Data Lake Management Solution
    2. Data Acquisition
    3. Data Organization
      1. Data Catalog
    4. Capturing Metadata
    5. Data Preparation
    6. Data Provisioning
      1. The Executive
      2. The Data Scientist
      3. The Business Analyst
      4. A Downstream System
    7. Benefits of an Automated Approach
  5. Deriving Value from the Data Lake
    1. Self-Service
    2. Controlling and Allowing Access
    3. Using a Bottom-Up Approach to Data Governance to Rank Data Sets
    4. Data Lakes in Different Industries
      1. Healthcare
      2. Financial Services
      3. Retail
  6. Looking Ahead
    1. Ground-to-Cloud Deployment Options
    2. Looking Beyond Hadoop: Logical Data Lakes
    3. Federated Queries
    4. Data Discovery Portals
    5. In Conclusion
    6. A Checklist for Success

Product information

  • Title: Architecting Data Lakes
  • Author(s): Alice LaPlante, Ben Sharma
  • Release date: April 2016
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491952597