
Data Lake for Enterprises

Book Description

A practical guide to implementing your enterprise data lake using Lambda Architecture as the base

About This Book

  • Build a full-fledged data lake for your organization with popular big data technologies using the Lambda architecture as the base
  • Delve into the big data technologies required to meet modern day business strategies
  • A highly practical guide to implementing enterprise data lakes with lots of examples and real-world use-cases

Who This Book Is For

Java developers and architects who would like to implement a data lake for their enterprise will find this book useful. If you want to get hands-on experience with the Lambda Architecture and big data technologies by implementing a practical solution using these technologies, this book will also help you.

What You Will Learn

  • Build an enterprise-level data lake using the relevant big data technologies
  • Understand the core of the Lambda architecture and how to apply it in an enterprise
  • Learn the technical details around Sqoop and its functionalities
  • Integrate Kafka with Hadoop components to acquire enterprise data
  • Use Flume with streaming technologies for stream-based processing
  • Understand stream-based processing with reference to Apache Spark Streaming
  • Incorporate Hadoop components and know the advantages they provide for enterprise data lakes
  • Build fast, streaming, and high-performance applications using ElasticSearch
  • Make your data ingestion process consistent across various data formats with configurability
  • Process your data to derive intelligence using machine learning algorithms

In Detail

The term "Data Lake" has recently gained prominence in the big data industry. Data scientists can use a data lake to derive meaningful insights that businesses can apply to redefine or transform the way they operate. The Lambda architecture is likewise emerging as one of the most prominent patterns in the big data landscape, as it not only helps derive useful information from historical data but also correlates real-time data, enabling businesses to make critical decisions. This book brings these two important aspects, the data lake and the Lambda architecture, together.

This book is divided into three main sections. The first introduces you to the concept of data lakes and their importance in enterprises, and gets you up to speed with the Lambda architecture. The second section delves into the principal components of building a data lake using the Lambda architecture. It introduces you to popular big data technologies such as Apache Hadoop, Spark, Sqoop, Flume, and ElasticSearch. The third section is a highly practical demonstration of putting it all together, and shows you how an enterprise data lake can be implemented, along with several real-world use cases. It also shows you how other peripheral components can be added to the lake to make it more efficient.

By the end of this book, you will be able to choose the right big data technologies and apply the Lambda architecture pattern to build your enterprise data lake.

Style and approach

The book takes a pragmatic approach, showing ways to leverage big data technologies and the Lambda architecture to build an enterprise-level data lake.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the code files emailed directly to you.

Table of Contents

  1. Preface
    1. What this book covers
    2. What you need for this book
    3. Who this book is for
    4. Conventions
    5. Reader feedback
    6. Customer support
      1. Downloading the example code
      2. Errata
      3. Piracy
      4. Questions
  2. Introduction to Data
    1. Exploring data
    2. What is Enterprise Data?
    3. Enterprise Data Management
    4. Big data concepts
      1. Big data and 4Vs
    5. Relevance of data
    6. Quality of data
    7. Where does this data live in an enterprise?
      1. Intranet (within enterprise)
      2. Internet (external to enterprise)
        1. Business applications hosted in cloud
        2. Third-party cloud solutions
        3. Social data (structured and unstructured)
      3. Data stores or persistent stores (RDBMS or NoSQL)
      4. Traditional data warehouse
      5. File stores
    8. Enterprise's current state
    9. Enterprise digital transformation
      1. Enterprises embarking on this journey
        1. Some examples
    10. Data lake use case enlightenment
    11. Summary
  3. Comprehensive Concepts of a Data Lake
    1. What is a Data Lake?
      1. Relevance to enterprises
    2. How does a Data Lake help enterprises?
      1. Data Lake benefits
    3. How Data Lake works?
    4. Differences between Data Lake and Data Warehouse
    5. Approaches to building a Data Lake
    6. Lambda Architecture-driven Data Lake
      1. Data ingestion layer - ingest for processing and storage
      2. Batch layer - batch processing of ingested data
      3. Speed layer - near real time data processing
      4. Data storage layer - store all data
      5. Serving layer - data delivery and exports
      6. Data acquisition layer - get data from source systems
      7. Messaging Layer - guaranteed data delivery
      8. Exploring the Data Ingestion Layer
      9. Exploring the Lambda layer
        1. Batch layer
        2. Speed layer
        3. Serving layer
          1. Data push
          2. Data pull
        4. Data storage layer
          1. Batch process layer
          2. Speed layer
          3. Serving layer
        5. Relational data stores
          1. Distributed data stores
    7. Summary
  4. Lambda Architecture as a Pattern for Data Lake
    1. What is Lambda Architecture?
    2. History of Lambda Architecture
    3. Principles of Lambda Architecture
      1. Fault-tolerant principle
      2. Immutable Data principle
      3. Re-computation principle
    4. Components of a Lambda Architecture
      1. Batch layer
      2. Speed layer
        1. CAP Theorem
        2. Eventual consistency
      3. Serving layer
    5. Complete working of a Lambda Architecture
    6. Advantages of Lambda Architecture
    7. Disadvantages of Lambda Architectures
    8. Technology overview for Lambda Architecture
    9. Applied lambda
      1. Enterprise-level log analysis
      2. Capturing and analyzing sensor data
      3. Real-time mailing platform statistics
      4. Real-time sports analysis
      5. Recommendation engines
      6. Analyzing security threats
      7. Multi-channel consumer behaviour
    10. Working examples of Lambda Architecture
    11. Kappa architecture
    12. Summary
  5. Applied Lambda for Data Lake
    1. Knowing Hadoop distributions
    2. Selection factors for a big data stack for enterprises
      1. Technical capabilities
      2. Ease of deployment and maintenance
      3. Integration readiness
    3. Batch layer for data processing
      1. The NameNode server
      2. The secondary NameNode Server
      3. Yet Another Resource Negotiator (YARN)
      4. Data storage nodes (DataNode)
      5. Speed layer
      6. Flume for data acquisition
        1. Source for event sourcing
        2. Interceptors for event interception
        3. Channels for event flow
        4. Sink as an event destination
      7. Spark Streaming
        1. DStreams
          1. Data Frames
          2. Checkpointing
        2. Apache Flink
    4. Serving layer
      1. Data repository layer
        1. Relational databases
        2. Big data tables/views
        3. Data services with data indexes
        4. NoSQL databases
      2. Data access layer
        1. Data exports
        2. Data publishing
    5. Summary
  6. Data Acquisition of Batch Data using Apache Sqoop
    1. Context in data lake - data acquisition
      1. Data acquisition layer
      2. Data acquisition of batch data - technology mapping
    2. Why Apache Sqoop
      1. History of Sqoop
      2. Advantages of Sqoop
      3. Disadvantages of Sqoop
    3. Workings of Sqoop
      1. Sqoop 2 architecture
      2. Sqoop 1 versus Sqoop 2
        1. Ease of use
        2. Ease of extension
        3. Security
        4. When to use Sqoop 1 and Sqoop 2
      3. Functioning of Sqoop
      4. Data import using Sqoop
      5. Data export using Sqoop
    4. Sqoop connectors
      1. Types of Sqoop connectors
    5. Sqoop support for HDFS
    6. Sqoop working example
      1. Installation and Configuration
        1. Step 1 - Installing and verifying Java
        2. Step 2 - Installing and verifying Hadoop
        3. Step 3 - Installing and verifying Hue
        4. Step 4 - Installing and verifying Sqoop
        5. Step 5 - Installing and verifying PostgreSQL (RDBMS)
        6. Step 6 - Installing and verifying HBase (NoSQL)
      2. Configure data source (ingestion)
      3. Sqoop configuration (database drivers)
      4. Configuring HDFS as destination
      5. Sqoop Import
        1. Import complete database
        2. Import selected tables
        3. Import selected columns from a table
        4. Import into HBase
      6. Sqoop Export
      7. Sqoop Job
        1. Job command
        2. Create job
        3. List Job
        4. Run Job
        5. Create Job
      8. Sqoop 2
      9. Sqoop in purview of SCV use case
    7. When to use Sqoop
    8. When not to use Sqoop
    9. Real-time Sqooping: a possibility?
    10. Other options
      1. Native big data connectors
      2. Talend
      3. Pentaho's Kettle (PDI - Pentaho Data Integration)
    11. Summary
  7. Data Acquisition of Stream Data using Apache Flume
    1. Context in Data Lake: data acquisition
      1. What is Stream Data?
      2. Batch and stream data
      3. Data acquisition of stream data - technology mapping
      4. What is Flume?
      5. Sqoop and Flume
    2. Why Flume?
      1. History of Flume
      2. Advantages of Flume
      3. Disadvantages of Flume
    3. Flume architecture principles
    4. The Flume Architecture
      1. Distributed pipeline - Flume architecture
      2. Fan Out - Flume architecture
      3. Fan In - Flume architecture
      4. Three tier design - Flume architecture
      5. Advanced Flume architecture
      6. Flume reliability level
    5. Flume event - Stream Data
    6. Flume agent
      1. Flume agent configurations
    7. Flume source
      1. Custom Source
    8. Flume Channel
      1. Custom channel
    9. Flume sink
      1. Custom sink
    10. Flume configuration
    11. Flume transaction management
    12. Other flume components
      1. Channel processor
      2. Interceptor
      3. Channel Selector
      4. Sink Groups
      5. Sink Processor
      6. Event Serializers
    13. Context Routing
    14. Flume working example
      1. Installation and Configuration
        1. Step 1: Installing and verifying Flume
        2. Step 2: Configuring Flume
        3. Step 3: Start Flume
      2. Flume in purview of SCV use case
        1. Kafka Installation
          1. Example 1 - RDBMS to Kafka
          2. Example 2: Spool messages to Kafka
          3. Example 3: Interceptors
          4. Example 4 - Memory channel, file channel, and Kafka channel
    15. When to use Flume
    16. When not to use Flume
    17. Other options
      1. Apache Flink
      2. Apache NiFi
    18. Summary
  8. Messaging Layer using Apache Kafka
    1. Context in Data Lake- messaging layer
      1. Messaging layer
      2. Messaging layer- technology mapping
      3. What is Apache Kafka?
    2. Why Apache Kafka
      1. History of Kafka
      2. Advantages of Kafka
      3. Disadvantages of Kafka
    3. Kafka architecture
      1. Core architecture principles of Kafka
      2. Data stream life cycle
      3. Working of Kafka
      4. Kafka message
      5. Kafka producer
      6. Persistence of data in Kafka using topics
      7. Partitions- Kafka topic division
      8. Kafka message broker
      9. Kafka consumer
        1. Consumer groups
    4. Other Kafka components
      1. Zookeeper
      2. MirrorMaker
    5. Kafka programming interface
      1. Kafka core API's
      2. Kafka REST interface
    6. Producer and consumer reliability
    7. Kafka security
    8. Kafka as message-oriented middleware
    9. Scale-out architecture with Kafka
    10. Kafka connect
    11. Kafka working example
      1. Installation
      2. Producer - putting messages into Kafka
        1. Kafka Connect
      3. Consumer - getting messages from Kafka
      4. Setting up multi-broker cluster
      5. Kafka in the purview of an SCV use case
    12. When to use Kafka
    13. When not to use Kafka
    14. Other options
      1. RabbitMQ
      2. ZeroMQ
      3. Apache ActiveMQ
    15. Summary
  9. Data Processing using Apache Flink
    1. Context in a Data Lake - Data Ingestion Layer
      1. Data Ingestion Layer
      2. Data Ingestion Layer - technology mapping
      3. What is Apache Flink?
    2. Why Apache Flink?
      1. History of Flink
      2. Advantages of Flink
      3. Disadvantages of Flink
    3. Working of Flink
      1. Flink architecture
        1. Client
        2. Job Manager
        3. Task Manager
        4. Flink execution model
      2. Core architecture principles of Flink
      3. Flink Component Stack
      4. Checkpointing in Flink
      5. Savepoints in Flink
      6. Streaming window options in Flink
        1. Time window
        2. Count window
        3. Tumbling window configuration
        4. Sliding window configuration
      7. Memory management
    4. Flink API's
      1. DataStream API
        1. Flink DataStream API example
        2. Streaming connectors
      2. DataSet API
        1. Flink DataSet API example
        2. Table API
      3. Flink domain specific libraries
        1. Gelly - Flink Graph API
        2. FlinkML
        3. FlinkCEP
    5. Flink working example
      1. Installation
      2. Example - data processing with Flink
        1. Data generation
        2. Step 1 - Preparing streams
        3. Step 2 - Consuming Streams via Flink
        4. Step 3 - Streaming data into HDFS
      3. Flink in purview of SCV use cases
        1. User Log Data Generation
        2. Flume Setup
        3. Flink Processors
    6. When to use Flink
    7. When not to use Flink
    8. Other options
      1. Apache Spark
      2. Apache Storm
      3. Apache Tez
    9. Summary
  10. Data Store Using Apache Hadoop
    1. Context for Data Lake - Data Storage and lambda Batch layer
      1. Data Storage and the Lambda Batch Layer
      2. Data Storage and Lambda Batch Layer - technology mapping
      3. What is Apache Hadoop?
    2. Why Hadoop?
      1. History of Hadoop
      2. Advantages of Hadoop
      3. Disadvantages of Hadoop
    3. Working of Hadoop
      1. Hadoop core architecture principles
      2. Hadoop architecture
        1. Hadoop architecture 1.x
        2. Hadoop architecture 2.x
      3. Hadoop architecture components
        1. HDFS
        2. YARN
        3. MapReduce
        4. Hadoop ecosystem
      4. Hadoop architecture in detail
    4. Hadoop ecosystem
      1. Data access/processing components
        1. Apache Pig
        2. Apache Hive
      2. Data storage components
        1. Apache HBase
      3. Monitoring, management and orchestration components
        1. Apache ZooKeeper
        2. Apache Oozie
        3. Apache Ambari
      4. Data integration components
        1. Apache Sqoop
        2. Apache Flume
    5. Hadoop distributions
    6. HDFS and formats
    7. Hadoop for near real-time applications
    8. Hadoop deployment modes
    9. Hadoop working examples
      1. Installation
      2. Data preparation
      3. Hive installation
      4. Example - Bulk Data Load
        1. File Data Load
        2. RDBMS Data Load
      5. Example - MapReduce processing
        1. Text Data as Hive Tables
        2. Avro Data as HIVE Table
      6. Hadoop in purview of SCV use case
        1. Initial directory setup
        2. Data loads
        3. Data visualization with HIVE tables
    10. When not to use Hadoop
    11. Other Hadoop Processing Options
    12. Summary
  11. Indexed Data Store using Elasticsearch
    1. Context in Data Lake: data storage and lambda speed layer
      1. Data Storage and Lambda Speed Layer
      2. Data Storage and Lambda Speed Layer: technology mapping
    2. What is Elasticsearch?
    3. Why Elasticsearch
      1. History of Elasticsearch
      2. Advantages of Elasticsearch
      3. Disadvantages of Elasticsearch
    4. Working of Elasticsearch
      1. Elasticsearch core architecture principles
      2. Elasticsearch terminologies
        1. Document in Elasticsearch
        2. Index in Elasticsearch
          1. What is Inverted Index?
        3. Shard in Elasticsearch
        4. Nodes in Elasticsearch
        5. Cluster in Elasticsearch
    5. Elastic Stack
      1. Elastic Stack - Kibana
      2. Elastic Stack - Elasticsearch
      3. Elastic Stack - Logstash
      4. Elastic Stack - Beats
        1. Elastic Stack - X-Pack
    6. Elastic Cloud
      1. Apache Lucene
        1. How Lucene works
    7. Elasticsearch DSL (Query DSL)
      1. Important queries in Query DSL
    8. Nodes in Elasticsearch
      1. Elasticsearch - master node
      2. Elasticsearch - data node
      3. Elasticsearch - client node
    9. Elasticsearch and relational database
    10. Elasticsearch ecosystem
      1. Elasticsearch analyzers
        1. Built-in analyzers
        2. Custom analyzers
      2. Elasticsearch plugins
    11. Elasticsearch deployment options
    12. Clients for Elasticsearch
    13. Elasticsearch for fast streaming layer
    14. Elasticsearch as a data source
    15. Elasticsearch for content indexing
    16. Elasticsearch and Hadoop
    17. Elasticsearch working example
      1. Installation
      2. Creating and Deleting Indexes
      3. Indexing Documents
      4. Getting Indexed Document
      5. Searching Documents
      6. Updating Documents
      7. Deleting a document
      8. Elasticsearch in purview of SCV use case
        1. Data preparation
          1. Initial Cleanup
          2. Data Generation
        2. Customer data import into Hive using Sqoop
        3. Data acquisition via Flume into Kafka channel
        4. Data ingestion via Flink to HDFS and Elasticsearch
          1. Packaging via POM file
          2. Avro schema definitions
          3. Schema deserialization class
          4. Writing to HDFS as parquet files
        5. Writing into Elasticsearch
          1. Command line arguments
        6. Flink deployment
        7. Parquet data visualization as Hive tables
        8. Data indexing from Hive
        9. Query data from ES (customer, address, and contacts)
    18. When to use Elasticsearch
    19. When not to use Elasticsearch
    20. Other options
      1. Apache Solr
    21. Summary
  12. Data Lake Components Working Together
    1. Where we stand with Data Lake
    2. Core architecture principles of Data Lake
    3. Challenges faced by enterprise Data Lake
    4. Expectations from Data Lake
    5. Data Lake for other activities
    6. Knowing more about data storage
      1. Zones in Data Storage
      2. Data Schema and Model
      3. Storage options
        1. Apache HCatalog (Hive Metastore)
      4. Compression methodologies
      5. Data partitioning
    7. Knowing more about Data processing
      1. Data validation and cleansing
      2. Machine learning
      3. Scheduler/Workflow
      4. Apache Oozie
        1. Database setup and configuration
        2. Build from Source
        3. Oozie Workflows
        4. Oozie coordinator
      5. Complex event processing
    8. Thoughts on data security
      1. Apache Knox
      2. Apache Ranger
      3. Apache Sentry
    9. Thoughts on data encryption
      1. Hadoop key management server
    10. Metadata management and governance
      1. Metadata
      2. Data governance
      3. Data lineage
      4. How can we achieve?
        1. Apache Atlas
        2. WhereHows
    11. Thoughts on Data Auditing
    12. Thoughts on data traceability
    13. Knowing more about Serving Layer
      1. Principles of Serving Layer
      2. Service Types
        1. GraphQL
        2. Data Lake with REST API
        3. Business services
      3. Serving Layer components
        1. Data Services
        2. Elasticsearch & HBase
        3. Apache Hive & Impala
        4. RDBMS
      4. Data exports
      5. Polyglot data access
      6. Example: serving layer
    14. Summary
  13. Data Lake Use Case Suggestions
    1. Establishing cybersecurity practices in an enterprise
    2. Know the customers dealing with your enterprise
    3. Bring efficiency in warehouse management
    4. Developing a brand and marketing of the enterprise
    5. Achieve a higher degree of personalization with customers
    6. Bringing IoT data analysis at your fingertips
    7. More practical and useful data archival
    8. Complement the existing data warehouse infrastructure
    9. Achieving telecom security and regulatory compliance
    10. Summary