Modern Big Data Processing with Hadoop

Book Description

A comprehensive guide to designing, building, and executing effective Big Data strategies using Hadoop

About This Book

  • Get an in-depth view of the Apache Hadoop ecosystem and an overview of the architectural patterns pertaining to the popular Big Data platform
  • Conquer different data processing and analytics challenges using a multitude of tools such as Apache Spark, Elasticsearch, Tableau and more
  • A comprehensive, step-by-step guide that teaches you what you need to know to become an expert Hadoop architect

Who This Book Is For

This book is for Big Data professionals who want to fast-track their career in the Hadoop industry and become expert Big Data architects. Project managers and mainframe professionals looking to build a career in Big Data with Hadoop will also find this book useful. Some understanding of Hadoop is required to get the best out of this book.

What You Will Learn

  • Build an efficient enterprise Big Data strategy centered around Apache Hadoop
  • Gain a thorough understanding of using Hadoop with various Big Data frameworks such as Apache Spark, Elasticsearch and more
  • Set up and deploy your Big Data environment on premises or on the cloud with Apache Ambari
  • Design effective streaming data pipelines and build your own enterprise search solutions
  • Use historical data to build your analytics solutions and visualize them using popular tools such as Apache Superset
  • Plan, set up and administer your Hadoop cluster efficiently

In Detail

The complex structure of today's data requires sophisticated transformation solutions that make information more accessible to users. This book empowers you to build such solutions with relative ease using Apache Hadoop, along with a host of other Big Data tools.

This book will give you a complete understanding of data lifecycle management with Hadoop, followed by modeling of structured and unstructured data in Hadoop. It will also show you how to design real-time streaming pipelines by leveraging tools such as Apache Spark, and build efficient enterprise search solutions using Elasticsearch. You will learn to build enterprise-grade analytics solutions on Hadoop, and how to visualize your data using tools such as Apache Superset. This book also covers techniques for deploying your Big Data solutions on the cloud with Apache Ambari, as well as expert techniques for managing and administering your Hadoop cluster.

By the end of this book, you will have all the knowledge you need to build expert Big Data systems.

Style and approach

A comprehensive guide that blends theory, examples, and implementation of real-world use cases

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Title Page
  2. Copyright and Credits
    1. Modern Big Data Processing with Hadoop
  3. Packt Upsell
    1. Why subscribe?
    2. PacktPub.com
  4. Contributors
    1. About the authors
    2. About the reviewers
    3. Packt is searching for authors like you
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Download the color images
      3. Conventions used
    4. Get in touch
      1. Reviews
  6. Enterprise Data Architecture Principles
    1. Data architecture principles
      1. Volume
      2. Velocity
      3. Variety
      4. Veracity
    2. The importance of metadata
    3. Data governance
      1. Fundamentals of data governance
    4. Data security
      1. Application security
      2. Input data
      3. Big data security
      4. RDBMS security
      5. BI security
      6. Physical security
      7. Data encryption
      8. Secure key management
    5. Data as a Service
    6. Evolution data architecture with Hadoop
      1. Hierarchical database architecture
      2. Network database architecture
      3. Relational database architecture
        1. Employees
        2. Devices
        3. Department
        4. Department and employee mapping table
      4. Hadoop data architecture
        1. Data layer
        2. Data management layer
        3. Job execution layer
    7. Summary
  7. Hadoop Life Cycle Management
    1. Data wrangling
      1. Data acquisition
      2. Data structure analysis
      3. Information extraction
      4. Unwanted data removal
      5. Data transformation
      6. Data standardization
    2. Data masking
      1. Substitution
        1. Static 
        2. Dynamic
          1. Encryption
          2. Hashing
        3. Hiding
        4. Erasing
        5. Truncation
        6. Variance
        7. Shuffling
    3. Data security
      1. What is Apache Ranger?
      2. Apache Ranger installation using Ambari
        1. Ambari admin UI
        2. Add service
        3. Service placement
        4. Service client placement
        5. Database creation on master
        6. Ranger database configuration
        7. Configuration changes
        8. Configuration review
        9. Deployment progress
        10. Application restart
      3. Apache Ranger user guide
        1. Login to UI
        2. Access manager
        3. Service details
        4. Policy definition and auditing for HDFS
    4. Summary
  8. Hadoop Design Consideration
    1. Understanding data structure principles
    2. Installing Hadoop cluster
      1. Configuring Hadoop on NameNode
      2. Format NameNode
      3. Start all services
    3. Exploring HDFS architecture
      1. Defining NameNode
        1. Secondary NameNode
        2. NameNode safe mode
      2. DataNode
        1. Data replication
      3. Rack awareness
      4. HDFS WebUI
    4. Introducing YARN
      1. YARN architecture
        1. Resource manager
        2. Node manager
      2. Configuration of YARN
    5. Configuring HDFS high availability
      1. During Hadoop 1.x
      2. During Hadoop 2.x and onwards
      3. HDFS HA cluster using NFS
        1. Important architecture points
      4. Configuration of HA NameNodes with shared storage
      5. HDFS HA cluster using the quorum journal manager
        1. Important architecture points
    6. Configuration of HA NameNodes with QJM
      1. Automatic failover
        1. Important architecture points
      2. Configuring automatic failover
    7. Hadoop cluster composition
      1. Typical Hadoop cluster
    8. Best practices Hadoop deployment
    9. Hadoop file formats
      1. Text/CSV file
      2. JSON
      3. Sequence file
      4. Avro
      5. Parquet
      6. ORC
      7. Which file format is better?
    10. Summary
  9. Data Movement Techniques
    1. Batch processing versus real-time processing
      1. Batch processing
      2. Real-time processing
    2. Apache Sqoop
      1. Sqoop Import
        1. Import into HDFS
        2. Import a MySQL table into an HBase table
      2. Sqoop export
    3. Flume
      1. Apache Flume architecture
      2. Data flow using Flume
      3. Flume complex data flow architecture
        1. Flume setup
      4. Log aggregation use case
    4. Apache NiFi
      1. Main concepts of Apache NiFi
      2. Apache NiFi architecture
      3. Key features
      4. Real-time log capture dataflow
    5. Kafka Connect
      1. Kafka Connect – a brief history
      2. Why Kafka Connect?
      3. Kafka Connect features
      4. Kafka Connect architecture
      5. Kafka Connect workers modes
        1. Standalone mode
        2. Distributed mode
      6. Kafka Connect cluster distributed architecture
        1. Example 1
        2. Example 2
    6. Summary
  10. Data Modeling in Hadoop
    1. Apache Hive
      1. Apache Hive and RDBMS
    2. Supported datatypes
    3. How Hive works
    4. Hive architecture
    5. Hive data model management
      1. Hive tables
        1. Managed tables
        2. External tables
      2. Hive table partition
        1. Hive static partitions and dynamic partitions
      3. Hive partition bucketing
        1. How Hive bucketing works
        2. Creating buckets in a non-partitioned table
        3. Creating buckets in a partitioned table
      4. Hive views
        1. Syntax of a view
        2. Hive indexes
          1. Compact index
          2. Bitmap index
    6. JSON documents using Hive
      1. Example 1 – Accessing simple JSON documents with Hive (Hive 0.14 and later versions)
      2. Example 2 – Accessing nested JSON documents with Hive (Hive 0.14 and later versions)
      3. Example 3 – Schema evolution with Hive and Avro (Hive 0.14 and later versions)
    7. Apache HBase
      1. Differences between HDFS and HBase
      2. Differences between Hive and HBase
      3. Key features of HBase
      4. HBase data model
      5. Difference between RDBMS table and column-oriented data store
      6. HBase architecture
        1. HBase architecture in a nutshell
        2. HBase rowkey design
      7. Example 4 – loading data from MySQL table to HBase table
      8. Example 5 – incrementally loading data from MySQL table to HBase table
      9. Example 6 – Load the MySQL customer changed data into the HBase table
      10. Example 7 – Hive HBase integration
    8. Summary
  11. Designing Real-Time Streaming Data Pipelines
    1. Real-time streaming concepts
      1. Data stream
      2. Batch processing versus real-time data processing
      3. Complex event processing 
      4. Continuous availability
      5. Low latency
      6. Scalable processing frameworks
      7. Horizontal scalability
      8. Storage
    2. Real-time streaming components
      1. Message queue
        1. So what is Kafka?
      2. Kafka features
      3. Kafka architecture
        1. Kafka architecture components
      4. Kafka Connect deep dive
      5. Kafka Connect architecture
        1. Kafka Connect workers standalone versus distributed mode
          1. Install Kafka
          2. Create topics
          3. Generate messages to verify the producer and consumer
          4. Kafka Connect using file Source and Sink
          5. Kafka Connect using JDBC and file Sink Connectors
    3. Apache Storm
      1. Features of Apache Storm
      2. Storm topology
        1. Storm topology components
      3. Installing Storm on a single node cluster
      4. Developing a real-time streaming pipeline with Storm
        1. Streaming a pipeline from Kafka to Storm to MySQL
        2. Streaming a pipeline with Kafka to Storm to HDFS
    4. Other popular real-time data streaming frameworks
      1. Kafka Streams API
      2. Spark Streaming
      3. Apache Flink
    5. Apache Flink versus Spark
    6. Apache Spark versus Storm
    7. Summary
  12. Large-Scale Data Processing Frameworks
    1. MapReduce
    2. Hadoop MapReduce
      1. Streaming MapReduce
      2. Java MapReduce
      3. Summary
    3. Apache Spark 2
      1. Installing Spark using Ambari
        1. Service selection in Ambari Admin
        2. Add Service Wizard
        3. Server placement
        4. Clients and Slaves selection
        5. Service customization
        6. Software deployment
        7. Spark installation progress
        8. Service restarts and cleanup
      2. Apache Spark data structures
        1. RDDs, DataFrames and datasets
      3. Apache Spark programming
        1. Sample data for analysis
        2. Interactive data analysis with pyspark
        3. Standalone application with Spark
        4. Spark streaming application
        5. Spark SQL application
    4. Summary
  13. Building Enterprise Search Platform
    1. The data search concept
    2. The need for an enterprise search engine
      1. Tools for building an enterprise search engine
    3. Elasticsearch
      1. Why Elasticsearch?
      2. Elasticsearch components
        1. Index
        2. Document
        3. Mapping
        4. Cluster
        5. Type
    4. How to index documents in Elasticsearch?
      1. Elasticsearch installation
        1. Installation of Elasticsearch
        2. Create index
        3. Primary shard
        4. Replica shard
          1. Ingest documents into index
        5. Bulk Insert
        6. Document search
        7. Meta fields
    5. Mapping
      1. Static mapping
      2. Dynamic mapping
    6. Elasticsearch-supported data types
      1. Mapping example
    7. Analyzer
      1. Elasticsearch stack components
        1. Beats
    8. Logstash
    9. Kibana
    10. Use case
    11. Summary
  14. Designing Data Visualization Solutions
    1. Data visualization
      1. Bar/column chart
      2. Line/area chart
      3. Pie chart
      4. Radar chart
      5. Scatter/bubble chart
      6. Other charts
    2. Practical data visualization in Hadoop
      1. Apache Druid
        1. Druid components
        2. Other required components
        3. Apache Druid installation
          1. Add service
          2. Select Druid and Superset
          3. Service placement on servers
          4. Choose Slaves and Clients
          5. Service configurations
          6. Service installation
          7. Installation summary
          8. Sample data ingestion into Druid
      2. MySQL database
        1. Sample database
          1. Download the sample dataset
          2. Copy the data to MySQL
          3. Verify integrity of the tables
          4. Single Normalized Table
      3. Apache Superset
        1. Accessing the Superset application
        2. Superset dashboards
        3. Understanding Wikipedia edits data
        4. Create Superset Slices using Wikipedia data
          1. Unique users count
          2. Word Cloud for top US regions
          3. Sunburst chart – top 10 cities
          4. Top 50 channels and namespaces via directed force layout
          5. Top 25 countries/channels distribution
        5. Creating wikipedia edits dashboard from Slices
      4. Apache Superset with RDBMS
        1. Supported databases
        2. Understanding employee database
          1. Employees table
          2. Departments table
          3. Department manager table
          4. Department Employees Table
          5. Titles table
          6. Salaries table
          7. Normalized employees table
        3. Superset Slices for employees database
          1. Register MySQL database/table
        4. Slices and Dashboard creation
          1. Department salary breakup
          2. Salary Diversity
          3. Salary Change Per Role Per Year
          4. Dashboard creation
    3. Summary
  15. Developing Applications Using the Cloud
    1. What is the Cloud?
    2. Available technologies in the Cloud
    3. Planning the Cloud infrastructure
      1. Dedicated servers versus shared servers
        1. Dedicated servers
        2. Shared servers
      2. High availability
      3. Business continuity planning
        1. Infrastructure unavailability
        2. Natural disasters
        3. Business data
        4. BCP design example
          1. The Hot–Hot system
          2. The Hot–Cold system
      4. Security
        1. Server security
        2. Application security
        3. Network security
        4. Single Sign On
        5. The AAA requirement
    4. Building a Hadoop cluster in the Cloud
      1. Google Cloud Dataproc
        1. Getting a Google Cloud account
        2. Activating the Google Cloud Dataproc service
        3. Creating a new Hadoop cluster
        4. Logging in to the cluster
        5. Deleting the cluster 
    5. Data access in the Cloud
      1. Block storage
      2. File storage
      3. Encrypted storage
      4. Cold storage
    6. Summary
  16. Production Hadoop Cluster Deployment
    1. Apache Ambari architecture
      1. The Ambari server
        1. Daemon management
        2. Software upgrade
        3. Software setup
        4. LDAP/PAM/Kerberos management
        5. Ambari backup and restore
        6. Miscellaneous options
      2. Ambari Agent
      3. Ambari web interface
      4. Database
    2. Setting up a Hadoop cluster with Ambari
      1. Server configurations
      2. Preparing the server 
      3. Installing the Ambari server 
      4. Preparing the Hadoop cluster
      5. Creating the Hadoop cluster 
      6. Ambari web interface
      7. The Ambari home page
        1. Creating a cluster
        2. Managing users and groups
        3. Deploying views
      8. The cluster install wizard
        1. Naming your cluster
        2. Selecting the Hadoop version 
        3. Selecting a server 
        4. Setting up the node
        5. Selecting services
        6. Service placement on nodes
        7. Selecting slave and client nodes 
        8. Customizing services
        9. Reviewing the services
        10. Installing the services on the nodes
        11. Installation summary
        12. The cluster dashboard
    3. Hadoop clusters
      1. A single cluster for the entire business
      2. Multiple Hadoop clusters
        1. Redundancy
          1. A fully redundant Hadoop cluster
          2. A data redundant Hadoop cluster
        2. Cold backup
        3. High availability
        4. Business continuity
        5. Application environments
      3. Hadoop data copy
        1. HDFS data copy
    4. Summary