Big Data Analytics with Hadoop 3

Book Description

Explore big data concepts, platforms, analytics, and their applications using the power of Hadoop 3

About This Book
  • Learn Hadoop 3 to build effective big data analytics solutions on-premises and in the cloud
  • Integrate Hadoop with other big data tools such as R, Python, Apache Spark, and Apache Flink
  • Exploit big data using Hadoop 3 with real-world examples
Who This Book Is For

Big Data Analytics with Hadoop 3 is for you if you want to build high-performance analytics solutions for your enterprise or business using Hadoop 3's powerful features, or if you are new to big data analytics. A basic understanding of the Java programming language is required.

What You Will Learn
  • Explore the new features of Hadoop 3 along with HDFS, YARN, and MapReduce
  • Get well versed in the analytical capabilities of the Hadoop ecosystem using practical examples
  • Integrate Hadoop with R and Python for more efficient big data processing
  • Learn to use Hadoop with Apache Spark and Apache Flink for real-time data analytics
  • Set up a Hadoop cluster on AWS cloud
  • Perform big data analytics on AWS using Amazon EMR (Elastic MapReduce)
In Detail

Apache Hadoop is the most popular platform for big data processing, and can be combined with a host of other big data tools to build powerful analytics solutions. Big Data Analytics with Hadoop 3 shows you how to do just that, by providing insights into the software as well as its benefits with the help of practical examples.

Once you have taken a tour of Hadoop 3's latest features, you will get an overview of HDFS, MapReduce, and YARN, and how they enable faster, more efficient big data processing. You will then learn how to integrate Hadoop with open source tools such as Python and R to analyze and visualize data and perform statistical computing on big data. Next, you will explore how to use Hadoop 3 with Apache Spark and Apache Flink for real-time data analytics and stream processing. Finally, you will learn how to use Hadoop to build analytics solutions in the cloud, and how to build an end-to-end pipeline for big data analysis through practical use cases.

By the end of this book, you will be well versed in the analytical capabilities of the Hadoop ecosystem. You will be able to build powerful solutions that perform big data analytics and extract insights from your data.

Style and approach

Filled with practical examples and use cases, this book will not only help you get up and running with Hadoop, but will also take you further into big data analytics.

Downloading the example code for this book: You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files emailed directly to you.

Table of Contents

  1. Title Page
  2. Copyright and Credits
    1. Big Data Analytics with Hadoop 3
  3. Packt Upsell
    1. Why subscribe?
    2. PacktPub.com
  4. Contributors
    1. About the author
    2. About the reviewers
    3. Packt is searching for authors like you
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Download the color images
      3. Conventions used
    4. Get in touch
      1. Reviews
  6. Introduction to Hadoop
    1. Hadoop Distributed File System
      1. High availability
      2. Intra-DataNode balancer
      3. Erasure coding
      4. Port numbers
    2. MapReduce framework
      1. Task-level native optimization
    3. YARN
      1. Opportunistic containers
        1. Types of container execution 
      2. YARN timeline service v.2
        1. Enhancing scalability and reliability
        2. Usability improvements
        3. Architecture
    4. Other changes
      1. Minimum required Java version 
      2. Shell script rewrite
      3. Shaded-client JARs
    5. Installing Hadoop 3 
      1. Prerequisites
      2. Downloading
      3. Installation
      4. Set up password-less SSH
      5. Setting up the NameNode
      6. Starting HDFS
      7. Setting up the YARN service
      8. Erasure Coding
      9. Intra-DataNode balancer
      10. Installing YARN timeline service v.2
        1. Setting up the HBase cluster
          1. Simple deployment for HBase
        2. Enabling the co-processor
        3. Enabling timeline service v.2
          1. Running timeline service v.2
          2. Enabling MapReduce to write to timeline service v.2
    6. Summary
  7. Overview of Big Data Analytics
    1. Introduction to data analytics
      1. Inside the data analytics process
    2. Introduction to big data
      1. Variety of data
      2. Velocity of data
      3. Volume of data
      4. Veracity of data
      5. Variability of data
      6. Visualization
      7. Value
    3. Distributed computing using Apache Hadoop
    4. The MapReduce framework
    5. Hive
      1. Downloading and extracting the Hive binaries
      2. Installing Derby
      3. Using Hive
        1. Creating a database
        2. Creating a table
      4. SELECT statement syntax
        1. WHERE clauses
      5. INSERT statement syntax
      6. Primitive types
      7. Complex types
      8. Built-in operators and functions
        1. Built-in operators
        2. Built-in functions
      9. Language capabilities
        1. A cheat sheet on retrieving information 
    6. Apache Spark
    7. Visualization using Tableau
    8. Summary
  8. Big Data Processing with MapReduce
    1. The MapReduce framework
      1. Dataset
      2. Record reader
      3. Map
      4. Combiner
      5. Partitioner
      6. Shuffle and sort
      7. Reduce
      8. Output format
    2. MapReduce job types
      1. Single mapper job
      2. Single mapper reducer job
      3. Multiple mappers reducer job
      4. SingleMapperCombinerReducer job
      5. Scenario
    3. MapReduce patterns
      1. Aggregation patterns
        1. Average temperature by city
          1. Record count
          2. Min/max/count
          3. Average/median/standard deviation
      2. Filtering patterns
      3. Join patterns
        1. Inner join
        2. Left anti join
        3. Left outer join
        4. Right outer join
        5. Full outer join
        6. Left semi join
        7. Cross join
    4. Summary
  9. Scientific Computing and Big Data Analysis with Python and Hadoop
    1. Installation
      1. Installing standard Python
      2. Installing Anaconda
        1. Using Conda
    2. Data analysis
    3. Summary
  10. Statistical Big Data Computing with R and Hadoop
    1. Introduction
      1. Install R on workstations and connect to the data in Hadoop
      2. Install R on a shared server and connect to Hadoop
      3. Utilize Revolution R Open
      4. Execute R inside of MapReduce using RMR2
        1. Summary and outlook for pure open source options
    2. Methods of integrating R and Hadoop
      1. RHADOOP – install R on workstations and connect to data in Hadoop
      2. RHIPE – execute R inside Hadoop MapReduce
      3. R and Hadoop Streaming
      4. RHIVE – install R on workstations and connect to data in Hadoop
      5. ORCH – Oracle connector for Hadoop
    3. Data analytics
    4. Summary
  11. Batch Analytics with Apache Spark
    1. SparkSQL and DataFrames
    2. DataFrame APIs and the SQL API
      1. Pivots
      2. Filters
      3. User-defined functions
    3. Schema – structure of data
      1. Implicit schema
      2. Explicit schema
      3. Encoders
    4. Loading datasets
    5. Saving datasets
    6. Aggregations
      1. Aggregate functions
        1. count
        2. first
        3. last
        4. approx_count_distinct
        5. min
        6. max
        7. avg
        8. sum
        9. kurtosis
        10. skewness
        11. Variance
        12. Standard deviation
        13. Covariance
        14. groupBy
        15. Rollup
        16. Cube
      2. Window functions
      3. ntiles
    7. Joins
      1. Inner workings of join
      2. Shuffle join
      3. Broadcast join
      4. Join types
      5. Inner join
      6. Left outer join
      7. Right outer join
      8. Outer join
      9. Left anti join
      10. Left semi join
      11. Cross join
      12. Performance implications of join
    8. Summary
  12. Real-Time Analytics with Apache Spark
    1. Streaming
      1. At-least-once processing
      2. At-most-once processing
      3. Exactly-once processing
    2. Spark Streaming
      1. StreamingContext
      2. Creating StreamingContext
      3. Starting StreamingContext
      4. Stopping StreamingContext
        1. Input streams
          1. receiverStream
          2. socketTextStream
          3. rawSocketStream
    3. fileStream
      1. textFileStream
      2. binaryRecordsStream
      3. queueStream
        1. textFileStream example
        2. twitterStream example
      4. Discretized Streams
    4. Transformations
      1. Windows operations
      2. Stateful/stateless transformations
        1. Stateless transformations
        2. Stateful transformations
    5. Checkpointing
      1. Metadata checkpointing
      2. Data checkpointing
    6. Driver failure recovery
    7. Interoperability with streaming platforms (Apache Kafka)
      1. Receiver-based
      2. Direct Stream
      3. Structured Streaming
        1. Getting deeper into Structured Streaming
    8. Handling event time and late data
    9. Fault-tolerance semantics
    10. Summary
  13. Batch Analytics with Apache Flink
    1. Introduction to Apache Flink
      1. Continuous processing for unbounded datasets
      2. Flink, the streaming model, and bounded datasets
    2. Installing Flink
      1. Downloading Flink
      2. Installing Flink
        1. Starting a local Flink cluster
    3. Using the Flink cluster UI
    4. Batch analytics
      1. Reading file
        1. File-based
        2. Collection-based
        3. Generic
      2. Transformations
      3. GroupBy
      4. Aggregation
      5. Joins
        1. Inner join
        2. Left outer join
        3. Right outer join
        4. Full outer join
      6. Writing to a file
    5. Summary
  14. Stream Processing with Apache Flink
    1. Introduction to streaming execution model
    2. Data processing using the DataStream API
      1. Execution environment
      2. Data sources
        1. Socket-based
        2. File-based
      3. Transformations
        1. map
        2. flatMap
        3. filter
        4. keyBy
        5. reduce
        6. fold
        7. Aggregations
        8. window
          1. Global windows
          2. Tumbling windows
          3. Sliding windows
          4. Session windows
        9. windowAll
        10. union
        11. Window join
        12. split
        13. Select
        14. Project
        15. Physical partitioning
          1. Custom partitioning
          2. Random partitioning
          3. Rebalancing partitioning
        16. Rescaling
        17. Broadcasting
        18. Event time and watermarks
        19. Connectors
          1. Kafka connector
          2. Twitter connector
          3. RabbitMQ connector
          4. Elasticsearch connector
          5. Cassandra connector
    3. Summary
  15. Visualizing Big Data
    1. Introduction
    2. Tableau
    3. Chart types
      1. Line charts
      2. Pie chart
      3. Bar chart
      4. Heat map
    4. Using Python to visualize data
    5. Using R to visualize data
    6. Big data visualization tools
    7. Summary
  16. Introduction to Cloud Computing
    1. Concepts and terminology
      1. Cloud
      2. IT resource
      3. On-premise
      4. Cloud consumers and Cloud providers
      5. Scaling
        1. Types of scaling
          1. Horizontal scaling
          2. Vertical scaling
        2. Cloud service
        3. Cloud service consumer
    2. Goals and benefits
      1. Increased scalability
      2. Increased availability and reliability
    3. Risks and challenges
      1. Increased security vulnerabilities
      2. Reduced operational governance control
      3. Limited portability between Cloud providers
    4. Roles and boundaries
      1. Cloud provider
      2. Cloud consumer
      3. Cloud service owner
      4. Cloud resource administrator
        1. Additional roles
        2. Organizational boundary
        3. Trust boundary
    5. Cloud characteristics
      1. On-demand usage
      2. Ubiquitous access
      3. Multi-tenancy (and resource pooling)
      4. Elasticity
      5. Measured usage
      6. Resiliency
    6. Cloud delivery models
      1. Infrastructure as a Service
      2. Platform as a Service
      3. Software as a Service
      4. Combining Cloud delivery models
        1. IaaS + PaaS
        2. IaaS + PaaS + SaaS
    7. Cloud deployment models
      1. Public Clouds
      2. Community Clouds
      3. Private Clouds
      4. Hybrid Clouds
    8. Summary
  17. Using Amazon Web Services
    1. Amazon Elastic Compute Cloud
      1. Elastic web-scale computing
      2. Complete control of operations
      3. Flexible Cloud hosting services
      4. Integration
      5. High reliability
      6. Security
      7. Inexpensive
      8. Easy to start
      9. Instances and Amazon Machine Images
    2. Launching multiple instances of an AMI
      1. Instances
      2. AMIs
      3. Regions and availability zones
      4. Region and availability zone concepts
      5. Regions
      6. Availability zones
      7. Available regions
      8. Regions and endpoints
      9. Instance types
        1. Tag basics
        2. Amazon EC2 key pairs
        3. Amazon EC2 security groups for Linux instances
        4. Elastic IP addresses
      10. Amazon EC2 and Amazon Virtual Private Cloud
        1. Amazon Elastic Block Store
        2. Amazon EC2 instance store
    3. What is AWS Lambda?
      1. When should I use AWS Lambda?
    4. Introduction to Amazon S3
      1. Getting started with Amazon S3
      2. Comprehensive security and compliance capabilities
      3. Query in place
      4. Flexible management
      5. Most supported platform with the largest ecosystem
      6. Easy and flexible data transfer
      7. Backup and recovery
      8. Data archiving
      9. Data lakes and big data analytics
      10. Hybrid Cloud storage
      11. Cloud-native application data
      12. Disaster recovery
    5. Amazon DynamoDB
    6. Amazon Kinesis Data Streams
      1. What can I do with Kinesis Data Streams?
        1. Accelerated log and data feed intake and processing
        2. Real-time metrics and reporting
        3. Real-time data analytics
        4. Complex stream processing
        5. Benefits of using Kinesis Data Streams
    7. AWS Glue
      1. When should I use AWS Glue?
    8. Amazon EMR
      1. Practical AWS EMR cluster
    9. Summary