book

Big Data Simplified

Name: Big Data Simplified
ISBN: 9789353941505

by Sayan Goswami, Amit Kumar Das, Sourabh Mukherjee

June 2019

Beginner to intermediate

360 pages

10h 55m

English

Pearson Education India

Read now

Unlock full access

About the Authors
Model Syllabus for Big Data
Lesson Plan
Chapter 1 A Closer Look at Data
1.1 Introduction
1.2 Types of Data1.2.1 Structured Data
1.2.2 Unstructured Data
1.2.3 Semi-Structured Data
1.3 The Emergence of ‘New Data’
1.4 ‘New’ Data and ‘Traditional’ Data Compared
Summary
Multiple-choice Questions (1 Mark Questions)
Short-answer Type Questions (5 Marks Questions)
Long-answer Type Questions (10 Marks Questions)
Chapter 2 Introducing Big Data
2.1 Introduction
2.2 The Transition to Big Data
2.3 The Definition of Big Data
2.4 The V’s
2.5 Sources of Big Data
2.6 Common Applications of Big Data
2.7 An Introduction to Big Data Technologies2.7.1 Hadoop
2.7.2 MapReduce
2.7.3 Hadoop Affiliate Technologies
2.7.4 Massively Parallel Processing
2.7.5 NoSQL
2.7.6 Hadoop Hybrids
2.8 An Overview of Popular Vendors
2.8.1 Hadoop Distributions2.8.2 Hadoop in the Cloud
2.8.3 HDFS-Alternative Products
2.8.4 NoSQL2.8.5 MPP Products2.8.6 Hybrids2.8.7 Data Integration, Visualization, Analytics2.8.8 Business Intelligence (BI)
Summary
Multiple-choice Questions (1 Mark Questions)
Short-answer Type Questions (5 Marks Questions)
Long-answer Type Questions (10 Marks Questions)
Chapter 3 Introducing Hadoop
3.1 Introduction
3.2 An Overview of Hadoop
3.3 Configuring a Hadoop Cluster (1/2)
3.3 Configuring a Hadoop Cluster (2/2)
3.4 Storing Data with HDFS
3.4.1 The NameNode and DataNodes
3.4.2 Storing and Reading Files from HDFS
3.4.3 Fault Tolerance with Replication
3.4.4 NameNode Failure Management
3.5 HDFS Technical Commands
3.6 Hadoop Distributions
3.7 Hadoop in the Cloud
Summary
Multiple-choice Questions (1 Mark Questions)
Short-answer Type Questions (5 Marks Questions)
Long-answer Type Questions (10 Marks Questions)
Chapter 4 Introducing MapReduce
4.1 Introduction
4.2 Processing Data with MapReduce
4.2.1 A MapReduce Example
4.2.2 Technical Flow of a MapReduce Job
4.2.3 End-to-End Technical Anatomy of a MapReduce Job
4.3 Parallelism in Map and Reduce Phases
4.3.1 Using a Single Reducer
4.3.2 Using Multiple Reducers
4.4 Optimize the Map Phase Using a Combiner
4.4.1 Reducers as Combiners
4.5 What is YARN?
4.5.1 Scheduling and Managing Tasks4.5.2 Job Execution in the Hadoop Cluster
4.5.3 Troubleshoot a MapReduce Job in Hadoop Cluster
4.6 Example Use Case on MapReduce: Development and Execution Step-by-step (1/2)
4.6 Example Use Case on MapReduce: Development and Execution Step-by-step (2/2)
Summary
Multiple-choice Questions (1 Mark Questions)
Short-answer Type Questions (5 Marks Questions)
Long-answer Type Questions (10 Marks Questions)
Chapter 5 Introducing NoSQL
5.1 Introduction
5.2 NoSQL Databases in the Light of CAP Theorem
5.3 NoSQL Product Categories
5.3.1 Key-value Stores
5.3.2 Wide Column Stores or Columnar Stores
5.3.3 Document Stores
5.3.4 Graph Databases
5.4 NoSQL Database: Cassandra
5.4.1 Characteristics of Cassandra5.4.2 Cassandra Architecture
5.4.3 Components of Cassandra
5.4.4 Cassandra Write Operations at a Node Level
5.4.5 Cassandra Node Level Read Operation
5.4.6 KEYSPACE in Cassandra
5.4.7 Starting Cassandra Server and Cqlsh Query Editor
5.4.8 DataStax Distribution Package
5.5 NoSQL Databases in the Cloud
5.6 NoSQL – Do’s and Don’ts
5.7 Business Intelligence and NoSQL
5.8 Big Data and NoSQL
Summary
Multiple-choice Questions (1 Mark Questions)
Short-answer Type Questions (5 Marks Questions)
Long-answer Type Questions (10 Marks Questions)
Chapter 6 Introducing Spark and Kafka
6.1 Introducing Spark
6.1.1 Hadoop and Spark
6.1.2 Spark Programming Languages
6.1.3 Understanding Spark Architecture
6.1.4 Spark Libraries: Spark SQL
6.1.5 Spark Libraries: Streaming
6.1.6 Spark Libraries: Machine Learning
6.1.7 Spark Libraries: GraphX
6.1.8 PySpark: Spark with Python
6.2 Working with Kafka
6.2.1 What is Apache Kafka
6.2.2 Kafka Architecture
6.2.3 Need of Apache Kafka in Big Data
6.2.4 Kafka Use Cases
6.2.5 Why is Kafka so Fast?
6.2.6 Kafka Needs ZooKeeper6.2.7 Different Components in Kafka
6.2.8 Difference between Apache Kafka and Apache Flume
6.2.9 Kafka Demonstration—How Messages are Passing from Publisher to Consumer through a Topic
Summary
Multiple-choice Questions (1 Mark Questions)
Short-answer Type Questions (5 Marks Questions)
Long-answer Type Questions (10 Marks Questions)
Chapter 7 Other BigData Tools and Technologies
7.1 Introduction
7.2 Hive7.2.1 Hive Architecture
7.2.2 Data Flow in Hive
7.2.3 Data Types in Hive
7.2.4 Different Types of Tables in Hive (1/2)
7.2.4 Different Types of Tables in Hive (2/2)
7.2.5 Partitioning and Bucketing in Hive
7.3 Pig
7.3.1 Why Apache Pig
7.3.2 Features of Apache Pig
7.3.3 Apache Pig vs. MapReduce7.3.4 Pig Architecture
7.4 Sqoop and Flume
7.4.1 SqoopEXPORT (Data Transfer from HDFS to MySQL)
7.4.2 Sqoop IMPORT (Importing Fresh Table from MySQL to HIVE)
7.4.3 Flume
7.4.4 Components of Flume
7.4.5 Configure Flume to Ingest Web Log Data from a Local Directory to HDFS
7.5 Oozie
7.5.1 Oozie Workflow
7.6 Lucene and Solr
7.6.1 Lucene in Search Applications
7.6.2 Features of Apache Solr
7.6.3 Apache Solr—Basic Commands
7.7 Zookeeper7.8 Apache NiFi7.8.1 What Apache NiFi Does
Summary
Multiple-choice Questions (1 Mark Questions)
Short-answer Type Questions (5 Marks Questions)
Long-answer Type Questions (10 Marks Questions)
Chapter 8 Working with Big Data in R
8.1 Prerequisites
8.1.1 Install R in Your System8.1.2 Know How to Manage R Scripts
8.1.3 Introduction to Basic R Commands (1/3)
8.1.3 Introduction to Basic R Commands (2/3)
8.1.3 Introduction to Basic R Commands (3/3)
8.2 Exploratory Data Analysis
8.2.1 Basic Statistical Techniques for Data Exploration
8.2.2 Basic Plots for Data Exploration
8.3 R Libraries for Dealing with Large Data Sets
8.3.1 ff and ffbase Packages
8.3.2 Parallel Package
8.3.3 data.table Package
8.4 Integrating Hadoop with R
8.5 Simple R Program with Hadoop
Summary
Multiple-choice Questions (1 Mark Questions)
Short-answer Type Questions (5 Marks Questions)
Long-answer Type Questions (10 Marks Questions)
Chapter 9 Working with Big Data in Python
9.1 Prerequisites
9.1.1 Install Python in Your System9.1.2 Know How to Manage Python Scripts
9.1.3 Introduction to Basic Python Commands
9.2 Basic Libraries in Python
9.2.1 NumPy Library (1/2)
9.2.1 NumPy Library (2/2)
9.2.2 Pandas Library
9.2.3 Matplotlib Library
9.3 Python Libraries for Dealing with Large Data Sets
9.3.1 numpy.memmap Object
9.3.2 Parallel Computing Using mp4pi Library
9.4 Python-MapReduce Using Hadoop Streaming
9.4.1 What is Hadoop Streaming?
9.4.2 Python MapReduce Code
9.4.3 Step by Step Execution
9.4.4 Running the MapReduce Python Code on Hadoop
Summary
Multiple-choice Questions (1 Mark Questions)
Short-answer Type Questions (5 Marks Questions)
Long-answer Type Questions (10 Marks Questions)
Chapter 10 Big Data Applied
10.1 Introduction
10.2 Big Data and Data Science10.2.1 What is Data Science?10.2.2 Who is a Data Scientist?
10.2.3 How Do We Do Define ‘Data Science’?
10.2.4 Common Pitfalls of Data Science
10.3 Big Data and IoT
10.3.1 What is IoT?10.3.2 Overview of IoT Architecture
10.3.3 IoT in Action
10.3.4 Impacts of IoT
10.3.5 Applications of Big Data and IoT
10.4 Big Data and Recommendation Engines
10.4.1 What is a Recommendation?10.4.2 What are Recommendation Engines?
10.4.3 What are the Types of Recommendation Engines?
10.4.4 How is Big Data Used in a Recommendation Engine?
Summary
Multiple-choice Questions (1 Mark Questions)
Short-answer Type Questions (5 Marks Questions)
Long-answer Type Questions (10 Marks Questions)
Chapter 11 Big Data Strategy
11.1 Introduction
11.2 Two Typical Big Data Use Cases11.2.1 Big Data Primarily for Cost Reduction
11.2.2 Big Data Primarily for Enhanced Value
11.3 Data Warehouses vs. Data Lakes—What is Your Strategy?
11.3.1 Differences between Data Warehouse and Data Lake
11.4 Key Questions to Ask
11.5 Getting Ready for a Big Data Program
11.6 Making Technology Choices
11.7 Making Tooling Choices
Summary
Short-answer Type Questions (5 Marks Questions)
Long-answer Type Questions (10 Marks Questions)
Chapter 12 Case Study: Retail Near Real-time Analytics
12.1 Introduction to Retail Domain
12.1.1 What is Retail in the First Place?
12.1.2 So, Why is Retailing So Important?
12.2 Near Real-time Analytics: Problem Statement
12.3 NRT Analytics: Solution Approach
12.4 NRT Analytics: Details of Solution Implemented (1/3)
12.4 NRT Analytics: Details of Solution Implemented (2/3)
12.4 NRT Analytics: Details of Solution Implemented (3/3)
12.4.1 Data from Producer
12.4.2 Output After Running Analysis Using Spark12.4.3 Data Saved in Cassandra12.4.4 Kafka Producer Streamed in Batch Mode After Every 2 Minutes
12.4.5 Data Streamed After 2 Minutes Containing the New Data
12.4.6 New Data Got Entered in CassandraSummary
Multiple-choice Questions (1 Mark Questions)
Short-answer Type Questions (5 Marks Questions)
Appendix (1/2)
Appendix (2/2)
Index

Content preview from Big Data Simplified

Introducing Hadoop | 59

3.4.4 NameNode Failure Management

Now that you have an understanding of how replication handles failures in DataNodes, let us

take a look at failure management for the NameNode. As you now realize, the NameNode is the

most important node in the cluster. Without the NameNode, we will have no idea which le is

stored on which DataNode. In the mappings that the NameNode holds within itself, the block

locations are not persistent. These block locations are stored in memory for quick lookup, where

it is called block caching.

If the NameNode fails, then the file block locations in memory will be completely lost.

Therefore, ...

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.

Read now

Unlock full access

More than 5,000 organizations count on O’Reilly

O’Reilly covers everything we've got, with content to help us build a world-class technology community, upgrade the capabilities and competencies of our teams, and improve overall team performance as well as their engagement.

Julian F.

Head of Cybersecurity

I wanted to learn C and C++, but it didn't click for me until I picked up an O'Reilly book. When I went on the O’Reilly platform, I was astonished to find all the books there, plus live events and sandboxes so you could play around with the technology.

Addison B.

Field Engineer

I’ve been on the O’Reilly platform for more than eight years. I use a couple of learning platforms, but I'm on O'Reilly more than anybody else. When you're there, you start learning. I'm never disappointed.

Amir M.

Data Platform Tech Lead

I'm always learning. So when I got on to O'Reilly, I was like a kid in a candy store. There are playlists. There are answers. There's on-demand training. It's worth its weight in gold, in terms of what it allows me to do.

Mark W.

Embedded Software Engineer

Publisher Resources

ISBN: 9789353941505

Cloud Computing

Data Engineering

Data Science

AI & ML

Programming Languages

Software Architecture

IT/Ops

Security

Design

Business

Soft Skills

Big Data Simplified

by Sayan Goswami, Amit Kumar Das, Sourabh Mukherjee

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.

More than 5,000 organizations count on O’Reilly

Julian F.

Addison B.

Amir M.

Mark W.

You might also like

Big Data

Big Data for Architects

Modern Big Data Processing with Hadoop

Practical Big Data Analytics

Publisher Resources