Big Data Simplified

Book description

Big Data Simplified blends technology with strategy and delves into applications of big data in specialized areas such as recommendation engines, data science and the Internet of Things (IoT), enabling a practitioner to make the right technology choice. The steps to strategize a big data implementation are also discussed in detail. This book presents a holistic approach to the topic, covering a wide landscape of big data technologies like Hadoop 2.0 and package implementations such as Cloudera. It also includes in-depth discussion of associated technologies such as MapReduce, Hive, Pig, Oozie, Apache ZooKeeper, Flume, Kafka, Spark, Python and NoSQL databases like Cassandra, MongoDB and GraphDB.

Table of contents

  1. Cover
  2. About Pearson
  3. Title
  4. Copyright
  5. Dedication
  6. Brief Contents
  7. Contents (1/2)
  8. Contents (2/2)
  9. Preface
  10. Acknowledgements
  11. About the Authors
  12. Model Syllabus for Big Data
  13. Lesson Plan
  14. Chapter 1 A Closer Look at Data
    1. 1.1 Introduction
    2. 1.2 Types of Data
      1. 1.2.1 Structured Data
      2. 1.2.2 Unstructured Data
      3. 1.2.3 Semi-Structured Data
    3. 1.3 The Emergence of ‘New Data’
    4. 1.4 ‘New’ Data and ‘Traditional’ Data Compared
    5. Summary
    6. Multiple-choice Questions (1 Mark Questions)
    7. Short-answer Type Questions (5 Marks Questions)
    8. Long-answer Type Questions (10 Marks Questions)
  15. Chapter 2 Introducing Big Data
    1. 2.1 Introduction
    2. 2.2 The Transition to Big Data
    3. 2.3 The Definition of Big Data
    4. 2.4 The V’s
    5. 2.5 Sources of Big Data
    6. 2.6 Common Applications of Big Data
    7. 2.7 An Introduction to Big Data Technologies
      1. 2.7.1 Hadoop
      2. 2.7.2 MapReduce
      3. 2.7.3 Hadoop Affiliate Technologies
      4. 2.7.4 Massively Parallel Processing
      5. 2.7.5 NoSQL
      6. 2.7.6 Hadoop Hybrids
    8. 2.8 An Overview of Popular Vendors
      1. 2.8.1 Hadoop Distributions
      2. 2.8.2 Hadoop in the Cloud
      3. 2.8.3 HDFS-Alternative Products
      4. 2.8.4 NoSQL
      5. 2.8.5 MPP Products
      6. 2.8.6 Hybrids
      7. 2.8.7 Data Integration, Visualization, Analytics
      8. 2.8.8 Business Intelligence (BI)
    9. Summary
    10. Multiple-choice Questions (1 Mark Questions)
    11. Short-answer Type Questions (5 Marks Questions)
    12. Long-answer Type Questions (10 Marks Questions)
  16. Chapter 3 Introducing Hadoop
    1. 3.1 Introduction
    2. 3.2 An Overview of Hadoop
    3. 3.3 Configuring a Hadoop Cluster (1/2)
    4. 3.3 Configuring a Hadoop Cluster (2/2)
    5. 3.4 Storing Data with HDFS
      1. 3.4.1 The NameNode and DataNodes
      2. 3.4.2 Storing and Reading Files from HDFS
      3. 3.4.3 Fault Tolerance with Replication
      4. 3.4.4 NameNode Failure Management
    6. 3.5 HDFS Technical Commands
    7. 3.6 Hadoop Distributions
    8. 3.7 Hadoop in the Cloud
    9. Summary
    10. Multiple-choice Questions (1 Mark Questions)
    11. Short-answer Type Questions (5 Marks Questions)
    12. Long-answer Type Questions (10 Marks Questions)
  17. Chapter 4 Introducing MapReduce
    1. 4.1 Introduction
    2. 4.2 Processing Data with MapReduce
      1. 4.2.1 A MapReduce Example
      2. 4.2.2 Technical Flow of a MapReduce Job
      3. 4.2.3 End-to-End Technical Anatomy of a MapReduce Job
    3. 4.3 Parallelism in Map and Reduce Phases
      1. 4.3.1 Using a Single Reducer
      2. 4.3.2 Using Multiple Reducers
    4. 4.4 Optimize the Map Phase Using a Combiner
      1. 4.4.1 Reducers as Combiners
    5. 4.5 What is YARN?
      1. 4.5.1 Scheduling and Managing Tasks
      2. 4.5.2 Job Execution in the Hadoop Cluster
      3. 4.5.3 Troubleshoot a MapReduce Job in Hadoop Cluster
    6. 4.6 Example Use Case on MapReduce: Development and Execution Step-by-step (1/2)
    7. 4.6 Example Use Case on MapReduce: Development and Execution Step-by-step (2/2)
    8. Summary
    9. Multiple-choice Questions (1 Mark Questions)
    10. Short-answer Type Questions (5 Marks Questions)
    11. Long-answer Type Questions (10 Marks Questions)
  18. Chapter 5 Introducing NoSQL
    1. 5.1 Introduction
    2. 5.2 NoSQL Databases in the Light of CAP Theorem
    3. 5.3 NoSQL Product Categories
      1. 5.3.1 Key-value Stores
      2. 5.3.2 Wide Column Stores or Columnar Stores
      3. 5.3.3 Document Stores
      4. 5.3.4 Graph Databases
    4. 5.4 NoSQL Database: Cassandra
      1. 5.4.1 Characteristics of Cassandra
      2. 5.4.2 Cassandra Architecture
      3. 5.4.3 Components of Cassandra
      4. 5.4.4 Cassandra Write Operations at a Node Level
      5. 5.4.5 Cassandra Node Level Read Operation
      6. 5.4.6 KEYSPACE in Cassandra
      7. 5.4.7 Starting Cassandra Server and Cqlsh Query Editor
      8. 5.4.8 DataStax Distribution Package
    5. 5.5 NoSQL Databases in the Cloud
    6. 5.6 NoSQL – Do’s and Don’ts
    7. 5.7 Business Intelligence and NoSQL
    8. 5.8 Big Data and NoSQL
    9. Summary
    10. Multiple-choice Questions (1 Mark Questions)
    11. Short-answer Type Questions (5 Marks Questions)
    12. Long-answer Type Questions (10 Marks Questions)
  19. Chapter 6 Introducing Spark and Kafka
    1. 6.1 Introducing Spark
      1. 6.1.1 Hadoop and Spark
      2. 6.1.2 Spark Programming Languages
      3. 6.1.3 Understanding Spark Architecture
      4. 6.1.4 Spark Libraries: Spark SQL
      5. 6.1.5 Spark Libraries: Streaming
      6. 6.1.6 Spark Libraries: Machine Learning
      7. 6.1.7 Spark Libraries: GraphX
      8. 6.1.8 PySpark: Spark with Python
    2. 6.2 Working with Kafka
      1. 6.2.1 What is Apache Kafka
      2. 6.2.2 Kafka Architecture
      3. 6.2.3 Need of Apache Kafka in Big Data
      4. 6.2.4 Kafka Use Cases
      5. 6.2.5 Why is Kafka so Fast?
      6. 6.2.6 Kafka Needs ZooKeeper
      7. 6.2.7 Different Components in Kafka
      8. 6.2.8 Difference between Apache Kafka and Apache Flume
      9. 6.2.9 Kafka Demonstration—How Messages are Passing from Publisher to Consumer through a Topic
    3. Summary
    4. Multiple-choice Questions (1 Mark Questions)
    5. Short-answer Type Questions (5 Marks Questions)
    6. Long-answer Type Questions (10 Marks Questions)
  20. Chapter 7 Other Big Data Tools and Technologies
    1. 7.1 Introduction
    2. 7.2 Hive
      1. 7.2.1 Hive Architecture
      2. 7.2.2 Data Flow in Hive
      3. 7.2.3 Data Types in Hive
      4. 7.2.4 Different Types of Tables in Hive (1/2)
      5. 7.2.4 Different Types of Tables in Hive (2/2)
      6. 7.2.5 Partitioning and Bucketing in Hive
    3. 7.3 Pig
      1. 7.3.1 Why Apache Pig
      2. 7.3.2 Features of Apache Pig
      3. 7.3.3 Apache Pig vs. MapReduce
      4. 7.3.4 Pig Architecture
    4. 7.4 Sqoop and Flume
      1. 7.4.1 Sqoop EXPORT (Data Transfer from HDFS to MySQL)
      2. 7.4.2 Sqoop IMPORT (Importing Fresh Table from MySQL to HIVE)
      3. 7.4.3 Flume
      4. 7.4.4 Components of Flume
      5. 7.4.5 Configure Flume to Ingest Web Log Data from a Local Directory to HDFS
    5. 7.5 Oozie
      1. 7.5.1 Oozie Workflow
    6. 7.6 Lucene and Solr
      1. 7.6.1 Lucene in Search Applications
      2. 7.6.2 Features of Apache Solr
      3. 7.6.3 Apache Solr—Basic Commands
    7. 7.7 ZooKeeper
    8. 7.8 Apache NiFi
      1. 7.8.1 What Apache NiFi Does
    9. Summary
    10. Multiple-choice Questions (1 Mark Questions)
    11. Short-answer Type Questions (5 Marks Questions)
    12. Long-answer Type Questions (10 Marks Questions)
  21. Chapter 8 Working with Big Data in R
    1. 8.1 Prerequisites
      1. 8.1.1 Install R in Your System
      2. 8.1.2 Know How to Manage R Scripts
      3. 8.1.3 Introduction to Basic R Commands (1/3)
      4. 8.1.3 Introduction to Basic R Commands (2/3)
      5. 8.1.3 Introduction to Basic R Commands (3/3)
    2. 8.2 Exploratory Data Analysis
      1. 8.2.1 Basic Statistical Techniques for Data Exploration
      2. 8.2.2 Basic Plots for Data Exploration
    3. 8.3 R Libraries for Dealing with Large Data Sets
      1. 8.3.1 ff and ffbase Packages
      2. 8.3.2 Parallel Package
      3. 8.3.3 data.table Package
    4. 8.4 Integrating Hadoop with R
    5. 8.5 Simple R Program with Hadoop
    6. Summary
    7. Multiple-choice Questions (1 Mark Questions)
    8. Short-answer Type Questions (5 Marks Questions)
    9. Long-answer Type Questions (10 Marks Questions)
  22. Chapter 9 Working with Big Data in Python
    1. 9.1 Prerequisites
      1. 9.1.1 Install Python in Your System
      2. 9.1.2 Know How to Manage Python Scripts
      3. 9.1.3 Introduction to Basic Python Commands
    2. 9.2 Basic Libraries in Python
      1. 9.2.1 NumPy Library (1/2)
      2. 9.2.1 NumPy Library (2/2)
      3. 9.2.2 Pandas Library
      4. 9.2.3 Matplotlib Library
    3. 9.3 Python Libraries for Dealing with Large Data Sets
      1. 9.3.1 numpy.memmap Object
      2. 9.3.2 Parallel Computing Using mpi4py Library
    4. 9.4 Python-MapReduce Using Hadoop Streaming
      1. 9.4.1 What is Hadoop Streaming?
      2. 9.4.2 Python MapReduce Code
      3. 9.4.3 Step by Step Execution
      4. 9.4.4 Running the MapReduce Python Code on Hadoop
    5. Summary
    6. Multiple-choice Questions (1 Mark Questions)
    7. Short-answer Type Questions (5 Marks Questions)
    8. Long-answer Type Questions (10 Marks Questions)
  23. Chapter 10 Big Data Applied
    1. 10.1 Introduction
    2. 10.2 Big Data and Data Science
      1. 10.2.1 What is Data Science?
      2. 10.2.2 Who is a Data Scientist?
      3. 10.2.3 How Do We Define ‘Data Science’?
      4. 10.2.4 Common Pitfalls of Data Science
    3. 10.3 Big Data and IoT
      1. 10.3.1 What is IoT?
      2. 10.3.2 Overview of IoT Architecture
      3. 10.3.3 IoT in Action
      4. 10.3.4 Impacts of IoT
      5. 10.3.5 Applications of Big Data and IoT
    4. 10.4 Big Data and Recommendation Engines
      1. 10.4.1 What is a Recommendation?
      2. 10.4.2 What are Recommendation Engines?
      3. 10.4.3 What are the Types of Recommendation Engines?
      4. 10.4.4 How is Big Data Used in a Recommendation Engine?
    5. Summary
    6. Multiple-choice Questions (1 Mark Questions)
    7. Short-answer Type Questions (5 Marks Questions)
    8. Long-answer Type Questions (10 Marks Questions)
  24. Chapter 11 Big Data Strategy
    1. 11.1 Introduction
    2. 11.2 Two Typical Big Data Use Cases
      1. 11.2.1 Big Data Primarily for Cost Reduction
      2. 11.2.2 Big Data Primarily for Enhanced Value
    3. 11.3 Data Warehouses vs. Data Lakes—What is Your Strategy?
      1. 11.3.1 Differences between Data Warehouse and Data Lake
    4. 11.4 Key Questions to Ask
    5. 11.5 Getting Ready for a Big Data Program
    6. 11.6 Making Technology Choices
    7. 11.7 Making Tooling Choices
    8. Summary
    9. Short-answer Type Questions (5 Marks Questions)
    10. Long-answer Type Questions (10 Marks Questions)
  25. Chapter 12 Case Study: Retail Near Real-time Analytics
    1. 12.1 Introduction to Retail Domain
      1. 12.1.1 What is Retail in the First Place?
      2. 12.1.2 So, Why is Retailing So Important?
    2. 12.2 Near Real-time Analytics: Problem Statement
    3. 12.3 NRT Analytics: Solution Approach
    4. 12.4 NRT Analytics: Details of Solution Implemented (1/3)
    5. 12.4 NRT Analytics: Details of Solution Implemented (2/3)
    6. 12.4 NRT Analytics: Details of Solution Implemented (3/3)
      1. 12.4.1 Data from Producer
      2. 12.4.2 Output After Running Analysis Using Spark
      3. 12.4.3 Data Saved in Cassandra
      4. 12.4.4 Kafka Producer Streamed in Batch Mode After Every 2 Minutes
      5. 12.4.5 Data Streamed After 2 Minutes Containing the New Data
      6. 12.4.6 New Data Got Entered in Cassandra
    7. Summary
    8. Multiple-choice Questions (1 Mark Questions)
    9. Short-answer Type Questions (5 Marks Questions)
  26. Appendix (1/2)
  27. Appendix (2/2)
  28. Index

Product information

  • Title: Big Data Simplified
  • Author(s): Sayan Goswami, Amit Kumar Das, Sourabh Mukherjee
  • Release date: June 2019
  • Publisher(s): Pearson Education India