Hadoop Blueprints

Name: Hadoop Blueprints
ISBN: 9781783980307

by Sudheesh Narayan, Tanmay Deshpande, Anurag Shrivastava

September 2016

Intermediate to advanced

316 pages

6h 43m

English

Packt Publishing

Credits
About the Authors
About the Reviewers
www.PacktPub.com
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions

Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Hadoop and Big Data
The beginning of the big data problem
Limitations of RDBMS systems
Scaling out a database on Google
Parallel processing of large datasets
Building open source Hadoop
Enterprise Hadoop
Social media and mobile channels
Data storage cost reduction
Enterprise software vendors
Pure Play Hadoop vendors
Cloud Hadoop vendors
The design of the Hadoop system
The Hadoop Distributed File System (HDFS)
Data organization in HDFS
HDFS file management commands
NameNode and DataNodes
Metadata store in NameNode
Preventing a single point of failure with Hadoop HA
Checkpointing process
Data Store on a DataNode
Handshakes and heartbeats
MapReduce
The execution model of MapReduce Version 1
Apache YARN
Building a MapReduce Version 2 program
Problem statement
Solution workflow
Getting the dataset
Studying the dataset
Cleaning the dataset
Loading the dataset on the HDFS
Starting with a MapReduce program
Installing Eclipse
Creating a project in Eclipse
Coding and building a MapReduce program
Run the MapReduce program locally
Examine the result
Run the MapReduce program on Hadoop
Further processing of results
Hadoop platform tools
Data ingestion tools
Data access tools
Monitoring tools
Data governance tools
Big data use cases
Creating a 360 degree view of a customer
Fraud detection systems for banks
Marketing campaign planning
Churn detection in telecom
Analyzing sensor data
Building a data lake
The architecture of Hadoop-based systems
Lambda architecture
Summary
2. A 360-Degree View of the Customer
Capturing business information
Collecting data from data sources
Creating a data processing approach
Presenting the results
Setting up the technology stack
Tools used
Installing Hortonworks Sandbox
Creating user accounts
Exploring HUE
Exploring MYSQL and the HIVE command line
Exploring Sqoop at the command line
Test driving Hive and Sqoop
Querying data using Hive
Importing data in Hive using Sqoop
Engineering the solution
Datasets
Loading customer master data into Hadoop
Loading web logs into Hadoop
Loading tweets into Hadoop
Creating the 360-degree view
Exporting data from Hadoop
Presenting the view
Building a web application
Installing Node.js
Coding the web application in Node.js
Summary
3. Building a Fraud Detection System
Understanding the business problem
Selecting and cleansing the dataset
Finding relevant fields
Machine learning for fraud detection
Clustering as an unsupervised machine learning method
Designing the high-level architecture
Introducing Apache Spark
Apache Spark architecture
Resilient Distributed Datasets
Transformation functions
Actions
Test driving Apache Spark
Calculating the yearly average stock prices using Spark
Apache Spark 2.X
Understanding MLib
Test driving K-means using MLib
Creating our fraud detection model
Building our K-means clustering model
Processing the data
Putting the fraud detection model to use
Generating a data stream
Processing the data stream using Spark streaming
Putting the model to use
Scaling the solution
Summary
4. Marketing Campaign Planning
Creating the solution outline
Supervised learning
Tree-structure models for classification
Finding the right dataset
Setting up the solution architecture
Coupon scan at POS
Join and transform
Train the classification model
Scoring
Mail merge
Building the machine learning model
Introducing BigML
Model building steps
Sign up as a user on BigML site
Upload the data file
Creating the dataset
Building the classification model
Downloading the classification model
Running the Model on Hadoop
Creating the target List
Post campaign activities
Summary
5. Churn Detection
A business case for churn detection
Creating the solution outline
Building a predictive model using Hadoop
Bayes' Theorem
Playing with the Bayesian predictor
Running a Node.js-based Bayesian predictor
Understanding the predictor code
Limitations of our solution
Building a churn predictor using Hadoop
Synthetic data generation tools
Preparing a synthetic historical churn dataset
The processing approach
Running the MapReduce program
Understanding the frequency counter code
Putting the model to use
Integrating the churn predictor
Summary
6. Analyze Sensor Data Using Hadoop
A business case for sensor data analytics
Creating the solution outline
Technology stack
Kafka
Flume
HDFS
Hive
Open TSDB
HBase
Grafana
Batch data analytics
Loading streams of sensor data from Kafka topics to HDFS
Using Hive to perform analytics on inserted data
Data visualization in MS Excel
Stream data analytics
Loading streams of sensor data
Data visualization using Grafana
Summary
7. Building a Data Lake
Data lake building blocks
Ingestion tier
Storage tier
Insights tier
Ops facilities
Limitation of open source Hadoop ecosystem tools
Hadoop security
HDFS permissions model
Fine-grained permissions with HDFS ACLs
Apache Ranger
Installing Apache Ranger
Test driving Apache Ranger
Define services and access policies
Examine the audit logs
Viewing users and groups in Ranger
Data Lake security with Apache Ranger
Apache Flume
Understanding the Design of Flume
Installing Apache Flume
Running Apache Flume
Apache Zeppelin
Installation of Apache Zeppelin
Test driving Zeppelin
Exploring data visualization features of Zeppelin
Define the gold price movement table in Hive
Load gold price history in the Table
Run a select query
Plot price change per month
Running the paragraph
Zeppelin in Data Lake
Technology stack for Data Lake
Data Lake business requirements
Understanding the business requirements
Understanding the IT systems and security
Designing the data pipeline
Building the data pipeline
Setting up the access control
Synchronizing the users and groups in Ranger
Setting up data access policies in Ranger
Restricting the access in Zeppelin
Testing our data pipeline
Scheduling the data loading
Refining the business requirements
Implementing the new requirements
Loading the stock holding data in Data Lake
Restricting the access to stock holding data in Data Lake
Testing the Loaded Data with Zeppelin
Adding stock feed in the Data Lake
Fetching data from Yahoo Service
Configuring Flume
Running Flume as Stock Feeder to Data Lake
Transforming the data in Data Lake
Growing Data Lake
Summary
8. Future Directions
Hadoop solutions team
The role of the data engineer
Data science for non-experts
From the data science model to business value
Hadoop on Cloud
Deploying Hadoop on cloud servers
Using Hadoop as a service
NoSQL databases
Types of NoSQL databases
Common observations about NoSQL databases
In-memory databases
Apache Ignite as an in-memory database
Apache Ignite as a Hadoop accelerator
Apache Spark versus Apache Ignite
Summary

Content preview from Hadoop Blueprints

Creating the target List

Our MapReduce program is now ready to run on the Hadoop cluster. Next, we prepare its input data from Furnitica's customer master database. The customer master data contains many details that are unlikely to be relevant to our MapReduce job.

A subset of fields available in the master data is as follows:

Customer ID
Date of birth
Income
Gender

Let us assume that we select only the customers living in the city where we are going to send the campaign folders. This city is the target of the campaign. A single row of our selection is shown in Table 3:

Customer ID	10023
Age (derived from date of birth)	55
Income	75000
Gender (derived from M/F, where 0 is male and 1 is female)	0

Table 3 A selection ...
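
The preparation described above (filter the master data by target city, derive the age from the date of birth, and encode the gender as 0 or 1) can be sketched in a few lines of Python. This is only an illustrative sketch, not the book's code: the record keys (`customer_id`, `date_of_birth`, `income`, `gender`, `city`), the sample date of birth, the city name, and the reference date are all assumptions made for the example.

```python
from datetime import date

# Encoding from Table 3: 0 is male, 1 is female.
GENDER_CODES = {"M": 0, "F": 1}


def age_on(date_of_birth: date, reference: date) -> int:
    """Return the age in whole years on the reference date."""
    years = reference.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (reference.month, reference.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


def to_target_row(record: dict, reference: date) -> list:
    """Reduce one master record to the four fields kept for the job input."""
    dob = date.fromisoformat(record["date_of_birth"])
    return [
        record["customer_id"],
        age_on(dob, reference),
        record["income"],
        GENDER_CODES[record["gender"]],
    ]


def build_target_list(records, city: str, reference: date) -> list:
    """Keep only customers in the target city and convert them to input rows."""
    return [to_target_row(r, reference) for r in records if r["city"] == city]


if __name__ == "__main__":
    # Hypothetical master records; only the ID, income, and gender of the
    # first one come from Table 3. Everything else is made up.
    master = [
        {"customer_id": 10023, "date_of_birth": "1961-03-14",
         "income": 75000, "gender": "M", "city": "Exampleville"},
        {"customer_id": 10024, "date_of_birth": "1980-11-02",
         "income": 52000, "gender": "F", "city": "Otherton"},
    ]
    for row in build_target_list(master, "Exampleville", date(2016, 9, 1)):
        print("\t".join(str(value) for value in row))
```

With these assumed inputs, the first record yields the same values as the row in Table 3 (10023, 55, 75000, 0), and the second record is dropped because it lies outside the target city. Keeping the reference date as an explicit parameter makes the derived age reproducible, which is useful if the selection is rerun later.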

