
Hadoop Essentials

Name: Hadoop Essentials
Author: Shiva Achari
ISBN: 9781784396688

by Shiva Achari

April 2015

Beginner to intermediate

194 pages

4h 18m

English

Packt Publishing


Table of Contents
Hadoop Essentials
Credits
About the Author
Acknowledgments
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book

Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Introduction to Big Data and Hadoop
V's of big data
Volume
Velocity
Variety
Understanding big data
NoSQL
Types of NoSQL databases
Analytical database
Who is creating big data?
Big data use cases
Big data use case patterns
Big data as a storage pattern
Big data as a data transformation pattern
Big data for a data analysis pattern
Big data for data in a real-time pattern
Big data for a low latency caching pattern
Hadoop
Hadoop history
Description
Advantages of Hadoop
Uses of Hadoop
Hadoop ecosystem
Apache Hadoop
Hadoop distributions
Pillars of Hadoop
Data access components
Data storage component
Data ingestion in Hadoop
Streaming and real-time analysis
Summary
2. Hadoop Ecosystem
Traditional systems
Database trend
The Hadoop use cases
Hadoop's basic data flow
Hadoop integration
The Hadoop ecosystem
Distributed filesystem
HDFS
Distributed programming
NoSQL databases
Apache HBase
Data ingestion
Service programming
Apache YARN
Apache Zookeeper
Scheduling
Data analytics and machine learning
System management
Apache Ambari
Summary
3. Pillars of Hadoop – HDFS, MapReduce, and YARN
HDFS
Features of HDFS
HDFS architecture
NameNode
DataNode
Checkpoint NameNode or Secondary NameNode
BackupNode
Data storage in HDFS
Read pipeline
Write pipeline
Rack awareness
Advantages of rack awareness in HDFS
HDFS federation
Limitations of HDFS 1.0
The benefit of HDFS federation
HDFS ports
HDFS commands
MapReduce
The MapReduce architecture
JobTracker
TaskTracker
Serialization data types
The Writable interface
WritableComparable interface
The MapReduce example
The MapReduce process
Mapper
Shuffle and sorting
Reducer
Speculative execution
FileFormats
InputFormats
RecordReader
OutputFormats
RecordWriter
Writing a MapReduce program
Mapper code
Reducer code
Driver code
Auxiliary steps
Combiner
Partitioner
Custom partitioner
YARN
YARN architecture
ResourceManager
NodeManager
ApplicationMaster
Applications powered by YARN
Summary
4. Data Access Components – Hive and Pig
Need of a data processing tool on Hadoop
Pig
Pig data types
The Pig architecture
The logical plan
The physical plan
The MapReduce plan
Pig modes
Grunt shell
Input data
Loading data
Dump
Store
FOREACH generate
Filter
Group By
Limit
Aggregation
Cogroup
DESCRIBE
EXPLAIN
ILLUSTRATE
Hive
The Hive architecture
Metastore
The Query compiler
The Execution engine
Data types and schemas
Installing Hive
Starting Hive shell
HiveQL
DDL (Data Definition Language) operations
DML (Data Manipulation Language) operations
The SQL operation
Joins
Aggregations
Built-in functions
Custom UDF (User Defined Functions)
Managing tables – external versus managed
SerDe
Partitioning
Bucketing
Summary
5. Storage Component – HBase
An Overview of HBase
Advantages of HBase
The Architecture of HBase
MasterServer
RegionServer
WAL
BlockCache
LRUBlockCache
SlabCache
BucketCache
Regions
MemStore
Zookeeper
The HBase data model
Logical components of a data model
ACID properties
The CAP theorem
The Schema design
The Write pipeline
The Read pipeline
Compaction
The Compaction policy
Minor compaction
Major compaction
Splitting
Pre-Splitting
Auto Splitting
Forced Splitting
Commands
help
Create
List
Put
Scan
Get
Disable
Drop
HBase Hive integration
Performance tuning
Compression
Filters
Counters
HBase coprocessors
Summary
6. Data Ingestion in Hadoop – Sqoop and Flume
Data sources
Challenges in data ingestion
Sqoop
Connectors and drivers
Sqoop 1 architecture
Limitation of Sqoop 1
Sqoop 2 architecture
Imports
Exports
Apache Flume
Reliability
Flume architecture
Multitier topology
Flume master
Flume nodes
Components in Agent
Source
Sink
Channels
Memory channel
File Channel
JDBC Channel
Examples of configuring Flume
The Single agent example
Multiple flows in an agent
Configuring a multiagent setup
Summary
7. Streaming and Real-time Analysis – Storm and Spark
An introduction to Storm
Features of Storm
Physical architecture of Storm
Data architecture of Storm
Storm topology
Storm on YARN
Topology configuration example
Spouts
Bolts
Topology
An introduction to Spark
Features of Spark
Spark framework
Spark SQL
GraphX
MLib
Spark streaming
Spark architecture
Directed Acyclic Graph engine
Resilient Distributed Dataset
Physical architecture
Operations in Spark
Transformations
Actions
Spark example
Summary
Index

Content preview from Hadoop Essentials

Distributed filesystem

Hadoop stores data in a distributed computing environment, so files are scattered across the cluster. An efficient filesystem is needed to manage these files. The filesystem used in Hadoop is HDFS, short for Hadoop Distributed File System.

HDFS

HDFS is highly scalable and fault tolerant. It is designed to support efficient parallel processing in a distributed environment, even on commodity hardware. HDFS runs daemon processes that manage the data: NameNode, DataNode, BackupNode, and Checkpoint NameNode.

We will discuss HDFS in detail in the next chapter.
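
To make the idea of files scattered across the cluster more concrete, here is a toy Python model, not real HDFS code. It assumes details that this preview does not state: that a file is split into fixed-size blocks, that each block is copied to several DataNodes, and that a NameNode-like component records where each copy lives. The 128 MB block size, the replication factor of 3, the round-robin placement, and all function names are illustrative assumptions. Real HDFS placement is rack-aware and more involved.

```python
# Toy model only: the values and names below are assumptions for illustration.
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size (128 MB)
REPLICATION = 3                 # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes would occupy."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(num_blocks, datanodes, replication=REPLICATION):
    """Map each block index to the DataNodes that hold a replica of it.

    This plays the role of the metadata a NameNode keeps. Placement here is
    plain round-robin, which is a simplification.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


def readable_after_failure(placement, failed_node):
    """A file stays readable if every block has a replica on a surviving node."""
    return all(
        any(node != failed_node for node in nodes)
        for nodes in placement.values()
    )


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]            # hypothetical DataNodes
    blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file
    layout = place_blocks(len(blocks), nodes)
    for block, holders in layout.items():
        print(f"block {block} ({blocks[block]} bytes): {holders}")
    print("readable if dn2 fails:", readable_after_failure(layout, "dn2"))
```

In this sketch a 300 MB file becomes three blocks, and losing any single node still leaves at least one copy of every block. That is the intuition behind the fault tolerance on commodity hardware described above.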



