
Hadoop: The Definitive Guide, 3rd Edition

Name: Hadoop: The Definitive Guide, 3rd Edition
Author: Tom White
ISBN: 9781449311520

by Tom White

May 2012

Intermediate to advanced

682 pages

22h 19m

English

O'Reilly Media, Inc.


Hadoop: The Definitive Guide
Dedication
Foreword
Preface
Administrative Notes; What’s in This Book?; What’s New in the Second Edition?; What’s New in the Third Edition?; Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments
1. Meet Hadoop
Data!; Data Storage and Analysis; Comparison with Other Systems; Relational Database Management System; Grid Computing; Volunteer Computing; A Brief History of Hadoop; Apache Hadoop and the Hadoop Ecosystem; Hadoop Releases; What’s Covered in This Book; Configuration names; MapReduce APIs; Compatibility
2. MapReduce
A Weather Dataset; Data Format; Analyzing the Data with Unix Tools; Analyzing the Data with Hadoop; Map and Reduce; Java MapReduce; A test run; The old and the new Java MapReduce APIs; Scaling Out; Data Flow; Combiner Functions; Specifying a combiner function; Running a Distributed MapReduce Job; Hadoop Streaming; Ruby; Python; Hadoop Pipes; Compiling and Running
3. The Hadoop Distributed Filesystem
The Design of HDFS; HDFS Concepts; Blocks; Namenodes and Datanodes; HDFS Federation; HDFS High-Availability; Failover and fencing; The Command-Line Interface; Basic Filesystem Operations; Hadoop Filesystems; Interfaces; HTTP; C; FUSE; The Java Interface; Reading Data from a Hadoop URL; Reading Data Using the FileSystem API; FSDataInputStream; Writing Data; FSDataOutputStream; Directories; Querying the Filesystem; File metadata: FileStatus; Listing files; File patterns; PathFilter; Deleting Data; Data Flow; Anatomy of a File Read; Anatomy of a File Write; Coherency Model; Consequences for application design; Data Ingest with Flume and Sqoop; Parallel Copying with distcp; Keeping an HDFS Cluster Balanced; Hadoop Archives; Using Hadoop Archives; Limitations
4. Hadoop I/O
Data Integrity; Data Integrity in HDFS; LocalFileSystem; ChecksumFileSystem; Compression; Codecs; Compressing and decompressing streams with CompressionCodec; Inferring CompressionCodecs using CompressionCodecFactory; Native libraries; CodecPool; Compression and Input Splits; Using Compression in MapReduce; Compressing map output; Serialization; The Writable Interface; WritableComparable and comparators; Writable Classes; Writable wrappers for Java primitives; Text; Indexing; Unicode; Iteration; Mutability; Resorting to String; BytesWritable; NullWritable; ObjectWritable and GenericWritable; Writable collections; Implementing a Custom Writable; Implementing a RawComparator for speed; Custom comparators; Serialization Frameworks; Serialization IDL; Avro; Avro Data Types and Schemas; In-Memory Serialization and Deserialization; The specific API; Avro Datafiles; Interoperability; Python API; C API; Schema Resolution; Sort Order; Avro MapReduce; Sorting Using Avro MapReduce; Avro MapReduce in Other Languages; File-Based Data Structures; SequenceFile; Writing a SequenceFile; Reading a SequenceFile; Displaying a SequenceFile with the command-line interface; Sorting and merging SequenceFiles; The SequenceFile format; MapFile; Writing a MapFile; Reading a MapFile; MapFile variants; Converting a SequenceFile to a MapFile
5. Developing a MapReduce Application
The Configuration API; Combining Resources; Variable Expansion; Setting Up the Development Environment; Managing Configuration; GenericOptionsParser, Tool, and ToolRunner; Writing a Unit Test with MRUnit; Mapper; Reducer; Running Locally on Test Data; Running a Job in a Local Job Runner; Fixing the mapper; Testing the Driver; Running on a Cluster; Packaging a Job; The client classpath; The task classpath; Packaging dependencies; Task classpath precedence; Launching a Job; The MapReduce Web UI; The jobtracker page; The job page; Retrieving the Results; Debugging a Job; The tasks page; The task details page; Handling malformed data; Hadoop Logs; Remote Debugging; Tuning a Job; Profiling Tasks; The HPROF profiler; Other profilers; MapReduce Workflows; Decomposing a Problem into MapReduce Jobs; JobControl; Apache Oozie; Defining an Oozie workflow; Packaging and deploying an Oozie workflow application; Running an Oozie workflow job
6. How MapReduce Works
Anatomy of a MapReduce Job Run; Classic MapReduce (MapReduce 1); Job submission; Job initialization; Task assignment; Task execution; Streaming and pipes; Progress and status updates; Job completion; YARN (MapReduce 2); Job submission; Job initialization; Task assignment; Task execution; Progress and status updates; Job completion; Failures; Failures in Classic MapReduce; Task failure; Tasktracker failure; Jobtracker failure; Failures in YARN; Task failure; Application master failure; Node manager failure; Resource manager failure; Job Scheduling; The Fair Scheduler; The Capacity Scheduler; Shuffle and Sort; The Map Side; The Reduce Side; Configuration Tuning; Task Execution; The Task Execution Environment; Streaming environment variables; Speculative Execution; Output Committers; Task side-effect files; Task JVM Reuse; Skipping Bad Records

7. MapReduce Types and Formats
MapReduce Types; The Default MapReduce Job; The default Streaming job; Keys and values in Streaming; Input Formats; Input Splits and Records; FileInputFormat; FileInputFormat input paths; FileInputFormat input splits; Small files and CombineFileInputFormat; Preventing splitting; File information in the mapper; Processing a whole file as a record; Text Input; TextInputFormat; KeyValueTextInputFormat; NLineInputFormat; XML; Binary Input; SequenceFileInputFormat; SequenceFileAsTextInputFormat; SequenceFileAsBinaryInputFormat; Multiple Inputs; Database Input (and Output); Output Formats; Text Output; Binary Output; SequenceFileOutputFormat; SequenceFileAsBinaryOutputFormat; MapFileOutputFormat; Multiple Outputs; An example: Partitioning data; MultipleOutputs; Lazy Output; Database Output
8. MapReduce Features
Counters; Built-in Counters; Task counters; Job counters; User-Defined Java Counters; Dynamic counters; Readable counter names; Retrieving counters; Using the new MapReduce API; User-Defined Streaming Counters; Sorting; Preparation; Partial Sort; An application: Partitioned MapFile lookups; Total Sort; Secondary Sort; Java code; Streaming; Joins; Map-Side Joins; Reduce-Side Joins; Side Data Distribution; Using the Job Configuration; Distributed Cache; Usage; How it works; The distributed cache API; MapReduce Library Classes
9. Setting Up a Hadoop Cluster
Cluster Specification; Network Topology; Rack awareness; Cluster Setup and Installation; Installing Java; Creating a Hadoop User; Installing Hadoop; Testing the Installation; SSH Configuration; Hadoop Configuration; Configuration Management; Control scripts; Master node scenarios; Environment Settings; Memory; Java; System logfiles; SSH settings; Important Hadoop Daemon Properties; HDFS; MapReduce; Hadoop Daemon Addresses and Ports; Other Hadoop Properties; Cluster membership; Buffer size; HDFS block size; Reserved storage space; Trash; Job scheduler; Reduce slow start; Task memory limits; User Account Creation; YARN Configuration; Important YARN Daemon Properties; Memory; YARN Daemon Addresses and Ports; Security; Kerberos and Hadoop; An example; Delegation Tokens; Other Security Enhancements; Benchmarking a Hadoop Cluster; Hadoop Benchmarks; Benchmarking HDFS with TestDFSIO; Benchmarking MapReduce with Sort; Other benchmarks; User Jobs; Hadoop in the Cloud; Apache Whirr; Setup; Launching a cluster; Configuration; Running a proxy; Running a MapReduce job; Shutting down a cluster
10. Administering Hadoop
HDFS; Persistent Data Structures; Namenode directory structure; The filesystem image and edit log; Secondary namenode directory structure; Datanode directory structure; Safe Mode; Entering and leaving safe mode; Audit Logging; Tools; dfsadmin; Filesystem check (fsck); Finding the blocks for a file; Datanode block scanner; Balancer; Monitoring; Logging; Setting log levels; Getting stack traces; Metrics; FileContext; GangliaContext; NullContextWithUpdateThread; CompositeContext; Java Management Extensions; Maintenance; Routine Administration Procedures; Metadata backups; Data backups; Filesystem check (fsck); Filesystem balancer; Commissioning and Decommissioning Nodes; Commissioning new nodes; Decommissioning old nodes; Upgrades; HDFS data and metadata upgrades; Start the upgrade; Wait until the upgrade is complete; Check the upgrade; Roll back the upgrade (optional); Finalize the upgrade (optional)
11. Pig
Installing and Running Pig; Execution Types; Local mode; MapReduce mode; Running Pig Programs; Grunt; Pig Latin Editors; An Example; Generating Examples; Comparison with Databases; Pig Latin; Structure; Statements; Expressions; Types; Schemas; Validation and nulls; Schema merging; Functions; Macros; User-Defined Functions; A Filter UDF; Leveraging types; An Eval UDF; Dynamic invokers; A Load UDF; Using a schema; Data Processing Operators; Loading and Storing Data; Filtering Data; FOREACH...GENERATE; STREAM; Grouping and Joining Data; JOIN; COGROUP; CROSS; GROUP; Sorting Data; Combining and Splitting Data; Pig in Practice; Parallelism; Parameter Substitution; Dynamic parameters; Parameter substitution processing
12. Hive
Installing Hive; The Hive Shell; An Example; Running Hive; Configuring Hive; Logging; Hive Services; Hive clients; The Metastore; Comparison with Traditional Databases; Schema on Read Versus Schema on Write; Updates, Transactions, and Indexes; HiveQL; Data Types; Primitive types; Complex types; Operators and Functions; Conversions; Tables; Managed Tables and External Tables; Partitions and Buckets; Partitions; Buckets; Storage Formats; The default storage format: Delimited text; Binary storage formats: Sequence files, Avro datafiles and RCFiles; An example: RegexSerDe; Importing Data; Inserts; Multitable insert; CREATE TABLE...AS SELECT; Altering Tables; Dropping Tables; Querying Data; Sorting and Aggregating; MapReduce Scripts; Joins; Inner joins; Outer joins; Semi joins; Map joins; Subqueries; Views; User-Defined Functions; Writing a UDF; Writing a UDAF; A more complex UDAF
13. HBase
HBasics; Backdrop; Concepts; Whirlwind Tour of the Data Model; Regions; Locking; Implementation; HBase in operation; Installation; Test Drive; Clients; Java; MapReduce; Avro, REST, and Thrift; REST; Thrift; Avro; Example; Schemas; Loading Data; Optimization notes; Web Queries; HBase Versus RDBMS; Successful Service; HBase; Use Case: HBase at Streamy.com; Very large items tables; Very large sort merges; Life with HBase; Praxis; Versions; HDFS; UI; Metrics; Schema Design; Joins; Row keys; Counters; Bulk Load
14. ZooKeeper
Installing and Running ZooKeeper; An Example; Group Membership in ZooKeeper; Creating the Group; Joining a Group; Listing Members in a Group; ZooKeeper command-line tools; Deleting a Group; The ZooKeeper Service; Data Model; Ephemeral znodes; Sequence numbers; Watches; Operations; Multiupdate; APIs; Watch triggers; ACLs; Implementation; Consistency; Sessions; Time; States; Building Applications with ZooKeeper; A Configuration Service; The Resilient ZooKeeper Application; InterruptedException; KeeperException; State exceptions; Recoverable exceptions; Unrecoverable exceptions; A reliable configuration service; A Lock Service; The herd effect; Recoverable exceptions; Unrecoverable exceptions; Implementation; More Distributed Data Structures and Protocols; BookKeeper and Hedwig; ZooKeeper in Production; Resilience and Performance; Configuration
15. Sqoop
Getting Sqoop; Sqoop Connectors; A Sample Import; Text and Binary File Formats; Generated Code; Additional Serialization Systems; Imports: A Deeper Look; Controlling the Import; Imports and Consistency; Direct-mode Imports; Working with Imported Data; Imported Data and Hive; Importing Large Objects; Performing an Export; Exports: A Deeper Look; Exports and Transactionality; Exports and SequenceFiles
16. Case Studies
Hadoop Usage at Last.fm; Last.fm: The Social Music Revolution; Hadoop at Last.fm; Generating Charts with Hadoop; The Track Statistics Program; Calculating the number of unique listeners; UniqueListenersMapper; UniqueListenersReducer; Summing the track totals; SumMapper; SumReducer; Merging the results; MergeListenersMapper; IdentityMapper; SumReducer; Summary; Hadoop and Hive at Facebook; Hadoop at Facebook; History; Use cases; Data architecture; Hadoop configuration; Hypothetical Use Case Studies; Advertiser insights and performance; Ad hoc analysis and product feedback; Data analysis; Hive; Data organization; Query language; Data pipelines using Hive; Problems and Future Work; Fair sharing; Space management; Scribe-HDFS integration; Improvements to Hive; Nutch Search Engine; Data Structures; CrawlDb; LinkDb; Segments; Selected Examples of Hadoop Data Processing in Nutch; Link inversion; Generation of fetchlists; Step 1: Select, sort by score, limit by URL count per host; Step 2: Invert, partition by host, sort randomly; Fetcher: A multithreaded MapRunner in action; Indexer: Using custom OutputFormat; Summary; Log Processing at Rackspace; Requirements/The Problem; Logs; Brief History; Choosing Hadoop; Collection and Storage; Log collection; Log storage; MapReduce for Logs; Processing; Phase 1: Map; Phase 1: Reduce; Phase 2: Map; Phase 2: Reduce; Merging for near-term search; Sharding; Search results; Archiving for analysis; Cascading; Fields, Tuples, and Pipes; Operations; Taps, Schemes, and Flows; Cascading in Practice; Flexibility; Hadoop and Cascading at ShareThis; Summary; TeraByte Sort on Apache Hadoop; Using Pig and Wukong to Explore Billion-edge Network Graphs; Measuring Community; Everybody’s Talkin’ at Me: The Twitter Reply Graph; Edge pairs versus adjacency list; Degree; Symmetric Links; Community Extraction; Get neighbors; Community metrics and the 1 million × 1 million problem; Local properties at global scale
A. Installing Apache Hadoop
Prerequisites; Installation; Configuration; Standalone Mode; Pseudodistributed Mode; Configuring SSH; Formatting the HDFS filesystem; Starting and stopping the daemons (MapReduce 1); Starting and stopping the daemons (MapReduce 2); Fully Distributed Mode
B. Cloudera’s Distribution Including Apache Hadoop
C. Preparing the NCDC Weather Data
Index
About the Author
Colophon
Copyright

Content preview from Hadoop: The Definitive Guide, 3rd Edition

Chapter 1. Meet Hadoop

In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.
—Grace Hopper

Data!

We live in the data age. It’s not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the “digital universe” at 0.18 zettabytes in 2006 and forecast tenfold growth to 1.8 zettabytes by 2011.^[2] A zettabyte is 10²¹ bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes. That’s roughly the same order of magnitude as one disk drive for every person in the world.
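The unit conversions in the paragraph above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the book's text: it uses decimal (SI) units, and the world population figure of about 7 billion is an assumption introduced here only to illustrate the "one disk drive per person" comparison.

```python
# Decimal (SI) storage units in bytes, as used in the text.
TB = 10**12
PB = 10**15
EB = 10**18
ZB = 10**21

# One zettabyte expressed in the smaller units named in the text.
zb_in_eb = ZB // EB   # one thousand exabytes
zb_in_pb = ZB // PB   # one million petabytes
zb_in_tb = ZB // TB   # one billion terabytes

# IDC figures from the text, kept as integers (in exabytes) to avoid float rounding.
size_2006 = 180 * EB     # 0.18 zettabytes
size_2011 = 1800 * EB    # 1.8 zettabytes
growth = size_2011 // size_2006   # the tenfold growth

# ASSUMPTION (not from the text): a world population of about 7 billion.
WORLD_POPULATION = 7 * 10**9
gb_per_person = size_2011 / WORLD_POPULATION / 10**9

print(zb_in_eb, zb_in_pb, zb_in_tb, growth, round(gb_per_person))
```

Under that population assumption, 1.8 zettabytes works out to a few hundred gigabytes per person, which is consistent with the disk-drive comparison.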

This flood of data is coming from many sources. Consider the following:^[3]

The New York Stock Exchange generates about one terabyte of new trade data per day.
Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.
The Internet Archive stores around 2 petabytes of data and is growing at a rate of 20 terabytes per month.
The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year.

So there’s a lot of data out there. But you are probably wondering how it affects you. Most of the data is locked up in the largest web properties (like search engines) or in scientific or financial institutions, isn’t it? Does the advent of “Big Data,” as it is being called, affect smaller ...


Publisher Resources

ISBN: 9781449328917
