Hadoop: The Definitive Guide, 4th Edition

Name: Hadoop: The Definitive Guide, 4th Edition
Author: Tom White
ISBN: 9781491901632

April 2015

Beginner to intermediate

754 pages

21h 39m

English

O'Reilly Media, Inc.

Dedication
Foreword
Preface
Administrative Notes; What’s New in the Fourth Edition?; What’s New in the Third Edition?; What’s New in the Second Edition?; Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments
I. Hadoop Fundamentals
1. Meet Hadoop
Data!; Data Storage and Analysis; Querying All Your Data; Beyond Batch; Comparison with Other Systems; Relational Database Management Systems; Grid Computing; Volunteer Computing; A Brief History of Apache Hadoop; What’s in This Book?
2. MapReduce
A Weather Dataset; Data Format; Analyzing the Data with Unix Tools; Analyzing the Data with Hadoop; Map and Reduce; Java MapReduce; A test run; Scaling Out; Data Flow; Combiner Functions; Specifying a combiner function; Running a Distributed MapReduce Job; Hadoop Streaming; Ruby; Python
3. The Hadoop Distributed Filesystem
The Design of HDFS; HDFS Concepts; Blocks; Namenodes and Datanodes; Block Caching; HDFS Federation; HDFS High Availability; Failover and fencing; The Command-Line Interface; Basic Filesystem Operations; Hadoop Filesystems; Interfaces; HTTP; C; NFS; FUSE; The Java Interface; Reading Data from a Hadoop URL; Reading Data Using the FileSystem API; FSDataInputStream; Writing Data; FSDataOutputStream; Directories; Querying the Filesystem; File metadata: FileStatus; Listing files; File patterns; PathFilter; Deleting Data; Data Flow; Anatomy of a File Read; Anatomy of a File Write; Coherency Model; Consequences for application design; Parallel Copying with distcp; Keeping an HDFS Cluster Balanced
4. YARN
Anatomy of a YARN Application Run; Resource Requests; Application Lifespan; Building YARN Applications; YARN Compared to MapReduce 1; Scheduling in YARN; Scheduler Options; Capacity Scheduler Configuration; Queue placement; Fair Scheduler Configuration; Enabling the Fair Scheduler; Queue configuration; Queue placement; Preemption; Delay Scheduling; Dominant Resource Fairness; Further Reading
5. Hadoop I/O
Data Integrity; Data Integrity in HDFS; LocalFileSystem; ChecksumFileSystem; Compression; Codecs; Compressing and decompressing streams with CompressionCodec; Inferring CompressionCodecs using CompressionCodecFactory; Native libraries; CodecPool; Compression and Input Splits; Using Compression in MapReduce; Compressing map output; Serialization; The Writable Interface; WritableComparable and comparators; Writable Classes; Writable wrappers for Java primitives; Text; Indexing; Unicode; Iteration; Mutability; Resorting to String; BytesWritable; NullWritable; ObjectWritable and GenericWritable; Writable collections; Implementing a Custom Writable; Implementing a RawComparator for speed; Custom comparators; Serialization Frameworks; Serialization IDL; File-Based Data Structures; SequenceFile; Writing a SequenceFile; Reading a SequenceFile; Displaying a SequenceFile with the command-line interface; Sorting and merging SequenceFiles; The SequenceFile format; MapFile; MapFile variants; Other File Formats and Column-Oriented Formats
II. MapReduce

6. Developing a MapReduce Application
The Configuration API; Combining Resources; Variable Expansion; Setting Up the Development Environment; Managing Configuration; GenericOptionsParser, Tool, and ToolRunner; Writing a Unit Test with MRUnit; Mapper; Reducer; Running Locally on Test Data; Running a Job in a Local Job Runner; Testing the Driver; Running on a Cluster; Packaging a Job; The client classpath; The task classpath; Packaging dependencies; Task classpath precedence; Launching a Job; The MapReduce Web UI; The resource manager page; The MapReduce job page; Retrieving the Results; Debugging a Job; The tasks and task attempts pages; Handling malformed data; Hadoop Logs; Remote Debugging; Tuning a Job; Profiling Tasks; The HPROF profiler; MapReduce Workflows; Decomposing a Problem into MapReduce Jobs; JobControl; Apache Oozie; Defining an Oozie workflow; Packaging and deploying an Oozie workflow application; Running an Oozie workflow job
7. How MapReduce Works
Anatomy of a MapReduce Job Run; Job Submission; Job Initialization; Task Assignment; Task Execution; Streaming; Progress and Status Updates; Job Completion; Failures; Task Failure; Application Master Failure; Node Manager Failure; Resource Manager Failure; Shuffle and Sort; The Map Side; The Reduce Side; Configuration Tuning; Task Execution; The Task Execution Environment; Streaming environment variables; Speculative Execution; Output Committers; Task side-effect files
8. MapReduce Types and Formats
MapReduce Types; The Default MapReduce Job; The default Streaming job; Keys and values in Streaming; Input Formats; Input Splits and Records; FileInputFormat; FileInputFormat input paths; FileInputFormat input splits; Small files and CombineFileInputFormat; Preventing splitting; File information in the mapper; Processing a whole file as a record; Text Input; TextInputFormat; Controlling the maximum line length; KeyValueTextInputFormat; NLineInputFormat; XML; Binary Input; SequenceFileInputFormat; SequenceFileAsTextInputFormat; SequenceFileAsBinaryInputFormat; FixedLengthInputFormat; Multiple Inputs; Database Input (and Output); Output Formats; Text Output; Binary Output; SequenceFileOutputFormat; SequenceFileAsBinaryOutputFormat; MapFileOutputFormat; Multiple Outputs; An example: Partitioning data; MultipleOutputs; Lazy Output; Database Output
9. MapReduce Features
Counters; Built-in Counters; Task counters; Job counters; User-Defined Java Counters; Dynamic counters; Retrieving counters; User-Defined Streaming Counters; Sorting; Preparation; Partial Sort; Total Sort; Secondary Sort; Java code; Streaming; Joins; Map-Side Joins; Reduce-Side Joins; Side Data Distribution; Using the Job Configuration; Distributed Cache; Usage; How it works; The distributed cache API; MapReduce Library Classes
III. Hadoop Operations
10. Setting Up a Hadoop Cluster
Cluster Specification; Cluster Sizing; Master node scenarios; Network Topology; Rack awareness; Cluster Setup and Installation; Installing Java; Creating Unix User Accounts; Installing Hadoop; Configuring SSH; Configuring Hadoop; Formatting the HDFS Filesystem; Starting and Stopping the Daemons; Creating User Directories; Hadoop Configuration; Configuration Management; Environment Settings; Java; Memory heap size; System logfiles; SSH settings; Important Hadoop Daemon Properties; HDFS; YARN; Memory settings in YARN and MapReduce; CPU settings in YARN and MapReduce; Hadoop Daemon Addresses and Ports; Other Hadoop Properties; Cluster membership; Buffer size; HDFS block size; Reserved storage space; Trash; Job scheduler; Reduce slow start; Short-circuit local reads; Security; Kerberos and Hadoop; An example; Delegation Tokens; Other Security Enhancements; Benchmarking a Hadoop Cluster; Hadoop Benchmarks; Benchmarking MapReduce with TeraSort; Other benchmarks; User Jobs
11. Administering Hadoop
HDFS; Persistent Data Structures; Namenode directory structure; The filesystem image and edit log; Secondary namenode directory structure; Datanode directory structure; Safe Mode; Entering and leaving safe mode; Audit Logging; Tools; dfsadmin; Filesystem check (fsck); Finding the blocks for a file; Datanode block scanner; Balancer; Monitoring; Logging; Setting log levels; Getting stack traces; Metrics and JMX; Maintenance; Routine Administration Procedures; Metadata backups; Data backups; Filesystem check (fsck); Filesystem balancer; Commissioning and Decommissioning Nodes; Commissioning new nodes; Decommissioning old nodes; Upgrades; HDFS data and metadata upgrades; Start the upgrade; Wait until the upgrade is complete; Check the upgrade; Roll back the upgrade (optional); Finalize the upgrade (optional)
IV. Related Projects
12. Avro
Avro Data Types and Schemas; In-Memory Serialization and Deserialization; The Specific API; Avro Datafiles; Interoperability; Python API; Avro Tools; Schema Resolution; Sort Order; Avro MapReduce; Sorting Using Avro MapReduce; Avro in Other Languages
13. Parquet
Data Model; Nested Encoding; Parquet File Format; Parquet Configuration; Writing and Reading Parquet Files; Avro, Protocol Buffers, and Thrift; Projection and read schemas; Parquet MapReduce
14. Flume
Installing Flume; An Example; Transactions and Reliability; Batching; The HDFS Sink; Partitioning and Interceptors; File Formats; Fan Out; Delivery Guarantees; Replicating and Multiplexing Selectors; Distribution: Agent Tiers; Delivery Guarantees; Sink Groups; Integrating Flume with Applications; Component Catalog; Further Reading
15. Sqoop
Getting Sqoop; Sqoop Connectors; A Sample Import; Text and Binary File Formats; Generated Code; Additional Serialization Systems; Imports: A Deeper Look; Controlling the Import; Imports and Consistency; Incremental Imports; Direct-Mode Imports; Working with Imported Data; Imported Data and Hive; Importing Large Objects; Performing an Export; Exports: A Deeper Look; Exports and Transactionality; Exports and SequenceFiles; Further Reading
16. Pig
Installing and Running Pig; Execution Types; Local mode; MapReduce mode; Running Pig Programs; Grunt; Pig Latin Editors; An Example; Generating Examples; Comparison with Databases; Pig Latin; Structure; Statements; Expressions; Types; Schemas; Using Hive tables with HCatalog; Validation and nulls; Schema merging; Functions; Other libraries; Macros; User-Defined Functions; A Filter UDF; Leveraging types; An Eval UDF; Dynamic invokers; A Load UDF; Using a schema; Data Processing Operators; Loading and Storing Data; Filtering Data; FOREACH...GENERATE; STREAM; Grouping and Joining Data; JOIN; COGROUP; CROSS; GROUP; Sorting Data; Combining and Splitting Data; Pig in Practice; Parallelism; Anonymous Relations; Parameter Substitution; Dynamic parameters; Parameter substitution processing; Further Reading
17. Hive
Installing Hive; The Hive Shell; An Example; Running Hive; Configuring Hive; Execution engines; Logging; Hive Services; Hive clients; The Metastore; Comparison with Traditional Databases; Schema on Read Versus Schema on Write; Updates, Transactions, and Indexes; SQL-on-Hadoop Alternatives; HiveQL; Data Types; Primitive types; Complex types; Operators and Functions; Conversions; Tables; Managed Tables and External Tables; Partitions and Buckets; Partitions; Buckets; Storage Formats; The default storage format: Delimited text; Binary storage formats: Sequence files, Avro datafiles, Parquet files, RCFiles, and ORCFiles; Using a custom SerDe: RegexSerDe; Storage handlers; Importing Data; Inserts; Multitable insert; CREATE TABLE...AS SELECT; Altering Tables; Dropping Tables; Querying Data; Sorting and Aggregating; MapReduce Scripts; Joins; Inner joins; Outer joins; Semi joins; Map joins; Subqueries; Views; User-Defined Functions; Writing a UDF; Writing a UDAF; A more complex UDAF; Further Reading
18. Crunch
An Example; The Core Crunch API; Primitive Operations; union(); parallelDo(); groupByKey(); combineValues(); Types; Records and tuples; Sources and Targets; Reading from a source; Writing to a target; Existing outputs; Combined sources and targets; Functions; Serialization of functions; Object reuse; Materialization; PObject; Pipeline Execution; Running a Pipeline; Asynchronous execution; Debugging; Stopping a Pipeline; Inspecting a Crunch Plan; Iterative Algorithms; Checkpointing a Pipeline; Crunch Libraries; Further Reading
19. Spark
Installing Spark; An Example; Spark Applications, Jobs, Stages, and Tasks; A Scala Standalone Application; A Java Example; A Python Example; Resilient Distributed Datasets; Creation; Transformations and Actions; Aggregation transformations; Persistence; Persistence levels; Serialization; Data; Functions; Shared Variables; Broadcast Variables; Accumulators; Anatomy of a Spark Job Run; Job Submission; DAG Construction; Task Scheduling; Task Execution; Executors and Cluster Managers; Spark on YARN; YARN client mode; YARN cluster mode; Further Reading
20. HBase
HBasics; Backdrop; Concepts; Whirlwind Tour of the Data Model; Regions; Locking; Implementation; HBase in operation; Installation; Test Drive; Clients; Java; MapReduce; REST and Thrift; Building an Online Query Application; Schema Design; Loading Data; Load distribution; Bulk load; Online Queries; Station queries; Observation queries; HBase Versus RDBMS; Successful Service; HBase; Praxis; HDFS; UI; Metrics; Counters; Further Reading
21. ZooKeeper
Installing and Running ZooKeeper; An Example; Group Membership in ZooKeeper; Creating the Group; Joining a Group; Listing Members in a Group; ZooKeeper command-line tools; Deleting a Group; The ZooKeeper Service; Data Model; Ephemeral znodes; Sequence numbers; Watches; Operations; Multiupdate; APIs; Watch triggers; ACLs; Implementation; Consistency; Sessions; Time; States; Building Applications with ZooKeeper; A Configuration Service; The Resilient ZooKeeper Application; InterruptedException; KeeperException; State exceptions; Recoverable exceptions; Unrecoverable exceptions; A reliable configuration service; A Lock Service; The herd effect; Recoverable exceptions; Unrecoverable exceptions; Implementation; More Distributed Data Structures and Protocols; BookKeeper and Hedwig; ZooKeeper in Production; Resilience and Performance; Configuration; Further Reading
V. Case Studies
22. Composable Data at Cerner
From CPUs to Semantic Integration; Enter Apache Crunch; Building a Complete Picture; Integrating Healthcare Data; Composability over Frameworks; Moving Forward
23. Biological Data Science: Saving Lives with Software
The Structure of DNA; The Genetic Code: Turning DNA Letters into Proteins; Thinking of DNA as Source Code; The Human Genome Project and Reference Genomes; Sequencing and Aligning DNA; ADAM, A Scalable Genome Analysis Platform; Literate programming with the Avro interface description language (IDL); Column-oriented access with Parquet; A simple example: k-mer counting using Spark and ADAM; From Personalized Ads to Personalized Medicine; Join In
24. Cascading
Fields, Tuples, and Pipes; Operations; Taps, Schemes, and Flows; Cascading in Practice; Flexibility; Hadoop and Cascading at ShareThis; Summary
A. Installing Apache Hadoop
Prerequisites; Installation; Configuration; Standalone Mode; Pseudodistributed Mode; Configuring SSH; Formatting the HDFS filesystem; Starting and stopping the daemons; Creating a user directory; Fully Distributed Mode
B. Cloudera’s Distribution Including Apache Hadoop
C. Preparing the NCDC Weather Data
D. The Old and New Java MapReduce APIs
Index
Colophon
Copyright

Content preview from Hadoop: The Definitive Guide, 4th Edition

Chapter 13. Parquet

Apache Parquet is a columnar storage format that can efficiently store nested data.

Columnar formats are attractive since they enable greater efficiency, in terms of both file size and query performance. File sizes are usually smaller than row-oriented equivalents since in a columnar format the values from one column are stored next to each other, which usually allows a very efficient encoding. A column storing a timestamp, for example, can be encoded by storing the first value and the differences between subsequent values (which tend to be small due to temporal locality: records from around the same time are stored next to each other). Query performance is improved too since a query engine can skip over columns that are not needed to answer a query. (This idea is illustrated in Figure 5-4.) This chapter looks at Parquet in more depth, but there are other columnar formats that work with Hadoop—notably ORCFile (Optimized Record Columnar File), which is a part of the Hive project.
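
The timestamp encoding described above can be sketched in a few lines of Python. This is only an illustration of the general idea of delta encoding, not Parquet's actual encoder; the function names and the sample epoch-second values are hypothetical and chosen for this example.

```python
from typing import List


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store each value's difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(encoded: List[int]) -> List[int]:
    """Rebuild the original values by accumulating the stored differences."""
    decoded: List[int] = []
    running = 0
    for delta in encoded:
        running += delta
        decoded.append(running)
    return decoded


# Hypothetical epoch-second timestamps from records written close together.
timestamps = [1428000000, 1428000002, 1428000003, 1428000007]
encoded = delta_encode(timestamps)
print(encoded)  # [1428000000, 2, 1, 4]
```

After the first value, every stored number is small, so a real format could pack the differences into far fewer bits than the full timestamps would need.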

A key strength of Parquet is its ability to store data that has a deeply nested structure in true columnar fashion. This is important since schemas with several levels of nesting are common in real-world systems. Parquet uses a novel technique for storing nested structures in a flat columnar format with little overhead, which was introduced by Google engineers in the Dremel paper.^[86] The result is that even nested fields can be read independently of other fields, resulting in significant ...

Publisher Resources

ISBN: 9781491901687
