Data Analytics with Hadoop

by Benjamin Bengfort, Jenny Kim

June 2016

Intermediate to advanced

286 pages

8h 9m

English

O'Reilly Media, Inc.

What to Expect from This Book; Who This Book Is For; How to Read This Book; Overview of Chapters; Programming and Code Examples; GitHub Repository; Executing Distributed Jobs; Permissions and Citation; Feedback and How to Contact Us; Safari® Books Online; How to Contact Us; Acknowledgments
What Is a Data Product?; Building Data Products at Scale with Hadoop; Leveraging Large Datasets; Hadoop for Data Products; The Data Science Pipeline and the Hadoop Ecosystem; Big Data Workflows; Conclusion
Basic Concepts; Hadoop Architecture; A Hadoop Cluster; HDFS; YARN; Working with a Distributed File System; Basic File System Operations; File Permissions in HDFS; Other HDFS Interfaces; Working with Distributed Computation; MapReduce: A Functional Programming Model; MapReduce: Implemented on a Cluster; Beyond a Map and Reduce: Job Chaining; Submitting a MapReduce Job to YARN; Conclusion
Hadoop Streaming; Computing on CSV Data with Streaming; Executing Streaming Jobs; A Framework for MapReduce with Python; Counting Bigrams; Other Frameworks; Advanced MapReduce; Combiners; Partitioners; Job Chaining; Conclusion
Spark Basics; The Spark Stack; Resilient Distributed Datasets; Programming with RDDs; Interactive Spark Using PySpark; Writing Spark Applications; Visualizing Airline Delays with Spark; Conclusion
Computing with Keys; Compound Keys; Keyspace Patterns; Pairs versus Stripes; Design Patterns; Summarization; Indexing; Filtering; Toward Last-Mile Analytics; Fitting a Model; Validating Models; Conclusion
Structured Data Queries with Hive; The Hive Command-Line Interface (CLI); Hive Query Language (HQL); Data Analysis with Hive; HBase; NoSQL and Column-Oriented Databases; Real-Time Analytics with HBase; Conclusion
Importing Relational Data with Sqoop; Importing from MySQL to HDFS; Importing from MySQL to Hive; Importing from MySQL to HBase; Ingesting Streaming Data with Flume; Flume Data Flows; Ingesting Product Impression Data with Flume; Conclusion
Pig; Pig Latin; Data Types; Relational Operators; User-Defined Functions; Wrapping Up; Spark’s Higher-Level APIs; Spark SQL; DataFrames; Conclusion
Scalable Machine Learning with Spark; Collaborative Filtering; Classification; Clustering; Conclusion
Data Product Lifecycle; Data Lakes; Data Ingestion; Computational Data Stores; Machine Learning Lifecycle; Conclusion
Quick Start; Setting Up Linux; Creating a Hadoop User; Configuring SSH; Installing Java; Disabling IPv6; Installing Hadoop; Unpacking; Environment; Hadoop Configuration; Formatting the Namenode; Starting Hadoop; Restarting Hadoop
Packaged Hadoop Distributions; Self-Installation of Apache Hadoop Ecosystem Products; Basic Installation and Configuration Steps; Sqoop-Specific Configurations; Hive-Specific Configuration; HBase-Specific Configurations; Installing Spark

Content preview from Data Analytics with Hadoop

Chapter 6. Data Mining and Warehousing

As data analysts, we often prefer to focus on mining data for meaningful insights or applying predictive modeling methods to data that has already been curated, cleaned, and staged for our analysis. However, in most traditional enterprise data environments, a tremendous amount of engineering effort and technical resources goes into funneling and organizing this data into a unified data warehouse before any meaningful data analysis can happen.

The enterprise data warehouse (EDW) has thus become the linchpin in most organizations that process and analyze data at scale. However, because the overwhelming majority of EDWs utilize some form of relational database management system (RDBMS) as the primary storage and querying engine, much of the effort in setting up new data analysis projects is spent on up-front schema design and extract, transform, and load (ETL) operations. It’s estimated that ETL consumes 70–80% of data warehousing costs, risks, and implementation time.¹ This overhead makes it costly to perform even modest levels of data analysis prototyping or exploratory analysis.

RDBMSs present another limitation in the face of the rapidly expanding diversity of data types that we need to store and analyze, which can be unstructured (emails, multimedia files) or semi-structured (clickstream data) in nature. The velocity and variety of this data often demand the ability to evolve the schema in a “just-in-time” manner, which ...