Data Munging with Hadoop

Released November 2015

Publisher(s): Addison-Wesley Professional

ISBN: 9780134435534

Book description

The Example-Rich, Hands-On Guide to Data Munging with Apache Hadoop™

Data scientists spend much of their time “munging” data: handling day-to-day tasks such as data cleansing, normalization, aggregation, sampling, and transformation. These tasks are both critical and surprisingly interesting. Most important, they deepen your understanding of your data’s structure and limitations: crucial insight for improving accuracy and mitigating risk in any analytical project.

Now, two leading Hortonworks data scientists, Ofer Mendelevitch and Casey Stella, bring together powerful, practical insights for effective Hadoop-based data munging of large datasets. Drawing on extensive experience with advanced analytics, the authors offer realistic examples that address the common issues you’re most likely to face. They describe each task in detail, presenting example code based on widely used tools such as Pig, Hive, and Spark.

This concise, hands-on eBook is valuable for every data scientist, data engineer, and architect who wants to master data munging: not just in theory, but in practice with the field’s #1 platform, Hadoop.

Coverage includes

A framework for understanding the various types of data quality checks, including cell-based rules, distribution validation, and outlier analysis

Assessing tradeoffs in common approaches to imputing missing values

Implementing quality checks with Pig or Hive UDFs

Transforming raw data into “feature matrix” format for machine learning algorithms

Choosing features and instances

Implementing text features via “bag-of-words” and NLP techniques

Handling time-series data via frequency- or time-domain methods

Manipulating feature values to prepare for modeling

Data Munging with Hadoop is part of a larger, forthcoming work entitled Data Science Using Hadoop. To be notified when the larger work is available, register your purchase of Data Munging with Hadoop at informit.com/register and check the box “I would like to hear from InformIT and its family of brands about products and special offers.”
