Data Analytics with Hadoop

by Benjamin Bengfort, Jenny Kim
June 2016
Intermediate to advanced
286 pages
8h 9m
English
O'Reilly Media, Inc.

Chapter 6. Data Mining and Warehousing

As data analysts, we often prefer to focus on mining data for meaningful insights or applying predictive modeling methods to data that has already been curated, cleaned, and staged for analysis. However, in most traditional enterprise data environments, a tremendous amount of engineering effort and technical resources goes into funneling and organizing this data into a unified data warehouse before any meaningful analysis can happen.

The enterprise data warehouse (EDW) has thus become the linchpin in most organizations that process and analyze data at scale. However, because the overwhelming majority of EDWs utilize some form of relational database management system (RDBMS) as the primary storage and querying engine, much of the effort in setting up new data analysis projects is spent on up-front schema design and extract, transform, and load (ETL) operations. It’s estimated that ETL consumes 70–80% of data warehousing costs, risks, and implementation time.1 This overhead makes it costly to perform even modest levels of data analysis prototyping or exploratory analysis.
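To make the up-front cost concrete, here is a minimal sketch of a schema-first ETL pass, using Python's built-in `sqlite3` module as a stand-in for an enterprise RDBMS. The sample CSV, the `users` table, and the `extract`, `transform`, and `load` helpers are all hypothetical and invented for illustration; they are not taken from the book. The point is only that the table definition and the cleaning rules must both exist before a single query can run.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; column names and values are invented for illustration.
RAW_CSV = """user_id,signup_date,plan
1,2016-01-15,free
2,2016/02/03,PRO
3,,free
"""


def extract(text):
    """Extract: parse the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize values and drop rows that would violate the schema."""
    cleaned = []
    for row in rows:
        if not row["signup_date"]:
            continue  # the table below declares signup_date as NOT NULL
        cleaned.append((
            int(row["user_id"]),
            row["signup_date"].replace("/", "-"),  # unify date separators
            row["plan"].lower(),                   # unify plan labels
        ))
    return cleaned


def load(conn, rows):
    """Load: the schema has to be declared before any row can be stored."""
    conn.execute(
        "CREATE TABLE users ("
        "user_id INTEGER PRIMARY KEY, "
        "signup_date TEXT NOT NULL, "
        "plan TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))

# Only now, after schema design and ETL, can the analysis query run.
plan_counts = conn.execute(
    "SELECT plan, COUNT(*) FROM users GROUP BY plan ORDER BY plan"
).fetchall()
print(plan_counts)
```

Even in this toy case, a change in the source data (a new column, a new date format) would mean revisiting both the `CREATE TABLE` statement and the transform step before the analysis could continue.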

RDBMSs present another limitation in the face of the rapidly expanding diversity of data types that we need to store and analyze, which can be unstructured (emails, multimedia files) or semi-structured (clickstream data) in nature. The velocity and variety of this data often demand the ability to evolve the schema in a “just-in-time” manner, which ...


Publisher Resources

ISBN: 9781491913734