Chapter 6. Data Mining and Warehousing

As data analysts, we often prefer to focus on mining data for meaningful insights or applying predictive modeling methods to data that has already been curated, cleaned, and staged for our analysis. However, in most traditional enterprise data environments, tremendous engineering and technical resources go into funneling and organizing this data into a unified data warehouse before any meaningful analysis can happen.

The enterprise data warehouse (EDW) has thus become the linchpin in most organizations that process and analyze data at scale. However, because the overwhelming majority of EDWs utilize some form of relational database management system (RDBMS) as the primary storage and querying engine, much of the effort in setting up new data analysis projects is spent on up-front schema design and extract, transform, and load (ETL) operations. It’s estimated that ETL consumes 70–80% of data warehousing costs, risks, and implementation time.1 This overhead makes it costly to perform even modest levels of data analysis prototyping or exploratory analysis.

RDBMSs present another limitation given the rapidly expanding diversity of data that we need to store and analyze, which can be unstructured (emails, multimedia files) or semi-structured (clickstream data). The velocity and variety of this data often demand the ability to evolve the schema in a “just-in-time” manner, which ...
