Chapter 1. Moving to Big Data Analysis
Can you tell by sailing the surface of a lake whether it has been well maintained? Can local fish and plants survive? Dare you swim? And how about the data maintained in your organization’s data lake? Can you tell whether it’s healthy enough to support your business needs?
An increasing number of organizations maintain fast-growing repositories of data, usually from multiple sources and formatted in multiple ways, that are commonly called “data lakes.” They use a variety of storage and processing tools—especially in the Hadoop family—to extract value quickly and inform key organizational decisions.
This report looks at the common needs that modern organizations have for data management and governance. The MapReduce model, introduced in 2004 in a paper¹ by Jeffrey Dean and Sanjay Ghemawat, completely overturned the way the computing community approached big data analysis. Many other models, such as Spark, have come since then, generating excitement and seeing eager adoption by organizations of all sizes to solve problems for which relational databases were not suited. But these technologies bring with them new demands for organizing data and keeping track of what you’ve got.
I take it for granted that you understand the value of undertaking a big data initiative, as well as the value of a framework such as Hadoop, and are in the process of transforming the way you manage your organization’s data. I have interviewed a number of experts in data ...