CHAPTER 2Data and Storage

Modern computing is a data-driven undertaking, with global enterprises accumulating petabytes of data that need to be stored and queried to extract valuable knowledge. Before the advent of big data technologies, we used file systems or relational databases for this purpose, but with datasets now in the petabytes scale and beyond, relational database management systems and traditional file systems cannot scale. Distributed file systems, NoSQL databases, and object storage systems are used today to handle these large datasets.

In this chapter, we will go through important concepts for data and storage. Since the main focus of this book is on data processing, we will keep our discussion limited to cover the most important topics.

Storage Systems

Storage systems are the platforms on which modern enterprises are built. They contain petabytes of data serving applications running nonstop. We have seen many types of storage systems over the years developed to support specific use cases and the demands of various applications. Because of this, we must consider many factors when choosing storage systems. In the era of large data, the first factor we need to take into account is how much data we are going to handle along with cost versus performance considerations.

There are numerous storage media available, including tape drives, hard disks, solid-state drives (SSDs), and nonvolatile media (NVM). Hard disks can be in disk arrays in RAID configurations to provide ...

Get Foundations of Data Intensive Applications now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.