They say it takes a village to raise a child. Developing data-intensive applications is no different. Coming up with an algorithm that can exploit an abundance of data is just one part of the ecosystem required. In this chapter, we start with a simple data analytics example to dive into the anatomy of data-intensive applications. We then look at the steps for scaling an algorithm so that it can use distributed resources in parallel. We end the chapter with a discussion of some of the challenges faced when developing large-scale systems.
Anatomy of a Data-Intensive Application
Data-intensive applications are designed to run over distributed resources, such as a cluster of computers or a cloud. They are built on supporting frameworks, both to reduce development time and to keep the application logic simple, so that developers do not have to deal with the intricacies of distributed systems. These frameworks provide the components needed to create and execute data-intensive applications at scale. When an application written using a framework is deployed to a cluster, the framework and the application become a single distributed program. For our discussion, we will therefore treat the framework as part of the application. To understand what is under the hood of a data-intensive application, let us try to write one from the ground up.
A Histogram Example
Imagine we have many CSV files with data about users. For our purposes, ...