Chapter 2

Scalability and Cost Evaluation of Incremental Data Processing Using Amazon’s Hadoop Service

Xing Wu, Yan Liu, and Ian Gorton

Abstract

Based on the MapReduce model and the Hadoop Distributed File System (HDFS), Hadoop enables scalable, fault-tolerant distributed processing of large data sets across clusters. Many data-intensive applications involve continuous, incremental updates to their data. Understanding how well a Hadoop platform scales, and at what cost, when handling small and independent updates to data sets sheds light on the design of scalable and cost-effective data-intensive applications. In this chapter, we introduce a motivating movie recommendation application implemented in the MapReduce model and deployed on Amazon Elastic ...
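As background for the MapReduce model the chapter builds on, the sketch below simulates Hadoop's map, shuffle/sort, and reduce phases locally for a hypothetical movie-rating aggregation job (computing each movie's average rating). The record format and function names are illustrative assumptions, not the chapter's actual implementation:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Hypothetical input record: "user_id\tmovie_id\trating"
    user, movie, rating = line.split("\t")
    yield movie, float(rating)

def reducer(movie, ratings):
    # Aggregate all ratings shuffled to this key into an average
    ratings = list(ratings)
    yield movie, sum(ratings) / len(ratings)

def run_job(lines):
    # Simulate Hadoop's pipeline: map -> shuffle/sort by key -> reduce
    pairs = sorted((kv for line in lines for kv in mapper(line)),
                   key=itemgetter(0))
    results = {}
    for movie, group in groupby(pairs, key=itemgetter(0)):
        for key, avg in reducer(movie, (r for _, r in group)):
            results[key] = avg
    return results

data = ["u1\tm1\t4", "u2\tm1\t5", "u1\tm2\t3"]
print(run_job(data))  # -> {'m1': 4.5, 'm2': 3.0}
```

On a real Hadoop cluster, `mapper` and `reducer` would run as distributed tasks over HDFS blocks; the local simulation only illustrates the data flow. Note that a full recomputation like this is exactly what makes small, incremental updates costly, since the entire data set is reprocessed each run.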
