Chapter 14. THE MapReduce PROGRAMMING MODEL AND IMPLEMENTATIONS
HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI
INTRODUCTION
The computing world has recently been undergoing a significant transformation: from the traditional decentralized distributed system architecture, typified by data and computation spread across different geographic areas, to a centralized cloud computing architecture, where computation and data reside somewhere in the "cloud", that is, in data centers owned and maintained by a third party.
Interest in cloud computing has been motivated by many factors [1], such as the low cost of system hardware, the increase in computing power and storage capacity (e.g., a modern data center consists of hundreds of thousands of cores and petascale storage), and the massive growth in the volume of data generated by digital media (images/audio/video), Web authoring, scientific instruments, physical simulations, and so on. Even so, the main challenge in the cloud remains how to effectively store, query, analyze, and utilize these immense datasets. The traditional data-intensive system (the data-to-computing paradigm) is not efficient for cloud computing because the Internet becomes a bottleneck when large amounts of data are transferred to a distant CPU [2]. New paradigms should be adopted in which computing and data resources are co-located, thus minimizing communication cost and benefiting from the large improvements in I/O speed offered by local disks, as shown in ...