39 Parallelism for Big Data

Since the 1990s, providers such as Teradata have specialized in solutions that store data and then allow parallel operations on massive amounts of it. CERN, on the other hand, is a textbook example of an institution that handles big datasets, but it relies on supercomputers (high-performance computers) rather than resilient distributed computing. In general – and for most applications – there are two solutions: one is resilient distributed computing, where the data and processing units (PUs) are commodity hardware and redundant; the second is high-performance computing with racks of high-grade CPUs.

One particularly successful solution is breaking the data up into parts and storing each part (with some redundancy built in) on a computer with its own CPU and data storage. This makes it possible to bring the calculations to the multiple parts of the data instead of bringing all the data to a single CPU.
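As a rough single-machine illustration of this idea, the sketch below uses the parallel package that ships with base R to split a vector into four parts and give each part to its own worker process, so that every worker computes over its own chunk of the data; the variable names and the choice of four workers are ours for illustration only.

    library(parallel)

    x <- runif(1e6)                            # the full dataset
    chunks <- split(x, cut(seq_along(x), 4))   # break the data into four parts

    cl <- makeCluster(4)                       # one worker ("node") per part
    partial_sums <- parLapply(cl, chunks, sum) # each worker sums its own part
    stopCluster(cl)

    total <- Reduce(`+`, partial_sums)         # combine the partial results
    total

In a real cluster the chunks would live on separate machines from the start, so only the small partial results travel over the network, not the data itself.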

In 2004, Google published the seminal paper in which it described “MapReduce,” a programming model for processing huge amounts of data on parallel nodes. MapReduce splits a query and distributes the parts over the parallel nodes, where they are processed simultaneously (in parallel). This phase is called the “Map step.” Once a node has obtained its results, it passes them on to the layer above. This is the “Reduce step,” where the results are stitched together for the user.
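The canonical word-count example translates into a minimal, single-machine sketch using base R's Map() and Reduce(); the helper functions map_step() and merge_counts() are hypothetical names of our own, not part of any MapReduce library.

    lines <- c("big data needs big ideas",
               "map the data reduce the results",
               "big data big results")

    # Map step: each line (one "node" worth of input) becomes a
    # table of word counts, computed independently of the others.
    map_step <- function(line) table(strsplit(line, " ")[[1]])
    partials <- Map(map_step, lines)

    # Reduce step: merge the partial counts pairwise into one result.
    merge_counts <- function(a, b) {
      keys <- union(names(a), names(b))
      sapply(keys, function(k) sum(a[k], b[k], na.rm = TRUE))
    }
    word_counts <- Reduce(merge_counts, partials)
    word_counts

Because each call to map_step() sees only its own line and merge_counts() only combines finished partial results, the Map step can run on any number of nodes at once without coordination.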

This approach has many advantages:

  • it is ...
