Making sense of why we need a distributed framework

The easiest way to build a cluster is to use some nodes for storage and others for processing. This configuration looks simple because it needs no complex framework to coordinate the nodes. In fact, many small clusters are built exactly this way: a couple of servers hold the data (plus its replicas) and another group of servers processes it. Although this may appear to be a great solution, it's not often used, for several reasons:

  • It only works for embarrassingly parallel algorithms. If an algorithm requires a common area of memory shared among the processing servers, this approach cannot be used.
  • If one or more storage nodes fail, the data is not guaranteed to be consistent. ...
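
To make the first limitation concrete, here is a minimal sketch of what "embarrassingly parallel" means. It is an illustration under stated assumptions, not a cluster implementation: threads on one machine stand in for processing nodes, and the function names (split_into_chunks, sum_of_squares, parallel_sum_of_squares, running_total) are hypothetical. A sum of squares can be computed chunk by chunk with no communication between workers, while a running total cannot, because each output depends on everything that came before it.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, n_chunks):
    """Split data into n_chunks contiguous pieces of nearly equal size."""
    size, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    # Embarrassingly parallel step: uses only its own chunk,
    # so no worker needs to talk to any other worker.
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, n_workers=3):
    # Assumption: threads play the role of independent processing nodes.
    chunks = split_into_chunks(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    # Combining the partial results is the only coordination required.
    return sum(partials)


def running_total(data):
    # Not embarrassingly parallel as written: each output depends on the
    # total accumulated so far, so chunks cannot be processed in isolation
    # without exchanging state between workers.
    total, out = 0, []
    for x in data:
        total += x
        out.append(total)
    return out
```

Calling parallel_sum_of_squares(list(range(10))) gives the same result as the sequential sum. The second function shows the kind of dependency that a plain storage-plus-processing split does not handle on its own.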
