July 2017
Beginner to intermediate
378 pages
10h 26m
English
Designing distributed computing analytics requires you to think about what can run in parallel and what has to run one step after another. Running computations in parallel is where much of the speed advantage of cluster computing systems such as Spark comes from, but it does require a slightly different way of thinking.
Think in terms of how to split up an analytics job into actions that can be run either record by record or on a small subset of records, without needing to know what is going on elsewhere in the full dataset. A simple example is a word count exercise.
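As a rough illustration of that split, here is a minimal sketch in plain Python rather than Spark itself. The sample records and the helper names `count_words` and `merge_counts` are hypothetical, chosen only for this example. The per-record step needs no knowledge of any other record, so it could run anywhere in a cluster; only the merge step brings results together.

```python
from collections import Counter
from functools import reduce


def count_words(record: str) -> Counter:
    """Count the words in one record, independently of all other records."""
    return Counter(record.lower().split())


def merge_counts(left: Counter, right: Counter) -> Counter:
    """Combine two partial counts; the merge order does not affect the total."""
    return left + right


# Hypothetical sample records standing in for rows of a much larger dataset.
records = [
    "spark runs tasks in parallel",
    "parallel tasks run on partitions",
    "spark splits data into partitions",
]

# Step 1 (parallelizable): each record is processed on its own.
partial_counts = [count_words(record) for record in records]

# Step 2 (combining): the partial results are merged into one total.
total = reduce(merge_counts, partial_counts, Counter())

print(total.most_common(4))
```

In Spark, the same shape is typically expressed with a per-record transformation such as `flatMap` followed by a keyed aggregation such as `reduceByKey`; the sketch above only mimics that structure on a single machine.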
Imagine you have millions of rows of survey results and need to analyze the survey question: "Comment about the insightfulness ...