January 2019
Beginner to intermediate
154 pages
4h 31m
English
Running a machine-learning algorithm is difficult when the data is distributed across multiple machines. A calculation may depend on a data point that is stored or processed on a different executor. Data can be shuffled across executors or workers, but shuffling comes at a heavy cost. Spark offers a way to limit that cost: caching. Because iterative algorithms pass over the same data many times, keeping it in memory means it does not have to be recomputed or reshuffled on every pass. Spark's ability to keep a large amount of data in memory makes it easier to write machine-learning algorithms.
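As a rough illustration of why caching matters for iterative algorithms, here is a minimal PySpark sketch. It is not taken from the book: the function names (`gradient`, `fit`, `run_local_demo`), the toy data, and the hyperparameters are all assumptions chosen for illustration. It runs gradient descent for a one-feature model and caches the input RDD, since every iteration makes a full pass over it.

```python
# Illustrative sketch only: names, data and hyperparameters are hypothetical.

def gradient(point, w):
    """Gradient of the squared error for a one-feature model y ~ w * x."""
    x, y = point
    return 2.0 * (w * x - y) * x


def fit(points, iterations=20, learning_rate=0.01):
    """Run gradient descent over an RDD of (x, y) pairs.

    Each iteration scans the whole dataset, so the RDD is cached to keep
    it in executor memory instead of recomputing it on every pass.
    """
    points.cache()
    n = points.count()  # first action; this materialises the cache
    w = 0.0
    for _ in range(iterations):
        # Sum the per-point gradients across all partitions.
        total = points.map(lambda p: gradient(p, w)).reduce(lambda a, b: a + b)
        w -= learning_rate * total / n
    points.unpersist()  # release the cached partitions when done
    return w


def run_local_demo():
    """Run fit() on a tiny in-memory dataset using a local Spark session."""
    from pyspark.sql import SparkSession  # imported here so the helpers above stay Spark-free

    spark = SparkSession.builder.master("local[*]").appName("cache-sketch").getOrCreate()
    data = spark.sparkContext.parallelize([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
    try:
        return fit(data)
    finally:
        spark.stop()
```

Calling `run_local_demo()` in an environment with PySpark installed returns the fitted weight. Without the `cache()` call the result would be the same, but Spark would rebuild the RDD from its lineage on each iteration, which becomes expensive once the input comes from disk or from an earlier shuffle.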
Spark MLlib and ML are Spark's packages for working with machine-learning algorithms. They provide the following: