Scaling the ML pipelines

Data mining and machine learning algorithms pose significant challenges for parallel and distributed computing platforms. Furthermore, parallelizing machine learning algorithms is highly task-specific and often depends on those same challenges. In Chapter 1, Introduction to Data Analytics with Spark, we discussed and showed how to deploy the same machine learning application on top of a cluster or cloud computing infrastructure (that is, Amazon AWS/EC2).

Following that approach, we can handle very large datasets, either in batches or in real time. In addition, scaling up machine learning applications involves further trade-offs, such as cost, complexity, run-time, and technical requirements. Furthermore, making ...
