Introducing Apache Spark

Apache Spark is an evolution of Hadoop and has become very popular in the last few years. In contrast to Hadoop and its Java-centric, batch-focused design, Spark makes it fast and easy to run iterative algorithms. It also offers a rich suite of APIs for multiple programming languages and natively supports many types of data processing (machine learning, streaming, graph analysis, SQL, and so on).

Apache Spark is a cluster-computing framework designed for fast, general-purpose processing of big data. Part of its speed advantage comes from keeping the data in memory after every job rather than writing it to the filesystem (unless you ask it to), as would have happened with ...
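To see why keeping data in memory matters for iterative algorithms, here is a minimal plain-Python analogy. It is not Spark code and does not use the Spark API: the function `run_iterations`, the temporary JSON file, and the toy "job" (averaging a list of numbers) are all hypothetical names and choices made for illustration. The sketch only counts how many times the dataset has to be fetched from the filesystem when each iteration reloads it, compared with loading it once and reusing the in-memory copy.

```python
import json
import os
import tempfile


def run_iterations(path, n_iter, keep_in_memory):
    """Run a toy iterative job; return (result, number_of_disk_reads)."""
    disk_reads = 0
    cached = None
    total = 0.0
    for _ in range(n_iter):
        if keep_in_memory and cached is not None:
            data = cached              # reuse the copy already in memory
        else:
            with open(path) as fh:     # go back to the filesystem
                data = json.load(fh)
            disk_reads += 1
            if keep_in_memory:
                cached = data          # keep it for the next iteration
        total += sum(data) / len(data)  # the "job": average the values
    return total, disk_reads


# Hypothetical dataset, written to a temporary file for the demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(list(range(1000)), tmp)
    dataset_path = tmp.name

try:
    disk_result, disk_reads = run_iterations(dataset_path, 5, keep_in_memory=False)
    mem_result, mem_reads = run_iterations(dataset_path, 5, keep_in_memory=True)
finally:
    os.remove(dataset_path)

print("reads when reloading every iteration:", disk_reads)
print("reads when keeping data in memory:  ", mem_reads)
```

Both runs compute the same result, but the in-memory variant touches the filesystem only once. On a real cluster the saving is larger, since each avoided read may also avoid network transfer and deserialization.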
