Summary

In this chapter, we introduced the Hadoop ecosystem, including its architecture, HDFS, and PySpark. We then set up a local Spark instance, shared variables across cluster nodes, and processed data in Spark using both RDDs and DataFrames.

Later in the chapter, we turned to machine learning with Spark: reading a dataset, training a learner, using the machine learning pipeline, applying cross-validation, and testing what we learned on an example dataset.

This concludes our journey through the essentials of data science with Python; the next chapter is an appendix to refresh and strengthen your Python foundations. In conclusion, ...
