In the previous chapter, we looked at the core strength of the Spark framework and how to use it in different ways. This chapter focuses on how to use PySpark to handle data. The same steps apply when dealing with a very large dataset, but for demonstration purposes we will work with a relatively small sample. Data ingestion, cleaning, and processing are critical steps in any data pipeline before the data can be used for Machine Learning ...
2. Manage Data with PySpark
Get Machine Learning with PySpark: With Natural Language Processing and Recommender Systems now with the O’Reilly learning platform.