Data and feature preparation

Everyone who has worked with open data will agree that a huge amount of time is needed to clean datasets, with a lot of work to be completed to take care of data accuracy and data incompleteness.

Also, one main task is to merge all the datasets together, as we have separate datasets for crime, education, resource usage, request demand, and transportation from the open datasets. We also have datasets from some separate sources, including census.

In the Feature extraction section of Chapter 2, Data Preparation for Spark ML, we reviewed a few methods for feature extraction and discussed their implementation on Apache Spark. All the techniques discussed there can be applied to our data here.

Besides data merging, we will ...

Get Apache Spark Machine Learning Blueprints now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.

Start your free trial

Apache Spark Machine Learning Blueprints by Alex Liu

Data and feature preparation

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly