Data and feature preparation

Anyone who has worked with open data knows that cleaning the datasets takes a great deal of time, and that much of the work goes into correcting inaccurate data and handling incomplete data.

Another main task is to merge the datasets, because the open data comes as separate datasets for crime, education, resource usage, request demand, and transportation. We also have datasets from other sources, including the census.
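As a rough illustration of the merging step, the following minimal sketch joins a few per-topic datasets on a shared key using plain Python. The key name `area_id`, the column names, and all of the values are hypothetical and chosen only for illustration; the text does not say which key the real datasets share. On Spark, a DataFrame join on the common key would likely play the same role.

```python
from collections import defaultdict

# Hypothetical records: the "area_id" key and every value are made up.
crime = [
    {"area_id": "A1", "crime_count": 120},
    {"area_id": "A2", "crime_count": 45},
]
education = [
    {"area_id": "A1", "graduation_rate": 0.82},
    {"area_id": "A3", "graduation_rate": 0.91},
]
census = [
    {"area_id": "A1", "population": 15000},
    {"area_id": "A2", "population": 8200},
    {"area_id": "A3", "population": 4300},
]


def merge_on_key(datasets, key):
    """Full outer join of several lists of dicts on a shared key."""
    merged = defaultdict(dict)
    columns = []
    for rows in datasets:
        for row in rows:
            # Combine all fields seen for the same key value.
            merged[row[key]].update(row)
            # Remember each column once, in first-seen order.
            for col in row:
                if col not in columns:
                    columns.append(col)
    # Fill missing fields with None so that incompleteness stays visible.
    return [
        {col: merged[k].get(col) for col in columns}
        for k in sorted(merged)
    ]


merged_rows = merge_on_key([crime, education, census], key="area_id")
for row in merged_rows:
    print(row)
```

An outer join is used here so that no area is dropped when one source lacks it. The resulting `None` values mark exactly the incompleteness that the cleaning work then has to address.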

In the Feature extraction section of Chapter 2, Data Preparation for Spark ML, we reviewed a few methods for feature extraction and discussed their implementation on Apache Spark. All the techniques discussed there can be applied to our data here.

Besides data merging, we will ...
