Feature preparation

In the Feature extraction section of Chapter 2, Data Preparation for Spark ML, we reviewed a few methods for feature extraction and their implementation on Apache Spark. All the techniques discussed there can be applied to our datasets here, especially those that use time series to create new features.
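As a rough illustration of creating features from a time series, the following sketch summarizes one student's weekly activity into a few numeric features. It is plain Python rather than Spark code, and the input (weekly login counts) and the feature names `mean_logins`, `trend`, and `weeks_inactive` are assumptions made for this example, not fields from the project's datasets.

```python
from statistics import mean


def time_series_features(weekly_logins):
    """Summarize one student's weekly login counts (hypothetical input)."""
    n = len(weekly_logins)
    if n < 2:
        raise ValueError("need at least two weeks of data")
    half = n // 2
    # Compare average activity in the later weeks with the earlier weeks.
    early = mean(weekly_logins[:half])
    late = mean(weekly_logins[half:])
    # Index of the last week with any activity, or -1 if there was none.
    last_active = max(
        (i for i, count in enumerate(weekly_logins) if count > 0), default=-1
    )
    return {
        "mean_logins": mean(weekly_logins),
        "trend": late - early,  # negative when activity is falling
        "weeks_inactive": n - 1 - last_active,
    }


features = time_series_features([4, 4, 2, 0])
```

On Spark, the same kind of summary would typically be computed per student with grouped aggregations over a DataFrame; the function above only shows the logic for a single series.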

As mentioned earlier, for this project, we have a categorical target variable, student attrition, and a lot of data on demographics, behavior, performance, and interventions. The demographic data is almost ready to be used but still needs to be merged. The following table gives a partial list of the features:

Feature name	Description
`ACT`	Average ACT score
`AGE`	Age
`UNEMPLOYMENT ...`
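The merge step mentioned above can be sketched as a left join on a shared key. This is a minimal plain-Python sketch under stated assumptions: the key `student_id`, the column `credits_earned`, and all values are hypothetical and exist only for illustration; only the `ACT` and `AGE` column names come from the table.

```python
def merge_on_student_id(demographics, other):
    """Left-join two lists of row dicts on the hypothetical key 'student_id'."""
    lookup = {row["student_id"]: row for row in other}
    merged = []
    for row in demographics:
        # Students missing from `other` keep only their demographic columns.
        extra = lookup.get(row["student_id"], {})
        # On a column name clash, the demographic value wins.
        merged.append({**extra, **row})
    return merged


# Made-up rows for illustration only.
demographics = [
    {"student_id": 1, "ACT": 24, "AGE": 19},
    {"student_id": 2, "ACT": 28, "AGE": 21},
]
performance = [{"student_id": 1, "credits_earned": 30}]

merged = merge_on_student_id(demographics, performance)
```

A left join keeps every student in the demographic data even when the other table has no matching row, which avoids silently dropping students before modeling.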

Apache Spark Machine Learning Blueprints by Alex Liu
