Predicting the chances of infant survival with ML

In this section, we will use a portion of the dataset from the previous chapter to introduce the ideas of PySpark ML.

Note

If you have not yet downloaded the data while reading the previous chapter, it can be accessed here: http://www.tomdrabas.com/data/LearningPySpark/births_transformed.csv.gz.

Once again, we will attempt to predict an infant's chances of survival.

Loading the data

First, we load the data with the help of the following code:

import pyspark.sql.types as typ

labels = [
    ('INFANT_ALIVE_AT_REPORT', typ.IntegerType()),
    ('BIRTH_PLACE', typ.StringType()),
    ('MOTHER_AGE_YEARS', typ.IntegerType()),
    ('FATHER_COMBINED_AGE', typ.IntegerType()),
    ('CIG_BEFORE', typ.IntegerType()),
    ...
