The prior summary shows that we have no missing values; however, quite a few variables contain zeros that do not make any sense. For example, it is impossible to have a zero reading for blood pressure, but it is perfectly valid to have a 0 for the number of times pregnant. So, for most of these variables, we will assume that a zero was recorded in place of an NA, and we will remap the data accordingly:
- First, we will copy the data to a new dataframe
- Then, we will change the zeros to NAs for the five variables listed in the code:
# we see that there are 0's which are really NA's
# some 0's are really NA's, we will change them in Spark
# keep pregnant = 0
PimaIndians <- PimaIndiansDiabetes
...
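The two steps above can be illustrated with a minimal base R sketch. It is not the original listing, which is truncated here: it runs on a small invented data frame rather than in Spark, and the five column names (glucose, pressure, triceps, insulin, mass) are an assumption based on the columns of the PimaIndiansDiabetes data where a zero is not physiologically plausible.

```r
# Toy stand-in for PimaIndiansDiabetes (values invented for illustration only)
toy <- data.frame(
  pregnant = c(0, 2, 5),
  glucose  = c(148, 0, 183),
  pressure = c(72, 66, 0),
  triceps  = c(35, 0, 0),
  insulin  = c(0, 94, 0),
  mass     = c(33.6, 0, 23.3)
)

# Step 1: copy the data so the original stays untouched
toy_clean <- toy

# Assumed list of columns where 0 really means "not recorded"
zero_as_na <- c("glucose", "pressure", "triceps", "insulin", "mass")

# Step 2: replace zeros with NA in those columns only;
# pregnant is left alone because 0 is a legitimate value there
toy_clean[zero_as_na] <- lapply(
  toy_clean[zero_as_na],
  function(x) replace(x, x == 0, NA)
)

# Count the NAs that now appear in each column
na_counts <- colSums(is.na(toy_clean))
print(na_counts)
```

Keeping the column list in a single vector makes it easy to adjust if a different set of variables should be treated this way.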