Chapter 9. Improving Predictions

Now that we have deployed working models that predict flight delays, let us “make believe” that user feedback has shown our prediction to be useful, and valuable enough that its quality matters. In that case, it is time to iteratively improve the quality of our prediction. If a prediction is valuable enough, this becomes a full-time job for one or more people.

In this chapter we will tune our Spark ML classifier and also do additional feature engineering to improve prediction quality. In doing so, we will show you how to iteratively improve predictions.

Code examples for this chapter are available at Agile_Data_Code_2/ch09. Clone the repository and follow along!

git clone

Fixing Our Prediction Problem

At this point we realized that our model was always predicting one class, no matter the input. We began by investigating that in a Jupyter notebook at ch09/Debugging Prediction Problems.ipynb.

The notebook itself is very long, and we tried many things to fix our model. It turned out we had made a mistake: we were applying OneHotEncoder to the output of StringIndexerModel when encoding our nominal/categorical string features. This is how you should encode features for models other than decision trees, but for decision tree models you are supposed to take the string indexes from StringIndexerModel and directly compose them with ...
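To make the distinction between the two encodings concrete, here is a minimal pure-Python sketch of what string indexing and one-hot encoding each produce. It is not the Spark API: the helper names `fit_string_index` and `one_hot` and the sample carrier codes are hypothetical, chosen only for illustration. The sketch assigns indexes by descending frequency, which mirrors the default ordering of Spark's StringIndexer, and it keeps every category in the one-hot vector, whereas Spark's OneHotEncoder drops the last category by default.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an integer index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: idx for idx, value in enumerate(ordered)}


def one_hot(index, size):
    """Expand an integer index into a 0/1 vector of length `size`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Made-up sample of a nominal string feature (airline carrier codes).
carriers = ["AA", "DL", "AA", "WN", "DL", "AA"]

mapping = fit_string_index(carriers)

# String-indexed form: one integer per row.
indexed = [mapping[c] for c in carriers]

# One-hot form: one 0/1 vector per row, built on top of the indexes.
encoded = [one_hot(i, len(mapping)) for i in indexed]

print(mapping)
print(indexed)
print(encoded[3])
```

The indexed form keeps each feature in a single numeric column, while the one-hot form spreads it across one column per category. The mistake described above amounts to feeding a decision tree model the second form when it expects the first.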
