Chapter 7. Machine Learning: Logistic Regression in Spark and BigQuery
In Chapter 6, we created a model based on two variables, distance and departure delay, to predict the probability that a flight will be more than 15 minutes late. We found that adding a second variable (distance) gave us a finer-grained decision than using departure delay alone.
Why not use all the variables in the dataset? Or at least many more of them? In particular, I’d like to use the TAXI_OUT variable: a high value means that the flight was stuck on the runway waiting for the airport tower to allow the plane to take off, and so the flight is likely to be delayed. Our approach in Chapter 6 was quite limiting when it came to incorporating additional variables. As we add variables, we would need to keep slicing the dataset into smaller and smaller bins. Many of those bins would then contain very few samples, and the resulting decision surfaces would not be well behaved. Remember that after we binned the data by distance, we found that the departure delay decision boundary was quite well behaved: departure delays above a certain threshold were associated with the flight not arriving on time. Our simplification of the Bayesian classification surface to a simple threshold that varied by bin would not have been possible if the decision boundary had been noisier.¹ The more variables we use, the more bins we will have, and this good behavior will begin to break down. ...
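To get a feel for how quickly the bins empty out, here is a minimal sketch in plain Python. It does not use the flights dataset. It assumes synthetic rows that fall uniformly at random into 10 bins per variable, and the function name, the row count, and the 10-sample cutoff are all illustrative choices rather than values from Chapter 6. Real flight data is unevenly distributed, so the sparsity there would likely be worse than what this uniform case suggests.

```python
import random
from collections import Counter


def sparse_bin_fraction(num_rows, num_vars, bins_per_var=10, min_samples=10, seed=0):
    """Return (possible_bins, fraction of possible bins with fewer than min_samples rows).

    Assumption: each row lands in a uniformly random bin along every variable.
    """
    rng = random.Random(seed)
    # Each row is identified by a tuple of bin indices, one index per variable.
    counts = Counter(
        tuple(rng.randrange(bins_per_var) for _ in range(num_vars))
        for _ in range(num_rows)
    )
    possible_bins = bins_per_var ** num_vars
    # Bins that never appear in counts are empty and therefore also sparse.
    well_populated = sum(1 for c in counts.values() if c >= min_samples)
    return possible_bins, 1 - well_populated / possible_bins


for num_vars in (1, 2, 3, 4, 5):
    possible, sparse = sparse_bin_fraction(num_rows=100_000, num_vars=num_vars)
    print(f"{num_vars} variables: {possible:>7} bins, "
          f"{sparse:.1%} with fewer than 10 samples")
```

The number of bins grows as 10 raised to the number of variables while the number of rows stays fixed, so the average bin population drops by a factor of 10 with each added variable. That is the breakdown described above: with enough variables, most bins hold too few samples to estimate a stable threshold.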