Chapter 3. The Data Sets

This chapter introduces three data sets, shows how to load and prepare each of them, and performs some initial analysis. Later chapters then cover each of the four main supervised machine-learning algorithms that H2O supports (random forest, gradient boosting machines, generalized linear models, and deep learning), and we will try each algorithm on each of these data sets.

The data sets were chosen so that each one introduces something new. The first is a regression problem, the second is a multinomial classification, and the third is flexible but will be used as a binomial classification. The first tests our green credentials, as we try to predict which house designs will be more energy efficient. The second is a well-studied problem in the field of computer vision: recognizing handwritten digits. The third is a sports statistics data set, a time series where we will try to predict future events, specifically who will win a football match. All three data sets fit in the memory of a typical PC, so you will be able to follow along without needing to rent a cluster.

The third data set was compiled for this book, so we spend more time looking at it here, including the process of dealing with messy data, even though this at times takes us away from the core theme of the book, using H2O.
