There was a joke going around, recently, that went like this:
In data science, 80 percent of the time is spent preparing data, and 20 percent of the time is spent complaining about the need to prepare data.1
Sad, but true. H2O provides some functions to make the process a bit easier, but ultimately you are still going to be spending a lot of time finding data sets, understanding them, moaning about them, repairing them, importing them, and then moaning about them some more. However, it won’t be 80 percent of your time any more… The new 80 percent is spent tweaking machine learning parameters and drinking tea while waiting for your neural nets to overfit. (At least until you read about “Early Stopping” in Chapter 4. And “Grid Search” in Chapter 5.)
This chapter will cover getting data into H2O, manipulating data in H2O, and getting data out of H2O. The skills will be used, in context, in later chapters.
To decide how much memory your cluster needs, in total, to build models and run predictions against the full data set, H2O recommends allowing four times the size of the data as it sits in H2O’s memory. As an example, suppose you have 100 million rows, which is 5GB when zipped on disk, and perhaps 10GB in H2O’s memory (it is stored compressed, but not as tightly as a ZIP or GZIP file). Four times 10GB means you need about 40GB of memory. If your cluster is made up of machines with 16GB of memory each, you should be looking at three machines (48GB), though you might get away with two (32GB).
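That arithmetic is easy enough to do in your head, but if you size clusters often it can be handy as a function. Here is a minimal sketch; `cluster_memory_estimate` is a hypothetical helper, not part of the H2O API, and it assumes the four-times rule of thumb and identical machines:

```python
import math

def cluster_memory_estimate(in_memory_gb, node_ram_gb=16, multiplier=4):
    """Return (total GB needed, machine count) using H2O's
    rule of thumb of four times the in-memory data size.
    Hypothetical helper for illustration only."""
    needed_gb = in_memory_gb * multiplier
    nodes = math.ceil(needed_gb / node_ram_gb)
    return needed_gb, nodes

# The example from the text: 10GB of data in H2O's memory,
# machines with 16GB of RAM each.
print(cluster_memory_estimate(10))  # → (40, 3)
```

Note that it rounds the machine count up, so it will always give you three machines for this example; “getting away with two” is a judgment call the function cannot make for you.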
I will introduce ...