Practical Machine Learning with H2O by Darren Cook

Chapter 2. Data Import, Data Export

There was a joke going around, recently, that went like this:

In data science, 80 percent of time is spent in preparing data, 20 percent of time is spent complaining about the need to prepare data.1

Sad, but true. H2O provides some functions to make the process a bit easier, but ultimately you are still going to be spending a lot of time finding data sets, understanding them, moaning about them, repairing them, importing them, and more moaning about them. However, it won’t be 80% of your time any more… The new 80% is spent tweaking machine learning parameters and drinking tea waiting for your neural nets to overfit. (At least until you read about “Early Stopping” in Chapter 4. And “Grid Search” in Chapter 5.)

This chapter will cover getting data into H2O, manipulating data in H2O, and getting data out of H2O. The skills will be used, in context, in later chapters.

Memory Requirements

To decide how much memory your cluster needs, in total, to build models and run predictions against the full data set, H2O recommends four times the size of the data as held in H2O’s memory. As an example, say you have 100 million rows, which is 5GB when zipped on disk and takes up maybe 10GB in H2O’s memory (it is stored compressed, but not as tightly as a ZIP or GZIP file). You therefore need about 40GB of memory. If your cluster is made up of machines with 16GB of memory each, that works out to 2.5 machines’ worth, so you should be looking at using three machines, though you might get away with two.
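The arithmetic in that example can be written out as a minimal Python sketch. The function names and the default multiplier argument are hypothetical, made up for illustration, and are not part of the H2O API. The sketch also assumes, for simplicity, that all of each machine's memory is available to H2O.

```python
import math


def cluster_memory_gb(in_memory_data_gb, multiplier=4):
    """Total cluster memory suggested by the four-times rule of thumb.

    in_memory_data_gb is the size of the data once loaded into H2O,
    not its zipped size on disk.
    """
    return in_memory_data_gb * multiplier


def machines_needed(total_gb, per_machine_gb):
    """Round up: a fraction of a machine still means one more machine."""
    return math.ceil(total_gb / per_machine_gb)


# The figures from the example above: roughly 10GB in H2O's memory,
# and machines with 16GB each (assumed fully available to H2O).
total = cluster_memory_gb(10)
machines = machines_needed(total, 16)
print(f"Need about {total}GB in total, so {machines} machines of 16GB")
```

Rounding up is the cautious choice. The "might get away with two" case corresponds to accepting less than the recommended four-times headroom.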

I will introduce ...