From the previous section, we learned an interesting lesson: for big data, prefer SGD-based learners, because they are faster and they scale.
Now, in this section, let's consider this regression dataset:
The X_train matrix is composed of 200 million elements (2,000,000 observations, each with 100 features), and may not completely fit in memory (on a machine with 4 GB of RAM); the testing set is composed of 10,000 observations.
Let's first create the datasets and print the memory footprint of the biggest one:
In:
# Let's generate a dataset with 2 million training observations
X_train, X_test, y_train, y_test = generate_dataset(2000000, 10000,
                                                    100, 10.0)
print("Size of X_train is [GB]:",
      X_train.size * X_train[0,0].itemsize / 1E9)
...
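The generate_dataset helper is defined earlier in the book. If you don't have it at hand, a minimal stand-in built on scikit-learn's make_regression might look like the following sketch; the signature (n_train, n_test, n_features, noise) is inferred from the call above, and the random_state is an arbitrary choice for reproducibility:

```python
import numpy as np
from sklearn.datasets import make_regression


def generate_dataset(n_train, n_test, n_features, noise):
    # Hypothetical stand-in for the book's helper: build one regression
    # problem, then split it into a training and a testing partition.
    X, y = make_regression(n_samples=n_train + n_test,
                           n_features=n_features,
                           noise=noise,
                           random_state=101)
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]


# Small sizes here just to demonstrate the helper; the text uses
# generate_dataset(2000000, 10000, 100, 10.0), whose X_train alone
# occupies 2e6 * 100 * 8 bytes = 1.6 GB as float64.
X_train, X_test, y_train, y_test = generate_dataset(1000, 100, 100, 10.0)
print("Size of X_train is [GB]:",
      X_train.size * X_train.itemsize / 1E9)
```

With the full-size call, the printed footprint is 1.6 GB: 200 million float64 elements at 8 bytes each.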