Repeatable sampling of data sets in BigQuery for machine learning

An efficient, fast, and repeatable selection method that works on very large data sets.

By Valliappa Lakshmanan

November 14, 2016

Structural repetition. (source: Pixabay)

Doing machine learning on distributed data sets is methodologically similar to working with data that fits in-memory—train your algorithm on a subset of the data, validate on another subset, and finally test with a different subset. In this post, we’ll discuss how to pull data from BigQuery (the no-ops data warehouse that is part of Google Cloud Platform) into machine-learning-ready data sets. We’ll use Airline Ontime Performance data, a 70 million row data set from the U.S. Bureau of Transportation statistics, that is available to all users in BigQuery as the airline_ontime_data.flights data set.

Naively, the easiest way to sample data in BigQuery is to use the built-in random function in a WHERE clause:

Learn faster. Dig deeper. See farther.

Join the O'Reilly online learning platform. Get a free trial today and find answers on the fly, or master something new and useful.

Learn more

SELECT
  date,
  airline,
  departure_airport,
  departure_schedule,
  arrival_airport,
  arrival_delay
FROM
  [bigquery-samples:airline_ontime_data.flights]
WHERE
  RAND() < 0.8

The RAND() function returns a value between 0–1, so approximately 80% of the rows in the data set will be selected by this query. However, there are several problems with using this sampling method for machine learning:

You want to create three data sets: training, validation, and testing, and while you got 80% of the data above, it is not nearly as easy to get the 20% that were not selected, let alone split that data into two parts.
The RAND() function returns different things each time it is run, so if you run the query again, you will get a different 80% of rows. It is not repeatable. This can play havoc if you are experimenting with different machine learning models so that you can choose the best one.
The order of rows in a BigQuery result set is not guaranteed—it is essentially the order in which different workers return their results. So, even if you set a random seed to make RAND() repeatable, you’ll still not get repeatable results. You’d have to add an ORDER BY clause to explicitly sort the data (on an id field) before doing the RAND(). This is not always going to be possible.

A repeatable way to split your data set

For machine learning, you want repeatable sampling of the data you have in BigQuery. One way to achieve this is to use the last few digits of the HASH function on the field that you are using to split your data.

Let’s say that you are building a machine learning algorithm to predict arrival delays. You might want to split up your data by date and get approximately 80% of the days in the data set:

SELECT
  date,
  airline,
  departure_airport,
  departure_schedule,
  arrival_airport,
  arrival_delay
FROM
  [bigquery-samples:airline_ontime_data.flights]
WHERE
  ABS(HASH(date)) % 10 < 8

This is now repeatable—because the HASH function returns the same value any time it is invoked on a specific date, you can be sure you will get the same 80% of data each time. If you want to split your data by arrival_airport (so that 80% of airports are in the training data set), compute the HASH on arrival_airport instead of date.

It is also pretty clear what you have to do to get the validation data: change the < 8 in the query above to ==8, and for testing data, change it to == 9. This way, you get 10% of samples in validation and 10% in testing.

Repeatable sampling

This is good if you want 80% of the entire data set as training, but what if you want a smaller data set than what you have in BigQuery to play around with? The flights data is 70 million rows, and perhaps what you want is a small data set of perhaps a million flights. How would you pick one in 70 flights, and then 80% of those as training?

What you can not do is something along the lines of:

SELECT
  date,
  airline,
  departure_airport,
  departure_schedule,
  arrival_airport,
  arrival_delay
FROM
  [bigquery-samples:airline_ontime_data.flights]
WHERE
  ABS(HASH(date))%70 == 0 AND ABS(HASH(date)) % 10 < 8

You can not pick one in 70 rows and then pick one in 10. Can you figure out why? Well, if you are picking numbers that are divisible by 70, of course they are also going to be divisible by 10! That second modulo operation is useless.

Here’s a better solution:

SELECT
  date,
  airline,
  departure_airport,
  departure_schedule,
  arrival_airport,
  arrival_delay
FROM
  [bigquery-samples:airline_ontime_data.flights]
WHERE
  ABS(HASH(date))%70 == 0 AND ABS(HASH(date)) % 700 < 560

Where the 700 is 70*10 and 560 is 70*8. The first modulo operation picks one in 70 rows and the second modulo operation picks 8 in 10 of those rows.

For validation data, you’d replace < 560 by the appropriate range:

SELECT
  date,
  airline,
  departure_airport,
  departure_schedule,
  arrival_airport,
  arrival_delay,
FROM
  [bigquery-samples:airline_ontime_data.flights]
WHERE
  ABS(HASH(date))%70 == 0 AND ABS(HASH(date)) % 700 >= 560 AND ABS(HASH(date)) % 700 < 640

Note: In the code above, your one million flights come from only 1/70th of the days in the data set. This may be precisely what you want—for example, you may be modeling the full spectrum of flights on a particular day when experimenting with the smaller data set. However, if what you want is 1/70th of the flights on any particular day, use RAND() (as shown in the first code sample) and save the result as a new BigQuery table (for repeatability). From this smaller table, you can sample 80% of dates using HASH(). Because this new table is only one million rows and only for experimentation, the duplication may be acceptable.

Happy coding!

Post topics: Data science