Chapter 5. Fixing a Broken Large-Scale Query

Chapter 4 discussed customized synthetic data as a way to work safely with outsiders when the real data is sensitive and must stay secure, and it introduced log-synth, an open source tool that is a simple but powerful data generator. To demonstrate the effectiveness of this approach, we also introduced two real-world use cases that benefited from log-synth. In this chapter, we go into much more detail about the implementation of one of those use cases: fixing the problem an insurance company encountered when it tried to run a complex Hive query against secure data.

In addition to explaining the insurance company use case more fully, this chapter shows how you can apply this type of example more generically, beyond this particular sector. You should be able to apply the same approach to related situations in your own projects. The point is not just one bug in Hive. More importantly, it is how to use simulated data generated by log-synth to work in a secure environment with complex queries against large-scale relational data.

A Description of the Problem

In the case of the insurance company, the customer had a query that joined more than 20 tables and also included a subquery. This query was itself a simplified version of a larger query that had also caused problems. To reproduce the problem, we needed test data that emulated the structure and scale of the original in order ...
