Now that we have a general plan of action, and before exploring our data, we must first invest in building reusable tools for the early, mundane parts of the exploration pipeline that help us validate data. Only then, as a second step, can we investigate GDELT's content.

Introducing mask-based data profiling

A simple but effective method for quickly exploring new types of data is mask-based data profiling. A mask in this context is a string transformation function that generalizes a data item into a feature. Taken together, the masks produced for a field will have a lower cardinality than the original values in that field.
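To make the idea of a mask concrete, here is a minimal sketch in Python. The particular mapping it uses (uppercase letters to `A`, lowercase letters to `a`, digits to `9`, everything else left unchanged) is an assumption chosen for illustration and is not necessarily the mask defined later in the text. The function name `mask` and the sample values are likewise hypothetical.

```python
import re


def mask(value: str) -> str:
    """Generalize a string into a mask (illustrative mapping, an assumption):
    uppercase letters -> 'A', lowercase letters -> 'a', digits -> '9'.
    Punctuation and whitespace are left unchanged."""
    value = re.sub(r"[A-Z]", "A", value)
    value = re.sub(r"[a-z]", "a", value)
    return re.sub(r"[0-9]", "9", value)


# Hypothetical sample values for a single field.
values = ["2016-01-04", "2016-02-17", "2015-12-31", "n/a"]
masks = [mask(v) for v in values]

# Four distinct raw values collapse into two distinct masks.
distinct_values = set(values)
distinct_masks = set(masks)
print(len(distinct_values), len(distinct_masks))
print(sorted(distinct_masks))
```

With these sample values, the three dates all share one mask while the placeholder `n/a` gets a different one, so the odd value stands out even though every raw value is unique.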

When a column of data is summarized into mask frequency counts, a process commonly called data profiling ...
