Chapter 4. Profiling

Overview of Profiling

We have decided to begin our discussion of data wrangling actions with profiling. This is the first action that people generally undertake when beginning each stage of a data project. Why? Because you need to understand the contents of your data before you can begin transforming or analyzing that data. Fundamentally, profiling guides data transformations.

When you’re working on a data project, you often don’t have time to look at every field of every record. Profiling is the activity that helps you know what is “in” your dataset, and allows you to validate that your transformations work as intended. Often, profiling is used to assess the quality of your data. Profiling is also a crucial aid for data transformation. You frequently need to be able to quickly determine if any records contain data that might cause problems during the transformation process. For example, if your downstream analysis expects each record in a price column to contain numbers, you don’t want to have a record that includes letters or special characters.
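
For instance, a quick check along these lines can be sketched in a few lines of pandas; the column name "price" and the sample values are hypothetical:

    import pandas as pd

    # Hypothetical data: a "price" column that should contain only numbers
    df = pd.DataFrame({"price": ["19.99", "24.50", "N/A", "12,00", "7.25"]})

    # Values that cannot be parsed as numbers become NaN, so they are easy to flag
    parsed = pd.to_numeric(df["price"], errors="coerce")
    problem_records = df[parsed.isna()]
    print(problem_records)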

Profiling can encompass two slightly different views:

  • Examining individual values in your dataset

  • Examining a summary view across multiple values in your dataset

Each of these views can often be consumed as textual information: a list of data values, a table of summary statistics, and so on. You can also build visualizations to capture profiling information about your data.

Ultimately, individual value profiling boils down to determining the validity of individual record field values. This type of profiling comes in two forms: syntactic checks and semantic checks. Syntactic constraints focus on formatting, whereas semantic constraints are rooted in context. Set-based profiling, by contrast, attempts to determine the validity of groups or distributions of values in a particular record field.

Beginning in this chapter, we will be working through an example data wrangling project using publicly available US campaign finance disclosure files. We will be wrangling candidate data and individual contribution data from the 2016 presidential election. The goal of this project is to see if there are any trends in the campaign contributions received by each of the two major candidates, Hillary Clinton and Donald Trump.

In each section, we reference the specific file that we are using so that you can follow the discussion in your data wrangling tool of choice. You can download all of the data files from http://www.fec.gov/finance/disclosure/ftpdet.shtml.

Individual Value Profiling: Syntactic Profiling

Syntax refers to constraints on the literal values that are valid in a field. The set of valid syntactic values might be quite small and best represented in a list; for example, Boolean values encoded as bits, {0, 1}, or birth sex encoded as English words, {male, female}. More often, syntactic constraints represent fairly large sets of possible values. For example, you might have a field that corresponds to the number of ATM cash withdrawals a customer has made in her lifetime (as a customer of your bank) in which the permissible values are integers ranging from 0 to 50000 (where the upper limit is determined by the permissible number of ATM cash withdrawals per day multiplied by the number of days your bank has been in business).

Profiling for syntactic constraints involves simply checking whether data values are in (or not in) the set of permissible values. At the level of specific examples, a good way to understand a syntactic type is to look at a random subset of values that satisfy the syntactic constraints alongside a random subset of values that do not. From these examples, you can often make a reasonable decision as to whether a different syntactic type might be a better fit for the data.
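
A rough sketch of such a check in pandas, reusing the hypothetical ATM-withdrawals field and its permissible range of 0 to 50,000 from the example above:

    import pandas as pd

    # Hypothetical data for the ATM-withdrawals example
    df = pd.DataFrame({"withdrawals": [0, 12, 50001, -3, 4999, 7]})

    # Syntactic check: is each value in the permissible range of integers?
    is_valid = df["withdrawals"].between(0, 50000)

    # Random subset of values that pass the check, plus the values that fail it
    valid_sample = df.loc[is_valid, "withdrawals"].sample(2, random_state=1)
    invalid_values = df.loc[~is_valid, "withdrawals"]
    print(valid_sample.tolist(), invalid_values.tolist())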

Individual Value Profiling: Semantic Profiling

Semantic type constraints correspond to the meaning or interpretation of field values: values are valid if their interpretations satisfy the constraints. For example, suppose that a dataset has a field corresponding to the biological age, in years, of a customer. For some customers, suppose that we don’t have a reported age, and the age field for the records corresponding to these customers contains the value -1. Although syntactically not a valid age in years, semantically this value can be interpreted as a Boolean indicator for whether or not the customer reported their age. A new field derived from this interpretation, indicating whether each customer reported an age, might be exactly the variable we need to analyze how customers’ willingness to report information predicts overall satisfaction.
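
A minimal sketch of this derivation, assuming a hypothetical "age" column that uses -1 as its sentinel value:

    import pandas as pd

    # Hypothetical customer data with -1 standing in for "age not reported"
    customers = pd.DataFrame({"age": [34, -1, 52, -1, 27]})

    # Semantic interpretation: derive a Boolean field for whether age was reported
    customers["reported_age"] = customers["age"] != -1
    print(customers["reported_age"].value_counts())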

As in the preceding example, the semantic interpretation of a field value might override syntactic constraints. As another example, consider a field encoding hometown/city. Suppose that some of the field values contain the strings “San Francisco, CA USA” and “Sn Francisco, CA, USA” and “San Francicso, CA, USA.” While some of these are syntactically invalid, they are semantically clear and could be converted to their syntactically correct versions. Alternatively, suppose that the field values contain the string “Moscow.” Should that be interpreted as Moscow, Russia, or Moscow, ID, USA? Perhaps there are other fields in the records that could disambiguate between these cases, but on their own, these values would be semantically invalid.

In other cases, the semantic interpretation of a field might involve a simple conversion from one categorical space to another; for example, age in years to basic life stages (e.g., teenager, young adult, senior), or time of day to basic day stage (e.g., morning, afternoon, evening, and night).

Profiling semantic type constraints often requires deriving a new record field that explicitly encodes the semantic interpretation of a source field. This explicit encoding can then be syntactically typed, where validity can be determined by testing whether the value is in the valid set. Thus, as with syntactic types, the most appropriate summary statistics correspond to the percentage of field values that are valid, invalid, and empty/null.

The common, and more straightforward, case uses deterministic rules for converting a source field to its interpreted value. For example, you might define a rule stating that anyone between the ages of 13 and 19, inclusive, is a teenager. The more difficult case involves nondeterministic, or probabilistic, mappings from source values to interpretations. For example, based on summary demographics that we know about our dataset, we might interpret “Moscow” as 80 percent likely to represent Moscow, Russia, and 20 percent likely to represent Moscow, ID, USA.
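
A minimal sketch of such a deterministic mapping; the teenager rule comes from the text above, while the other life-stage boundaries are assumptions chosen for illustration:

    import pandas as pd

    # Hypothetical ages to interpret
    customers = pd.DataFrame({"age": [9, 15, 19, 34, 71]})

    # Deterministic rule: 13-19 inclusive maps to "teenager"; other cutoffs are assumed
    bins = [0, 12, 19, 39, 64, 120]
    labels = ["child", "teenager", "young adult", "adult", "senior"]
    customers["life_stage"] = pd.cut(customers["age"], bins=bins, labels=labels)
    print(customers)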

Set-Based Profiling

Set-based profiling focuses on the shape and extent of the distribution of values found within a single record field or in the range of relationships between multiple record fields. For numeric fields, distributional profiling often builds from a simple histogram of the set of values and might involve comparing that histogram against a known distribution like a Poisson or Gaussian probability distribution.

In addition to looking at the overall distribution, it helps to look at various summary statistics like minimum, maximum, mean, and sum. These values can provide you with an understanding of the distribution of values across your dataset and help you to immediately identify any problematic distributions or outliers.
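
A minimal sketch of this kind of numeric profiling, using a hypothetical "amount" column (the histogram call assumes matplotlib is available):

    import pandas as pd

    # Hypothetical numeric field with one obvious outlier
    df = pd.DataFrame({"amount": [25, 100, 250, 2700, 50, 35, 1000000]})

    # Summary statistics: minimum, maximum, mean, sum
    print(df["amount"].agg(["min", "max", "mean", "sum"]))

    # Histogram of the value distribution (requires matplotlib); outliers stand out
    ax = df["amount"].plot.hist(bins=20)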

For categorical record fields, a couple of different distributional profiles are useful. The first counts occurrences of each unique value. Another useful profile clusters the raw values, via a mechanism similar to standardization that groups obvious variants of the same value, and then counts the number of values associated with each cluster.
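
A minimal sketch of both profiles, using hypothetical city values and simple lowercasing and trimming as a stand-in for a real standardization step:

    import pandas as pd

    # Hypothetical categorical field with obvious case and spacing variants
    cities = pd.Series(["San Francisco", "san francisco ", "Oakland", "OAKLAND", "San Jose"])

    # Profile 1: occurrences of each unique raw value
    print(cities.value_counts())

    # Profile 2: counts per cluster after a simple standardization
    clustered = cities.str.strip().str.lower()
    print(clustered.value_counts())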

For more specific types of record fields, there are more specific profiling charts. For example, for geospatial data like zip codes or latitude-longitude coordinates, plotting the data on a map is appropriate.

For date-time data, it is useful to see the values plotted on a variety of scales. You could examine the distribution of date-time values across the 24-hour day, across the 7 days of the week, or across the 12 months of the year.
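
A minimal sketch of these three views, with hypothetical timestamps:

    import pandas as pd

    # Hypothetical date-time values
    events = pd.Series(pd.to_datetime([
        "2016-01-15 08:30", "2016-03-02 14:05", "2016-03-28 23:45", "2016-07-04 09:10",
    ]))

    print(events.dt.hour.value_counts().sort_index())       # across the 24-hour day
    print(events.dt.dayofweek.value_counts().sort_index())  # across the 7 days of the week
    print(events.dt.month.value_counts().sort_index())      # across the 12 months of the year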

In addition to distributional profiling of the set of values from a single record field, cross-distributional profiling of values from multiple record fields is often useful in assessing the quality of the overall dataset. Simple scatterplots are commonly utilized.
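
For example, a scatterplot of two hypothetical numeric fields takes only a couple of lines (again assuming matplotlib is available):

    import pandas as pd

    # Hypothetical pair of numeric fields to compare
    df = pd.DataFrame({"donor_age":    [34, 51, 29, 62, 45],
                       "contribution": [25, 100, 250, 2700, 50]})

    # Cross-distributional profile: plot one field against the other
    ax = df.plot.scatter(x="donor_age", y="contribution")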

Profiling Individual Values in the Candidate Master File

Note

To follow along yourself, download the “Candidate Master File” for election year 2015–2016.

You also might need to reference the data dictionary for this file to understand the permissible values in each field. You can find the data dictionary online.

As noted in the introduction to this chapter, we will be attempting to wrangle US campaign finance data to build a refined dataset that will help answer the question: “How were individual campaign contributions distributed between the two 2016 major party US presidential candidates?”

Profiling the available datasets is the first step toward constructing a refined dataset that maps individual campaign contributions to each 2016 presidential candidate. The profiling step is particularly important for this project because the data is largely unknown. Data projects that use publicly available data often require a fairly lengthy profiling process to familiarize yourself with the contents of the dataset.

To help the public understand the contents of these datasets, the US Federal Election Commission (FEC) has provided a data dictionary to supplement the raw candidate and contributions files. We can use the FEC’s data dictionary to inform our profiling of individual values. If you examine the data dictionary, you can see the constraints that define permissible values for each column in the dataset.

Note

Data dictionaries are frequently generated as the output of the metadata-producing actions during the raw data stage. Data dictionaries define the permissible valid values in each column and can also explain the contents of the dataset.

We’re going to begin by looking at the “Candidate Master File” for election year 2015–2016. In case you’re not familiar with US election rules, this file contains information about all of the candidates who filed the appropriate form to be included in an upcoming election. It also includes candidates with active election committees, even if the election that they have registered for is not the current election. As you examine this file, you might also notice some creative candidate names; even though these candidates submitted paperwork to be included in an election, they might not have actually appeared on individual state ballots.

We want to use profiling to answer two specific questions:

  • Are there any values in columns that are syntactically invalid given the scope of our project?

  • Is the range of values in each column valid given the scope of our project?

Syntactic Profiling in the Candidate Master File

Let’s begin by profiling the two US state columns. We want to assess whether the values contained in those columns are syntactically valid. In the FEC’s data dictionary, column 5 represents the state of the office for which the candidate is running, and column 14 represents the state of the candidate’s mailing address.

Focusing on column 5, we can begin by collecting a list of all the unique values in the column. This is a common first step when performing syntactic profiling. This operation produces a list of 57 unique values. We know that there are 50 US states that have voting representatives in the US Congress, and 5 US territories and the District of Columbia with nonvoting representatives in the US Congress. Additionally, if we look at the data dictionary, we can see that column 5 can contain a nonstate value, “US,” in records that represent candidates who are running for president. So, at first glance, it seems reasonable that there would be 57 possible syntactically valid locations in column 5.

We can dig a little deeper and examine each value in column 5 individually to see if it matches one of the 50 state abbreviations, the abbreviations for the District of Columbia and the 5 territories, or “US.” We performed this check by using a lookup to a reference dataset of all 57 valid values. In fields that matched one of the 57 valid values, we inserted a “1,” and in fields that did not match one of the 57 valid values, we inserted a “0.” Ultimately, all the values in this column are syntactically valid; in our Boolean indicator column, 100 percent of the records contained “1.”
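
A minimal sketch of this check in pandas, assuming the extracted Candidate Master File is the pipe-delimited, headerless text file (commonly named cn.txt) from the FEC bulk data site:

    import pandas as pd

    # Load the Candidate Master File; the filename and delimiter are assumptions
    candidates = pd.read_csv("cn.txt", sep="|", header=None, dtype=str)

    office_state = candidates[4]      # column 5 (0-indexed position 4): office state
    print(office_state.nunique())     # expect 57 unique values

    # Reference set of the 57 valid values: 50 states, DC, 5 territories, and "US"
    states = {"AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN",
              "IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV",
              "NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN",
              "TX","UT","VT","VA","WA","WV","WI","WY"}
    valid_values = states | {"DC", "AS", "GU", "MP", "PR", "VI", "US"}

    # Boolean indicator column: 1 if the value is permissible, 0 if not
    candidates["office_state_valid"] = office_state.isin(valid_values).astype(int)
    print(candidates["office_state_valid"].value_counts(normalize=True))  # expect 100% "1"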

Because column 14 also contains state abbreviations, we can perform a similar set of profiling checks on this column. Again, a count of all the distinct values in this column reveals that there are 57 possible values. However, because this column represents a mailing address, there are only 56 possible valid values: 50 US state abbreviations, “DC” for the District of Columbia, and 5 US territory abbreviations. At first glance, we can assume that at least some of the records contain syntactically invalid entries in column 14. In addition to 57 distinct values, this column also contains missing values. We can consider missing values syntactically invalid because the data dictionary does not indicate that missing values should appear in this column.

We’ll use the same procedure that we applied when profiling column 5 to see which of the individual values in column 14 are syntactically valid. Performing a lookup to a reference table and generating a Boolean indicator column shows that there is a single record that contains an erroneous state: “ZZ.”
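
Continuing the sketch above, the same lookup for column 14, where “US” is not permissible and missing values count as invalid:

    # Column 14 (0-indexed position 13): candidate mailing-address state
    mailing_state = candidates[13]
    mailing_valid_values = valid_values - {"US"}   # the 56 permissible values

    # Missing values fail isin(), so they are flagged as invalid (0) along with "ZZ"
    candidates["mailing_state_valid"] = mailing_state.isin(mailing_valid_values).astype(int)
    print(mailing_state[candidates["mailing_state_valid"] == 0].unique())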

You can perform a similar set of syntactic checks on the other columns in the Candidate Master File. We recommend generating a series of Boolean indicator columns to show whether the values in each record are permissible given the constraints defined in the FEC’s data dictionary.

Set-Based Profiling in the Candidate Master File

Let’s profile the distribution of values in column 4 of the Candidate Master File. According to the data dictionary, this column represents the year of the election for which each candidate registered. Since this dataset can include candidates for any election with active campaign committees, we would expect to see the years distributed so that there are relatively few records for elections prior to 2016, a large number of records for the 2016 election year, and possibly a small number of records that represent future elections (perhaps 2018 or 2020).

After you’ve generated a summary view that counts the number of records that occur in each year, you should see a very wide range of values in column 4. The earliest recorded year is 1990; the year farthest in the future is 2064.
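
Continuing with the candidates DataFrame loaded earlier, a minimal sketch of that summary view over column 4:

    # Column 4 (0-indexed position 3): election year for which the candidate registered
    election_year = pd.to_numeric(candidates[3], errors="coerce")

    print(election_year.value_counts().sort_index())  # record counts per year
    print(election_year.min(), election_year.max())   # expect 1990 and 2064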

At this point, we would recommend stepping back to determine the utility of records in this column. If you remember our discussion in Chapter 2, assessing the utility of your data involves generating custom metadata, or metadata specific to your use case. That means that we should assess the distribution of the values in column 4 in the context of our specific project to see how many of these records are relevant to our analysis. The goal of this project is to see if there are any trends in the campaign contributions received by each of the two major candidates in the 2016 presidential election, Hillary Clinton and Donald Trump. Since we’re interested in only the 2016 presidential election, records that represent candidates registered for elections in 1990 or 2064 are ultimately irrelevant to our task. We can insert additional metadata into our dataset at this stage, perhaps flagging records that contain a value other than “2016” in column 4 as invalid.
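
One way to sketch that flag, continuing the example above:

    # Custom metadata for this project: only 2016 registrations are relevant
    candidates["relevant_election_yr"] = (election_year == 2016).astype(int)
    print(candidates["relevant_election_yr"].value_counts())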

You can profile the values in column 6 of the Candidate Master File in a similar way. This column represents each candidate’s desired office. Based on our project’s constraints, attempt to make an assessment about the utility of each category of values in this column.
