We often talk in business and computing about moving from “raw data” to “knowledge,” hoping to take useful actions based on the data our organization has collected over time. Before one can view trends in your data or do other analytics, you need tools for data cleaning and for combining multiple data sources into meaningful collections of information, known as entities. An entity may be a customer, a product, a point of sale, an incident being investigated by the police, or anything else around which you want to build meaningful context.

In this post, we’ll explore some of the complexities in real-life data that create headaches — and how analytical software can help users prepare data for sophisticated queries and visualizations.

Data cleaning

No matter what the data’s source and how carefully the data was collected, it usually ends up being incredibly inconsistent. Besides routine typing errors or minor issues such as whether someone’s name is capitalized MacIntyre or Macintyre, people can go so far as to enter data in the wrong fields. For instance, suppose you find letters in a phone number field. Does this mean someone typed in something meant for another field? Or is it a legitimate phone number that happens to be spelled out as letters, such as 1-800-FLOWERS?

Well-established rules exist for cleaning various types of data, but before jumping in and applying a rule — for instance, deleting any phone number that contains letters — analytics software should bring in the user to make a decision. It should offer statistics on the data, such as what percentage looks like a phone number and what percentage has other data or is empty. The user can then decide which rule to apply. The user might also notice that a whole field was shifted during translation, so that all names are in the phone number field. In such cases, the user can simply rename the field in the output.

Rules can also be applied with greater or lesser strictness. One data analysis engine, Novetta Entity Analytics, approaches this problem by letting its users determine a threshold value, or degree of strictness. A higher threshold value will produce fewer errors (false positives) but risks eliminating some good data (false negatives). A lower threshold value has the reverse effect. The user has to decide how to balance the two risks.

For instance, if you’re serving up ads, you might be willing to accept a less-strict threshold and serve up ads that don’t interest users. If you’re transferring large sums of money, you’ll probably opt for a strict threshold so you don’t give somebody money unintentionally.

The ideal threshold value is the dividing point: above this value, an individual entity can be reliably identified, while below this value, an individual is no longer unique. A dashboard shows the user both the curve and records excluded, allowing the user to understand the impact of picking different values on the graph. You can choose a threshold value by looking at how many records would be excluded if a value point was set for a threshold at that particular rule, or for different choices on the curve — a measure some tools call the uniqueness value.

Data cleaning usually can’t depend on one field in isolation. For instance, if one data set lists Bob Girofsky and another lists Robert Girofsky, it may be worth checking whether they are associated with the same phone number or address in order to do a match. Over time, rules have been accumulated for making fuzzy matches, such as recognizing common nicknames like Bob.

Similarly, if a person has the phone number 123-4567 in one database and 213-4567 in another, you can guess that it’s the same person but that someone switched two digits by mistake when typing one number in. In short, fuzzy matching works together with comparisons that include multiple fields in a record.

Other rules can check for bogus data. Someone downloading a white paper might enter Mickey Mouse into his name field simply to satisfy the program accepting the form. Rules can look for common tricks like that and eliminate bad data. Software can also account for cultural differences to help data users avoid making mistakes based on a lack of cultural understanding. For instance, Abcde looks like a fake name unless you know that in some communities it is fairly popular (it is pronounced “absedee”).

Multiple fields may also have to be combined just because data was stored in a structured format. A person’s address might be stored in five fields called Street, City, State, Country, and ZIP Code. A phone number could be split into Country, Area Code, and Local fields.

Combining multiple data sources

You don’t want to just blindly slurp huge tables into your analytics, so tradeoffs must always be made when choosing which fields to combine into entities. For instance, if you are tracking sales in order to plan shipments, you might just record the point of sale. But if you want to check how efficient individual cashiers are, you need the ID of the cashier at that point of sale.

Usually, data that helps you understand an entity will come from different tables or spreadsheets. Relational data is especially prone to being broken up, because the process of normalization usually separates a customer from each individual purchase she makes, and separates the purchase from the information about the products she bought. To understand a customer, you probably want to know the dates of recent purchases and the products bought, so you could easily pull data from at least three different tables into your Customer entity.

When joining databases (sometimes hundreds of them) it helps to keep track of the data source each row came from, and the primary key within the data source. For instance, suppose a user decided not to request credit card information when pulling customers into a table, but wants to check whether one customer has a credit card, after doing the analysis. Because the table has the data source and primary key, the original database can easily be queried to pull out the credit card number. This lets you trace every record and attribute back to its original source — a useful trait because sometimes information gets lost in translation.

This post is part of a collaboration between O’Reilly and Novetta Solutions. See our statement of editorial independence.

For more on this topic, check out the free archived webcast, Entity Resolution on Hadoop: The Pitfalls of Building it Yourself, presented by Dave Moore, a solutions architect for commercial markets at Novetta Solutions, and Jenn Reed, director of product management at Novetta Entity Analytics.