Chapter 8. Bad Characters

How often do you think about letters? Numbers? Punctuation? Whitespace? Character encoding? Character sets? Collation sequences? The differences between CHAR and VARCHAR and NVARCHAR and TEXT? If you are used to working with textual data, then maybe a lot. However, many people, even those working every day with SQL or other programming languages, don’t think about such things until forced to. Let’s force ourselves to, a little, because it matters to fuzzy data matching, especially the incoming data.

Data Representations

There are many ways you can receive the incoming data, but typically if it is from an external entity, it is probably going to end up coming in as follows:

A “flat file”

A comma-separated values (CSV) file or other simple delimited format like tab or | delimiters (all very common, but CSV is by far the most common)

Excel

Not as frequent

XML

Especially from vendors

JSON

Ditto

Proprietary or standardized formats

HL7 for health, M13 for crop insurance, and so on

If you’re lucky, you will get a schema or doc describing the fields, their formats, the acceptable data ranges, etc. Often you just have to crack open the file and look at the column names and the data. For our purposes we’ll presume we are receiving a CSV file.

How that CSV file gets into your database server so you can compare it against your customer data is another story. Are you responsible for doing it, or is it someone else with the appropriate privileges? If the ...

Get Fuzzy Data Matching with SQL now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.