Chapter 8. Bad Characters
How often do you think about letters? Numbers? Punctuation? Whitespace? Character encoding? Character sets? Collation sequences? The differences between CHAR
and VARCHAR
and NVARCHAR
and TEXT
? If you are used to working with textual data, then maybe a lot. However, many people, even those working every day with SQL or other programming languages, don’t think about such things until forced to. Let’s force ourselves to, a little, because it matters to fuzzy data matching, especially the incoming data.
Data Representations
There are many ways you can receive the incoming data, but typically if it is from an external entity, it is probably going to end up coming in as follows:
- A “flat file”
-
A comma-separated values (CSV) file or other simple delimited format like tab or
|
delimiters (all very common, but CSV is by far the most common) - Excel
-
Not as frequent
- XML
-
Especially from vendors
- JSON
-
Ditto
- Proprietary or standardized formats
-
HL7 for health, M13 for crop insurance, and so on
If you’re lucky, you will get a schema or doc describing the fields, their formats, the acceptable data ranges, etc. Often you just have to crack open the file and look at the column names and the data. For our purposes we’ll presume we are receiving a CSV file.
How that CSV file gets into your database server so you can compare it against your customer data is another story. Are you responsible for doing it, or is it someone else with the appropriate privileges? If the ...
Get Fuzzy Data Matching with SQL now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.