Top Ten Data Crunching Tips and Tricks
by Greg Wilson, author of Data Crunching: Solve Everyday Problems Using Java, Python, and More06/09/2005
Every day, all over the world, programmers have to recycle legacy data, translate from one vendor's proprietary format into another's, check configuration files, and yank data out of web server logs. This kind of programming is usually called data crunching, and while it's not glamorous, knowing how to do it with the least amount of effort can make the difference between meeting a deadline and making another pot of coffee. These ten tips will take the headache out of crunching data.
1. Master the Classic Tools
In an age of 3D window managers and web-based everything, it's easy to forget
that the Unix command-line model of pipes, filters, and redirection is still
the best way to solve most data crunching problems. Almost everything you'll
ever want to do to text data has been done before, so it pays to master the
more powerful tools like cut, grep, sed, find,
and xargs.
Similarly, you should learn how to use a programmable cross-platform editor, such as Emacs or Vim, since many simple crunching tasks can be solved simply by recording macros and playing them back. If you're working in a pure-Microsoft environment, learn enough Visual Basic to script the Visual Studio editor. (The easiest way to do this is to record a few simple operations, then look at the code Visual Studio has written for you.)
One thing I don't think it's worth learning any longer is advanced shell programming. When you have serious crunching to do, you're probably going to want to do at least part of it in Python, Ruby, or some other freely available scripting language, so you might as well start there in the first place.
2. Separate Input, Processing, and Output
You'll occasionally be tempted to solve simple data crunching problems with one-off scripts that transform each record as it's read. Don't give in---experience shows that "one-offs" usually hang around longer than their authors expect, and that simple transformations usually become more complex over time. Instead, always write data crunchers in three parts: one that reads the input into memory, a second that transforms the data, and a third that writes out the data structures. Programs structured this way are a lot simpler to debug, and their parts are easier to modify and reuse.
And while we're on the subject of reuse: every software project's version control repository should have a misc or tools directory, and every little cruncher you write should be saved there. If nothing else, the odds and ends you put there can serve as starting points for the next script you have to write.
3. Store Format Information in the Data Itself
Programs come and go, but data lives forever. Any time you're working with data that isn't in a well-documented standard format, you should therefore try to include format information in the data itself, so that it can't ever be lost. If you're working with binary data, store the format template for each record at the start of the data; for example, if you're archiving a database table whose rows contain an 80-character string, two integers, and a variable-length string, you might store this:
680c2iS
before the first record. The 6 tells whoever's reading the data that the
format specifier is six characters long; those six characters tell the reader
to look for 80 characters, two integers, and a null-terminated block of text.
|
Related Reading
Data Crunching |
If you're making up a new text-based format, allow for headers (like the meta tags
in an HTML page's head section), or put format information in
specially structured comments, so that whoever has to read your legacy files
ten years from now will know how to parse them.
4. Understand International Character Sets
Do you speak Thai? Russian? Arabic? Thanks to the internet, the odds are good that at least a few of your users do, but if you mangle their data, they may not be your users for long. It's therefore essential for your data crunchers to handle international character sets correctly. In particular:
- Make sure you know how data has been encoded; you can't read even the simplest text file unless you know whether it's in UTF-8, UCS-4, or some legacy format. And if you get to choose a format, always choose UTF-8: it encodes the original 7-bit ASCII characters using the values 0-127, just like ASCII did, which improves the odds that your output won't confuse tools (and programmers) who aren't yet thinking about these issues.
- Don't assume that one character equals one byte (or two bytes, or four).
Many character set encodings (particularly the most common, UTF-8) use a
variable number of bytes for each character; one side effect of this is that
strlen()and other functions from the standard C library (and everything that depends upon them) can no longer be trusted. - When writing regular expressions, use character blocks, like
\p{InCyrillic}, and character categories, like\p{Lu}(meaning "lower-case letters"), since these will work no matter what language your text is in, or how it has been encoded.
5. Use Reluctant Matches and Graphical Tools
While we're on the topic of regular expressions, suppose you're trying to extract currency amounts from an email archive. Your first attempt uses the following regular expression:
^(.*)(\d+\.\d{2})(.*)$
which matches:
- The start of the line
- Zero or more characters
- One or more digits, followed by a literal
., followed by exactly two digits - Zero or more characters
- The end of the line
The problem is, when you apply this pattern to the text:
Charge $229.95, and get a receipt.
the middle group only matches 9.95. The reason is that regular
expressions are greedy: each part of the pattern tries to match as
much text as it can, leaving as little as possible for subsequent pattern elements.
Most regular expression (RE) packages now support reluctant (also called generous)
patterns, which match as little text as possible. Each of the operators ?, *,
and + has a reluctant counterpart: ??, *?,
and +?. Making the first group in the pattern reluctant:
^(.*?)(\d+\.\d{2})(.*)$
solves our problem.
As well as illustrating reluctant matches, this example also shows why regular expressions can be so hard to debug: adding one character completely changes the way the RE behaves. This is why I'm very fond of graphical tools for building and debugging REs, such as Edi Weitz's Regex Coach, JRegexpTester, and the Rx Toolkit in ActiveState's Komodo IDE.
6. Nest and Subtract to Negate
Suppose you have a database that records which programmers are assigned to which projects, and you want a list of everyone who isn't working on version 2.1 of Frib. You might try this:
SELECT Assigned.EmployeeId
FROM Assigned, ProjectInfo
WHERE Assigned.ProjectId = ProjectInfo.ProjectId
AND ProjectInfo.ProjectName != "Frib 2.1";
However, if Alan Turing is working on both Frib 2.1 and Snab 3.0, his name will appear in this query's results.
The right way to tackle this problem is to use a nested query to select the records you don't want, and then subtract them from the set of all possible records. It feels clumsy the first few times you do it, but---oh, who am I kidding? It always feels clumsy, but it's what you have to do.
Here's the complete query:
SELECT Assigned.EmployeeId
FROM Assigned
WHERE Assigned.EmployeeId NOT IN
(SELECT Assigned.EmployeeId
FROM Assigned, ProjectInfo
WHERE Assigned.ProjectId = ProjectInfo.ProjectId
AND ProjectInfo.ProjectName = "Frib 2.1");
The nested query finds all employees who are working on Frib 2.1; the outer query then removes those IDs from the set of all employee IDs.
Depending on which database manager you're using, it may help to turn the
nested query into a view. A view acts like a stored SELECT statement
whose result can be treated as a temporary table. In this case, we could use:
CREATE VIEW WorkingOnFrob AS
SELECT Assigned.EmployeeId AS EmployeeId
FROM Assigned, ProjectInfo
WHERE Assigned.ProjectId = ProjectInfo.ProjectId
AND ProjectInfo.ProjectName = "Frib 2.1";
SELECT Assigned.EmployeeId
FROM Assigned, WorkingOnFrob
WHERE Assigned.EmployeeId NOT IN WorkingOnFrob.EmployeeId;
7. Check for Holes
In the real world, data is never complete: there are always people who don't
have a fax number or whose blood type is unknown. As a result, the data you're
crunching may have holes. If you're lucky, these will be marked explicitly---many
relational databases, for example, use NULL to indicate missing
values. If you're unlucky, you'll have to figure out when an empty string or
-1 means what it says, and when it means that whoever created the data didn't
know what else to put.
It's particularly important to keep holes in mind when combining data. For example, if you've been asked to calculate the average sales price for a set of books, and some of them were given away as part of a promotional deal, you'll have to decide whether to count them as $0.00, or leave them out entirely.
The SQL standard tries to help programmers handle missing data by specifying
that any operation involving NULL returns NULL as
a result; i.e., 5 + NULL is NULL, as is (X < 0) OR (Y > 100) if
either X or Y is NULL. Unfortunately,
not all databases implement this correctly: in some, TRUE OR NULL is TRUE,
while FALSE AND NULL is FALSE. Having been
bitten by this a couple of times, I now check for records containing NULL before
doing any other crunching.
8. Use XSLT (with Caveats)
XSLT is an easy way to specify simple transformations, but only when your input is highly structured (that is, where you don't need any information about context in order to interpret particular elements). However, the less structured your documents are--that is, the easier they are for human beings to type in and read---the more complex your transformation rules have to be, and experience shows that the difficulty of debugging XSLT transformations grows very quickly.
For example, transforming this:
<b1><t>XSLT is a simple way to solve simple problems</t>
<b2><t>But complex scripts are fiendishly difficult to debug</t></b2>
</b1>
<b1><t>Options:</t>
<b2><t>Put more structural information in your input.</t>
<b3><t>But this makes it harder for people to create and read.</t></b3>
</b2>
<b2><t>Use a different tool.</t></b2>
</b1>
is almost trivial, since the text for each bullet point is wrapped in <t>...</t>.
Take that away, so that whoever's creating the data has less typing to do:
<b1>XSLT is a simple way to solve simple problems
<b2>But complex scripts are fiendishly difficult to debug</b2>
</b1>
<b1>Options:
<b2>Put more structural information in your input.
<b3>But this makes it harder for people to create and read.</b3>
</b2>
<b2>Use a different tool.</b2>
</b1>
and the XSLT transformation nearly triples in complexity.
9. Use XPath Without Using XSLT
XPath is a companion standard to XSLT, so you can use it without XSLT. XPath
lets you identity parts of an XML document using a notation similar to that
used for paths and filenames in the shell. For example, /project/* means, "All
children of project elements," while ticket[@status='open']/author means, "Authors
of tickets whose status attribute has the value 'open'." Several
modern XML libraries, such as Fredrik Lundh's ElementTree package
for Python, include XPath; data crunchers that use it are a lot easier to read
(and write) than ones that walk trees themselves, and can sometimes be faster,
too.
10. Check Your Work
If you don't know how to check your work, it's almost certainly wrong. Like everything you write, data crunchers should be tested. Sometimes, this is as simple as running it on the data you actually want to transform, and eyeballing the results; in more complex situations, you may want to select some subsets of your actual data, and make sure that each is handled correctly.
The most important thing, however, is figuring out how to tell the right answer from a wrong one. If you don't know enough about your input and output to be able to look at a specific case and say, "Wait a second, that's not right," then the odds are good that you don't know enough about your problem to write the transformation correctly. And please--when you do figure it out, write it down somewhere, so that the next time someone has to wrestle with the problem, it'll take them less time to figure it out.
Greg Wilson has been crunching data for more than 20 years. He is an independent programming consultant, an adjunct professor at the University of Toronto, and a contributing editor with Doctor Dobb's Journal. He is the author of Data Crunching and Practical Parallel Programming.
Return to ONLamp.com.
You must be logged in to the O'Reilly Network to post a talkback.
Showing messages 1 through 11 of 11.
-
On the tools to learn
2005-08-09 00:04:12 Paddy3118 [Reply | View]
I do most of my datatransformation/crunching work on Unix.
I do beleive you should learn the tools available but I would change the list:
I have not found much use for cut over the years
Because I'm that old, I first learned AWK and
still use it extensively, even though I also
know perl.
If your writting out data then try and write it
in a way suitable for an AWK program to read it.
I've always found that AWK-able data formats are
easy to parse in perl, python, tcl, bash, and C.
(This assumes that you can set the data output
format.
A bit of context: I'm in the Electronic Design Automation field, where a lot of tools come with in-built tcl interpreters, data is often in indented lisp-like lists, AWK-able tables, and machine generated programming code in Verilog or VHDL.
- Paddy.
-
Data crunching? Crunching? Naw...
2005-06-16 06:02:14 ScarletKnight [Reply | View]
... I'm a former mainframe programmer. You don't crunch anything close to data here. Those tips are definitely less than useful. Simple SQL tips for data crunching? Give me a break. I thought I was going to read about how to optimize your transforms for speed without losing maintainability, or even some references to offload techniques to stay out of the database when it doesn't make sense, i.e. knowing the limits of set-oriented transforms versus proecedural ones.
Thumbs down. -
Data crunching? Crunching? Naw...
2005-06-22 05:46:02 GregWilson [Reply | View]
I'm sorry you think I'm not crunching "anything close to data" here. I routinely use these techniques to do one-time reformatting of gigabyte (and larger) scientific data sets. That may no longer be as impressive as it once was (I can still remember the first time I saw a file that was more than a megabyte long ;-), but as I say a couple of times in the book, my focus is on the things you can do in a few minutes to handle everyday tasks, not on what you'd do to run Google or American Express. -
Data crunching? Crunching? Naw...
2005-06-19 23:33:36 alcabon [Reply | View]
I m not agree with you at all. This article is a clear synthesis about the common techniques of data crunching. I was working a long time as a mainframe programmer (IBM MVS). Mainframe means large company and often high license costs for tools like SAS for example (150.000 euros/year). With SAS, you can also do data crunching (sql + SAS language) but you cannot write a large audience book saying : "you must use SAS because it is a good tool for data crunching". With unix, large company can also pay the license for Informatica for example (I used it also during months) for EDI especially but the cost is also prohibitive for small companies. I am working now in an simple unix-windows intranet environment (without SAS, Syncsort or Informatica) and Greg Wilson thinks straigth. I had started to use computer with a ZX81 16ko and 3,5 Hz now I ve got a Pentium 4-1024 Mo RAM and 120 go disk. Using memory is completely different in this condition. For offload techniques, it is a DBA problem related to your RDBMS. With Cloudscape (free and 2 Mo size footprint), I succeded import (csv files) and export with modifications of formats (using sql requests and functions) the data of millions of rows under some minutes using my simple PC. It is very easy to do. The most important for this work of data crunching with RDBMS is to write the straight sql code.
Best regards.
-
In reply to "read the input into memory" considered harmful
2005-06-14 08:55:44 jtaylor247 [Reply | View]
What Greg suggested does not neccessitate a script loading all records into memory first. If you have three functions read(), transform(), and write() you can simply apply these in a read() transform() write() loop that loops once per record leaving only one record in memory at a time.
-
"read the input into memory" considered harmful
2005-06-14 08:28:21 johnsaalwaechter [Reply | View]
I don't agree with point #2, especially the advice to read the input into memory. I work in an environment with large data sets, and I've seen many instances where a script that reads the input into memory is developed on small test data, then unleashed on gigabytes of real data. Either the process exceeds its 4GB address space (most perl executables are still 32-bit), or the server itself runs out of memory.
Definitely using strategies other than "suck it all into memory" is required in many environments. -
"read the input into memory" considered harmful
2005-06-22 05:49:40 GregWilson [Reply | View]
I agree, out-of-core algorithms that don't pull everything into memory at once are absolutely necessary in a lot of cases. However, the focus of the book was on automating odds-and-ends tasks, like pulling sales ranking data off Amazon and finding peaks and valleys. (Gosh, why would I be doing that...? ;-) If you can process your data record by record, that's great; if you can't, and your data won't fit in core, then what you have is a real programming task, rather than a one-off throwaway script.
-
non-greedy regular expressions is the common term
2005-06-09 21:51:35 wixton [Reply | View]
not reluctant, or generous. regular expressions were popularized most by perl, and in that community and others) the terminology is non-greedy regular expressions. a search shows it's mostly java heads who are using this alternate terminology. -
non-greedy regular expressions is the common term
2005-06-22 05:52:42 GregWilson [Reply | View]
Yes, but (a) there are more Java heads out there than Perl heads (even today), and (b) I believe the term "reluctant" pre-dates the first version of Perl by about twenty years. And let's face it --- if I used the term "non-greedy", some Java head would have complained (and some Lisp programmer would have pointed out that "non-greedy" is ambiguous, since there are N different ways to define pattern matching that aren't strictly greedy, all of which are implemented by the following set of mutually-recursive functions...)






<t>...</t>. Take that away, so that whoever's creating the data has less typing to do...and the XSLT transformation nearly triples in complexity."I don't see how. This:
<xsl:template match="text()">
is not three times as complex as this:
<xsl:template match="t">
In my experience, XSLT is not "fiendishly difficult to debug" at all, as long as you understand its processing model and follow good practices. It's certainly harder to hack something together in XSLT if you don't know what you're doing than it is in Perl. But if you have scaled its (nontrivial) learning curve, it's a very powerful and expressive language. And its freedom from side-effects makes your transforms remarkably stable and reliable.
Just don't try to use XSLT to transform data that is only in XML because some clown decided to jump on a bandwagon. I'm thinking here of Microsoft's incredibly unusable Excel XML format. Or formats I've seen where someone takes newline-terminated text data and turns it into XML by sticking
<record>at the start of every line and</record>at the end. Oh, that's useful.