Full Text Comparison
An alternative approach is to compare messages using their entire content, taking into account the insertion and deletion of words and changes in spelling and punctuation. This lets you use all the information in the text, rather than a single word or phrase, and it spares you from defining a specific pattern that may not work as well as you had hoped.
Text comparison in this general sense is not a simple problem.
Simple tools such as grep or diff are not up to the task. Tools based on
dynamic programming, which I discuss briefly in Chapter 8 in the context of
uncovering plagiarism, are too computationally expensive to be used
here. Fortunately, there are a variety of open source text search tools
available that can be used. Most of these operate by indexing the
significant words in each document and then efficiently comparing those
indexes. This approach, in its basic form, treats each word separately,
whereas a lot of information is contained in how words are arranged in
sentences. In the case of email searches, this is not such an important
factor. Some of the leading tools in this area include WebGlimpse
(http://webglimpse.net/), Swish-e (http://swish-e.org/) and Lucene (http://lucene.apache.org/). Efficient text
comparison is a major component of Internet search engines, and, not
surprisingly, these open source tools tend to focus on that
application.
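To make the indexing idea concrete, here is a minimal Python sketch of the basic form described above: each message is reduced to a count of its significant words, and two messages are compared through those counts alone, ignoring word order. It is not how WebGlimpse, Swish-e, or Lucene are implemented. The stop-word list, the function names index_words and similarity, the sample messages, and the choice of cosine similarity as the comparison measure are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this
# sketch); real indexing tools ship much longer lists.
STOP_WORDS = {"a", "an", "and", "at", "for", "in", "is", "of", "on",
              "our", "the", "to", "you", "your"}

def index_words(text):
    """Return a word -> count index of the significant words in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def similarity(index_a, index_b):
    """Cosine similarity between two word-count indexes, from 0.0 to 1.0."""
    shared = set(index_a) & set(index_b)
    dot = sum(index_a[w] * index_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in index_a.values()))
    norm_b = math.sqrt(sum(c * c for c in index_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty index matches nothing
    return dot / (norm_a * norm_b)

# Made-up sample messages: two variants of one message and an unrelated one.
msg_a = "Please verify your account details at our secure site"
msg_b = "Verify your account details now at our secure web site"
msg_c = "Lunch meeting moved to Thursday at noon"

idx_a, idx_b, idx_c = (index_words(m) for m in (msg_a, msg_b, msg_c))
print("a vs b:", round(similarity(idx_a, idx_b), 2))
print("a vs c:", round(similarity(idx_a, idx_c), 2))
```

The two variants of the same message score well above the unrelated one even though words have been inserted and removed. Because each word is treated separately, rearranging the words of a message would not change its score, which is the limitation noted above.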
Rather than show how one of these tools can be adapted for email searching, I have chosen ...