find -exec

The -exec flag tells find to execute the given command once for each file matched, substituting the name of the file wherever you put the {} placeholder. The command must end with a semicolon, which has to be escaped from the shell, either as \; or as ";". In the following script, the md5sum of every file found is taken and stored in a temporary file. This is achieved by this find -exec command in the script:

find "${DIR}" $SIZE -type f -exec md5sum {} \; | sort > $MD5

The $SIZE variable optionally adds -size +0 to the flags passed to find, as there is little point in taking the md5sum of a bunch of zero-length files. The md5sum of an empty file is always d41d8cd98f00b204e9800998ecf8427e.
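As a minimal sketch of the effect of $SIZE, the following runs the same find -exec pipeline twice against a throwaway directory, once with $SIZE empty and once with it set to -size +0. The directory contents and the variable names all and nonempty are made up for this illustration and are not part of the original script. It assumes GNU find and md5sum are available.

```shell
#!/bin/bash
# Hypothetical test directory with two small files and one empty file.
DIR=$(mktemp -d)
printf 'hello\n' > "$DIR/a.txt"
printf 'hello\n' > "$DIR/b.txt"
: > "$DIR/empty.txt"

# With SIZE empty, the unquoted $SIZE expands to nothing,
# so find checksums every regular file, including the empty one.
SIZE=""
all=$(find "${DIR}" $SIZE -type f -exec md5sum {} \; | sort)

# With SIZE set, the unquoted $SIZE splits into the two words
# "-size" and "+0", so zero-length files are skipped.
SIZE="-size +0"
nonempty=$(find "${DIR}" $SIZE -type f -exec md5sum {} \; | sort)

echo "All files:"
echo "$all"
echo "Non-empty files only:"
echo "$nonempty"

rm -rf "$DIR"
```

Note that $SIZE is deliberately left unquoted here: quoting it would pass "-size +0" to find as a single argument (or an empty string when it is unset), which find would reject.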

See the uniq section in Chapter 13 for a detailed explanation of how uniq filters the results. In short, -w32 tells it to compare only the first 32 characters of each line, which are the checksums, and -d tells it to print only repeated lines, ignoring truly unique lines, as they could not represent duplicate files.
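The filtering step can be sketched in isolation as follows. The checksums here are placeholder 32-digit strings generated with printf rather than real md5sum output, and the filenames and the dups variable are invented for the example. The -w option is a GNU uniq extension, so this assumes GNU coreutils.

```shell
#!/bin/bash
# Build a fake checksum list in md5sum's "checksum  filename" layout.
# %032d pads each number to 32 digits, standing in for a real MD5 hash.
MD5=$(mktemp)
{
  printf '%032d  %s\n' 1 ./a.txt
  printf '%032d  %s\n' 1 ./copy-of-a.txt
  printf '%032d  %s\n' 2 ./b.txt
  printf '%032d  %s\n' 3 ./c.txt
  printf '%032d  %s\n' 3 ./c-backup.txt
} | sort > "$MD5"

# -w32 compares only the first 32 characters (the checksum);
# -d prints one line for each checksum that occurs more than once.
dups=$(uniq -w32 -d "$MD5")
echo "$dups"

rm -f "$MD5"
```

uniq only compares adjacent lines, which is why the list is sorted first. The line for ./b.txt never appears in the output because its checksum occurs only once.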

The problem of efficiently locating duplicate files is not as simple as it might at first sound. With potentially gigabytes or terabytes of data, it is not efficient to use diff to compare each file against all of the other files. By taking a checksum of each file first, the up-front cost is relatively high, but then by using sort and uniq to good effect, the set of possible matches is quite easily and quickly obtained. The -c
