Chapter 7. Data deduplication in Tivoli Storage Manager V6.1 147
listed with the file. Example 7-1 shows a pair of example hashes for a given file. The two hash
functions in this case are MD5 and SHA-1. MD5 produces a 128-bit digest (hash value),
whereas SHA-1 produces a 160-bit digest. The difference is visible in the digest lengths (notice
that the 05e4... value is shorter than the 0d14... value).
Example 7-1 Hash values of two commonly used functions against the same file
# csum -h MD5 tivoli.tsm.devices.
05e43d5f73dbb5beb1bf8d370143c2a6 tivoli.tsm.devices.
# csum -h SHA1 tivoli.tsm.devices.
0d14d884e9bf81dd536a9bea71276f1a9800a90f tivoli.tsm.devices.
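The same comparison can be reproduced with most standard hashing libraries. The following is a minimal sketch using the Python standard hashlib module. The helper name digest_info and the sample input are illustrative assumptions; the input is a placeholder and not the contents of the file shown in Example 7-1.

```python
import hashlib


def digest_info(data: bytes, algorithm: str) -> tuple[str, int]:
    """Return the hex digest and the digest length in bits."""
    h = hashlib.new(algorithm, data)
    return h.hexdigest(), h.digest_size * 8


# Placeholder input; any byte string works.
sample = b"example file contents"

for name in ("md5", "sha1"):
    hex_digest, bits = digest_info(sample, name)
    # Each hex character encodes 4 bits, so 128 bits -> 32 chars, 160 bits -> 40 chars.
    print(f"{name}: {bits} bits, {len(hex_digest)} hex characters: {hex_digest}")
```

The 32-character and 40-character digest lengths correspond to the two output lines in Example 7-1.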
A typical method of deduplication is to logically separate the data in a store into manageable
chunks, then produce a hash value for each chunk, and store those hash values in a table.
When new data is taken in (ingested) into the store, the hash value of each incoming chunk is
compared with the table, and where there is a match, only a small pointer to the
first copy of the chunk is stored, rather than the new data itself.
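The method described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions (fixed-size chunks, SHA-1 digests, everything held in memory, and a hypothetical class name DedupStore); it is not how Tivoli Storage Manager or any particular product implements deduplication.

```python
import hashlib


class DedupStore:
    """Toy in-memory store that keeps one copy of each unique fixed-size chunk."""

    def __init__(self, chunk_size: int = 4096):
        self.chunk_size = chunk_size
        self.chunks: dict[str, bytes] = {}        # hash table: digest -> chunk data
        self.objects: dict[str, list[str]] = {}   # object name -> ordered list of digests

    def ingest(self, name: str, data: bytes) -> int:
        """Store data under name; return the number of new bytes actually stored."""
        new_bytes = 0
        pointers = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            digest = hashlib.sha1(chunk).hexdigest()
            if digest not in self.chunks:
                # First time this chunk is seen: keep the data itself.
                self.chunks[digest] = chunk
                new_bytes += len(chunk)
            # In every case, record only a pointer (the digest) for this object.
            pointers.append(digest)
        self.objects[name] = pointers
        return new_bytes

    def restore(self, name: str) -> bytes:
        """Rebuild an object by following its pointers back to the stored chunks."""
        return b"".join(self.chunks[d] for d in self.objects[name])


store = DedupStore(chunk_size=4)
print(store.ingest("a", b"AAAABBBBAAAA"))  # 8: the second AAAA chunk is a duplicate
print(store.ingest("b", b"BBBBCCCC"))      # 4: only CCCC is new
```

Note that this sketch trusts the digest alone when deciding that two chunks match.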
Typical chunk sizes range from 2 KB to 4 MB, although theoretically
any chunk size could be used. There is a trade-off to be made with chunk size: a smaller
chunk size means more chunks and therefore a larger hash table, so if the chunks are too small,
the table of hash pointers could outweigh the space saved by deduplication. A
larger chunk size means that the data must contain larger sections of
repeating patterns before any savings are gained, so while the hash-pointer table will be small,
the deduplication will find fewer matches.
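The table-size side of this trade-off is simple arithmetic. The sketch below estimates the hash table size for 1 TiB of data at three chunk sizes. The function name index_overhead and the 28-byte entry size (a 20-byte SHA-1 digest plus an 8-byte pointer) are assumptions made for illustration, not figures from any product, and binary units are used for convenience.

```python
def index_overhead(data_bytes: int, chunk_bytes: int, entry_bytes: int = 28) -> tuple[int, int]:
    """Return (number of chunks, table size in bytes), assuming every chunk is unique.

    entry_bytes is a hypothetical per-entry cost: 20-byte digest + 8-byte pointer.
    """
    n_chunks = -(-data_bytes // chunk_bytes)  # ceiling division
    return n_chunks, n_chunks * entry_bytes


TIB = 1024 ** 4
for chunk in (2 * 1024, 64 * 1024, 4 * 1024 * 1024):
    n, size = index_overhead(TIB, chunk)
    print(f"chunk={chunk:>8} B  chunks={n:>12,}  table={size / 1024 ** 2:>10,.1f} MiB")
```

Under these assumptions the table shrinks from about 14 GiB at 2 KiB chunks to about 7 MiB at 4 MiB chunks. The sketch does not model the other side of the trade-off, which is that larger chunks find fewer matches.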
The hashes used in deduplication are similar to those used in security products; MD5 and
SHA-1 are both commonly used cryptographic hash algorithms, and both are used in
deduplication products, along with other more specialized, custom algorithms.
With any hash, there is a possibility of a collision, which is the situation where two chunks with
different data happen to have the same hash value. This possibility is extremely remote: in
fact, a collision is less likely than an undetected, unrecovered hardware
error.
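To give a sense of scale, the chance of at least one collision among n chunks hashed to b bits can be bounded by the number of chunk pairs divided by 2^b, often called the birthday bound. The sketch below evaluates that bound. It assumes the hash behaves like an ideal random function, so it says nothing about deliberately constructed collisions, and the chunk count is a hypothetical figure chosen for illustration.

```python
from fractions import Fraction


def collision_bound(n_chunks: int, hash_bits: int) -> Fraction:
    """Upper bound on the collision probability: n(n-1)/2 pairs, each matching with chance 2**-bits."""
    return Fraction(n_chunks * (n_chunks - 1), 2 ** (hash_bits + 1))


n = 2 ** 40  # hypothetical number of stored chunks
for name, bits in (("MD5", 128), ("SHA-1", 160)):
    p = collision_bound(n, bits)
    print(f"{name} ({bits} bits): at most about {float(p):.3e}")
```

Exact fractions are used so that the very small results are not lost to rounding before they are printed.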
Other methods exist in the deduplication technology area which are not hash based, and so do
not have any logical possibility of hash collisions. One such method is called HyperFactor; this is
implemented in the IBM ProtecTIER® storage system.
7.1.2 Deduplication ratios
Vendors often quote “average” deduplication ratios as a guide to the space savings available.
In common with compression, deduplication engines are only as good as the type of data fed
into them—so your mileage might vary.
The classic example of a deduplication engine's prowess is with data like that contained in an
e-mail system. If we send an uncompressed 1 MB attachment to 100 people, the copies
would take up 100 MB on the e-mail server, plus the 1 MB “sent” copy in our sent folder. The
e-mail server would need 101 MB of free space for us to send that e-mail. When we come to
back up the e-mail server, we would separately back up all 101 copies as though unrelated,
using 101 MB of space.
If we were doing this with a deduplicating data store as the target, we would probably
consume less than 1 MB on deduplicated storage, depending on how deduplicatable the
original 1 MB attachment was. If we assume that it had 50% repeating patterns inside, the
