Chapter 13. Using Data De-Duplication to Lighten the Load

Sometimes people don't exercise or eat right and end up getting fat. A similar situation can happen with data storage. Lazy users who never delete anything make the poor backup administrators' jobs even harder, forcing them to back up useless or duplicate data. Many companies have no formal policies on storing information, so data that the business doesn't really need gets stored, backed up, and even replicated for disaster recovery anyway.

Consider these situations:

  • Users who never delete e-mails or send e-mails with large attachments to large distribution lists

  • Users who store multiple copies of the same file because they're not sure which one holds the right changes

  • Users who download and store MP3 files at work

  • Multiple duplicate copies of executables like winword.exe (the executable file that makes Microsoft Word work) backed up from all the laptops and desktops in the company

All these situations conspire to waste storage space. As a result, many SANs store much more data than necessary, which raises costs. This chapter deals with the general concept of data de-duplication: what it is, how it works, where it should be applied, and the results you should expect.

Understanding Data De-Duplication

In simplified terms, data de-duplication means comparing objects (usually, files or blocks) and removing all non-unique or duplicate objects (copies). If you look at the left side of Figure 13-1, you see several blocks ...
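
To make the idea of comparing blocks and removing copies a little more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular product works: the fixed 4,096-byte block size, the use of SHA-256 fingerprints to decide whether two blocks match, and the names deduplicate_blocks, rebuild, store, and recipe are all made up for this example.

```python
import hashlib


def deduplicate_blocks(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep one copy of each unique block.

    Assumption for this sketch: two blocks count as duplicates when their
    SHA-256 digests match.
    """
    store = {}   # digest -> block contents (one entry per unique block)
    recipe = []  # ordered list of digests needed to rebuild the original data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # store the block only the first time it is seen
        recipe.append(digest)            # always record which block belongs at this position
    return store, recipe


def rebuild(store, recipe) -> bytes:
    """Reassemble the original data from the unique blocks and the recipe."""
    return b"".join(store[digest] for digest in recipe)


# Five blocks of sample data, but only two distinct patterns.
sample = b"A" * 4096 * 3 + b"B" * 4096 + b"A" * 4096
store, recipe = deduplicate_blocks(sample)
print(len(recipe), "blocks referenced,", len(store), "unique blocks stored")
# prints: 5 blocks referenced, 2 unique blocks stored
```

In this sketch, the duplicates are replaced by short references (the recipe), so nothing is lost: rebuild(store, recipe) returns the original bytes. The same approach could be applied to whole files instead of blocks by fingerprinting each file's full contents.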
