Chapter 1. Risks to Your Data: Why We Back Up

You might think you don’t need this chapter. You know why we back up data, and you don’t need someone convincing you that backing it up is important. Nonetheless, there may be a few reasons for backing up that you haven’t thought of. These will be helpful in meetings that start with the question, “Why do we spend so much money on backup and disaster recovery?”

Note

No one cares if you can back up. Only if you can restore.

—Mr. Backup

This chapter covers a variety of reasons why we need backup and disaster recovery (DR), listed in the order they are most likely to happen. Let’s start with the most common reason you’ll find yourself reaching for your backup manual: Someone did something wrong. A natural disaster might not be likely in your case, but a human disaster most certainly is.

Human Disasters

The majority of restores and disaster recoveries today are executed because of humans doing something, accidentally or on purpose, that damages your computing environment. This can be anything from a simple fat-finger mistake to a terrorist attack that takes out your whole building. Since this activity is so common, it’s what your backup and DR systems should be really good at recovering from.

First, I’ll describe a group of disasters that are caused accidentally, and then I’ll move on to more malicious acts intended to harm your data. I’ll then round out this section with how to protect against threats to your data from the inside.

Accidents

People make mistakes. Sometimes you copy the wrong file into the wrong place and overwrite a good file with a bad file. Sometimes you’re trying to make room for new data and you delete a bunch of files and clear out the trash can—only to realize seconds later that you deleted something you didn’t mean to.

As a person who has been working at a keyboard for most of his life, I can’t tell you the number of times I have fat-fingered something and done an incredible amount of damage with just a few keystrokes. This is what some of us call PEBKAC, as in, “Problem exists between keyboard and chair.”

Often when we think about what we call user error, we tend to think about what our industry calls the end user, or the person or persons who are actually using the systems and databases that the IT department provides. It’s really easy to think that all user error comes from them, not from us. But the reality is that system, network, and database administrators are also quite fallible.

In fact, not only do administrators make mistakes just like everyone else, but their mistakes can have much greater ramifications. I have seen hundreds of examples of this in my career. Here’s just a quick list off the top of my head.

  • Drop the wrong table in a database.

  • Format the wrong drive, erasing a perfectly good file system.

  • Perform a restore of a development database onto a production database.

  • Write a script designed to delete orphaned home directories but that actually deletes all home directories.

  • Delete the wrong virtual machine (VM).

Those of us with administrator privileges wield a mighty sword. A slight swing in the wrong direction can be deadly. We are another reason we back up.

Bad Code

Bad code can come from anywhere. It can be something as simple as a shell script that gets overzealous and deletes the wrong data, or it could be an actual bug in the core software that silently corrupts data. These stories don’t usually make the news, but they happen all the time.

Perhaps it is an in-house developer who doesn’t follow basic rules, like where to place their code. I remember years ago when an entire development team of a dozen or so in-house developers stored their entire code tree in /tmp on an HP-UX system, where /tmp was stored in RAM. The code was wonderful until someone rebooted the server and /tmp was cleaned out along with everything else in RAM.

Commercial software developed by a professional team of developers is not immune, either. Once again, not only is their software often run at a privileged level and therefore able to do a lot of damage, but the same software is also being run in hundreds or thousands of organizations. Sometimes a mistake isn’t noticed until it has done quite a bit of damage.

My favorite story to illustrate this is about a version of a particular commercial backup software package whose developer wanted to make it faster to put an electronic label on the front of each tape. There were two complaints about the old process, the first of which was that a tape that was being reused—and therefore being electronically relabeled—required multiple button presses to ensure that you really did mean to overwrite this tape with a new electronic label. The second was that the process only used one tape drive at a time, even if there were several tape drives in the tape library. The latest version of the software introduced what they called the “fast-and-silent” option. This meant it would label as many tapes as you wanted, without prompting you about overwriting them, using as many available tape drives as you had.

This new feature came with a new bug. Previously, when you saw a list of tapes and you double-clicked one tape, it would pull up a dialog box and ask you whether you wanted to overwrite that tape. But now, double-clicking a single tape pulled up a dialog box that listed every tape in the tape library.

Like the combination of user error and bad code that caused me to lose a version of this chapter, a user who thought they were overwriting one tape would find themselves two mouse clicks away from overwriting every tape in the tape library without being prompted—while using every available tape drive to make this process go as quickly as possible.

I was at a customer meeting, sitting next to a consultant sent by the vendor providing the software, when he did just that. He wanted to relabel a single tape in the customer’s tape library, so he double-clicked that tape, probably just like he always did. He saw the new fast-and-silent option and chose it. It was several minutes before he realized what he had done, which was to overwrite every tape in the customer’s tape library, rendering them useless. Yes, the customer had off-site copies of all these tapes, so they didn’t lose any data. No, they were not very understanding.

This is a good place for me to tell you about the 3-2-1 rule: three versions of your data, on two different media, one of which is stored somewhere else.

Note

The 3-2-1 rule is the fundamental rule upon which all backups are based. I’ll cover it in more detail in Chapter 3, but you’ll see it mentioned in almost every chapter.

People make mistakes. Sometimes good people do bad things that either delete or corrupt data, and this is yet another reason we back it up.

Malicious Attacks

Now let’s take a look at the second type of data risk: bad people doing bad things. Malicious attacks against your datacenter have sadly become quite common, and bad actors bent on doing your organization harm via some type of electronic attack are the true enemies of the modern datacenter. Steel yourself; this can get a bit choppy.

Terrorism

Someone may choose to target your organization purposefully and cause physical damage in some way. They may attempt to blow up your buildings, set them on fire, fly planes into them, or commit all sorts of physical terrorist actions. This could be for political purposes, such as what happened on 9/11, or for corporate sabotage. The latter is less likely, mind you, but it can and does happen.

Protecting your infrastructure from terrorism is outside the scope of this book. I simply wanted to mention that terrorism is yet another reason why the 3-2-1 rule exists. Get backups of your data and move them far away from the data you are protecting.

Unfortunately on 9/11, several organizations ceased to exist because they failed to follow the 3-2-1 rule. They did have backups. They even had DR hot sites that were ready to take over at a moment’s notice—in the other tower. This is why when we talk about the “1” in the 3-2-1 rule, we mean far away. A synchronously replicated copy of your database sitting on a server a few hundred yards away is not a good DR plan.

Electronic Attacks

The more common event that your organization is likely to experience is an electronic attack of some sort. Although this could be via someone hacking your firewall and opening up a backdoor into your datacenter, it’s more likely that it will be some type of malware that got into your computing environment. That malware then opens the door from the inside.

I watched a speech from a security expert who did live demonstrations of how to hack into organizations. Not one of them was via an exploited firewall or anything of the sort. Every single one exploited some human vulnerability to get someone to open the backdoor for him. It was honestly quite scary.

Such malware is typically introduced via some type of phishing and/or social engineering mechanism that results in someone in your organization downloading the errant code directly into your computing environment. It could be an email, a hacked website, or even a phone call to the wrong person. (In the previously mentioned speech, one attack vector was a phone charging cable that deployed malware if you plugged it into a computer that enabled data on that USB port.) Once that initial penetration happens, the malware can spread through a variety of means to the entire organization.

Ransomware

The most common malware that is running amok these days is what we call ransomware. Once inside your system, it silently encrypts data and then eventually offers you a decryption key in exchange for a financial sum (i.e., a ransom). This ransom could be a few hundred dollars if you are a single individual or millions of dollars if you are a big organization.

Ransomware attacks are on the rise, and ransom demands have only gotten larger. This trend will continue until every organization figures out the simple response to such an attack: a good DR system.

This is why I discuss ransomware in more detail in Chapter 11. I think ransomware is the number one reason you might actually need to use your DR system. You are much more likely to be struck by a ransomware attack than by a natural disaster or a rogue administrator deleting your data. And the only valid response to such an attack is a DR system with a short recovery time.

Malware and ransomware have been a problem for a while, but until recently, ransomware attacks were limited to those with the technical wherewithal to pull them off. That is no longer the case, thanks to the advent of ransomware-as-a-service (RaaS) vendors, who make it much easier to get into the ransomware game.

You specify the target and any information helpful in gaining access to said target, and they execute the attack. These criminal organizations do this solely for profit, taking a significant share of any ransom the attack produces. They have little interest in other uses for ransomware or theftware.

RaaS is relevant to this discussion because it bolsters my claim that this is becoming a greater danger to your data than anything else. Skilled hackers attacking organizations for ideological or corporate-espionage reasons have existed since computers came on the scene, but they were limited in number. For you to be susceptible to such an attack, you would need someone with enough incentive to attack you and enough knowledge of how to do so. The existence of RaaS removes that last requirement. The only knowledge an attacker needs now is how to get on the dark web and contact such a service.

My point is this: In addition to natural disasters becoming more common than ever before because of climate change, RaaS will make ransomware attacks a greater risk to your data with each passing day. They are yet another reason why we perform backups.

External threats such as natural disasters and electronic attacks are not your only problem, though. You must consider the risk from your own personnel. Let’s take a look at the internal threats to your data.

Internal Threats

Many organizations do not properly prepare for attacks that come from within, to their detriment. Information security professionals repeatedly warn that many attacks come from the inside. Even if an attack isn’t initiated from within, it could be enabled from within, such as by an innocent employee accidentally clicking the wrong email attachment.

The most common internal threat is an employee or contractor with privileged access who becomes disgruntled in some way and chooses to harm the organization. The harm may be anything from damaging operations or data simply out of spite, to using their access to facilitate a ransomware attack.

This is typically referred to as the rogue admin problem. A person with privileged access can do quite a bit of harm without anyone knowing. They can even plant code that will do damage after they leave.

I’m thinking of Yung-Hsun Lin, the Unix administrator who installed what was referred to as a “logic bomb” on 70 of his organization’s servers in 2004. The script in question was set to destroy all data on the servers as retaliation if he was laid off. Although his fears of being laid off turned out to be unfounded, he left the logic bomb in place anyway. Luckily it was discovered before it was set to go off and never actually did any damage. He was convicted in 2006.

I’m also thinking of Joe Venzor, whose premeditated attack on his organization’s data resulted in weeks of backlog for a boot manufacturer. Fearing that he might be fired, he put in a backdoor disguised as a printer. He was indeed fired, and immediately activated his malware. It shut down all manufacturing within one hour of his termination.

While researching this section, I came upon an online discussion in which a snarky person said that Joe Venzor’s real mistake was doing it in a way that got him caught. What he should have done, this poster said, was plant the attack and have it continually check whether he was still logging in (in effect, a dead man’s switch; a variety of scheduling tools could accomplish this). If he did not log in for more than a month, the attack would fire and do whatever the attacker wanted to punish the organization.

This illustrates the power that system administrators can have if you allow them to do so. This is why you really must do your best to limit the blast radius of those with privileged access.

Unrestricted access to the administrator account in Windows, or the root account in Unix and Linux, or similar accounts in other systems, can easily allow a rogue administrator to do damage with zero accountability. Do everything you can to limit people’s ability to log in directly as these accounts, including all of the following ideas.

Use named accounts for everything

Everyone should log in as themselves all the time. If you have to use root or administrator privilege, you should log in as yourself and then obtain such privileges via a logged mechanism. Limit or eliminate tools that must be run as root or Administrator.

Don’t give anyone the root password

I have worked in organizations that set the root or administrator password to a random string that no one records. The idea is that if you need administrator or root access, you can always grant yourself that access through systems like sudo or Run as Administrator. This allows you to do your job but records all of your activities.
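On a Linux system, a minimal sketch of both of these ideas might look like the following. The sysadmins group name and log path are placeholders of my own, not standards:

    # /etc/sudoers.d/admins -- members of a named group escalate as
    # themselves; every privileged command is attributed and logged
    %sysadmins ALL=(ALL) ALL
    Defaults logfile=/var/log/sudo.log    # command audit trail
    Defaults log_input, log_output        # optionally record terminal I/O

    # Lock the root password entirely; privileged access now flows
    # through sudo, where it is logged
    sudo passwd -l root

With the root password locked, there is no shared credential to leak, and every privileged command in the log traces back to a named person.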

Delete or disable programs with shell access

This is more of a Unix/Linux problem, but there are many commands (e.g., vi) that support escaping to the shell. If you can run vi as root, you can then run any command as root without it being logged. This is why those concerned about this problem replace vi with a similar command that does not have such access.
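On Linux, sudo offers two ways to close this particular hole; here is a sketch (the group name and file path are placeholders):

    # /etc/sudoers.d/editors
    # 1. Prefer sudoedit: it copies the file, runs your editor as you,
    #    and writes the result back as root. No root shell is possible.
    %sysadmins ALL = sudoedit /etc/myapp.conf

    # 2. If a tool must run as root, the NOEXEC tag blocks anything it
    #    tries to execute, so ":!sh" from inside vi simply fails.
    %sysadmins ALL = NOEXEC: /usr/bin/vi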

Allow superuser login only on the console

Another way to limit unrestricted superuser access is to allow such access only from the console. This is less than ideal, but it is better than nothing. Then make sure that any physical access to the console is logged. This works just as well in a virtual console world, where people must log in to the virtual console to gain access.
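On a Linux host, that policy might look like this sketch (the mechanisms vary by distribution; /etc/securetty applies only where pam_securetty is in use):

    # /etc/ssh/sshd_config -- refuse direct root logins over the network
    PermitRootLogin no

    # /etc/securetty -- list only the console devices root may use
    console
    tty1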

Off-host logging

Any access to a superuser account (authorized or not) should be logged as a security incident, and any related information (e.g., video surveillance or virtual console logs) should immediately be stored in such a way that a hacker could not remove the evidence. That way, if someone does compromise a system, they can’t easily clean up their tracks.
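On Linux, a common way to do this is to forward authentication events to a hardened log host the moment they occur. A minimal rsyslog sketch, with loghost.example.com as a placeholder:

    # /etc/rsyslog.d/50-remote.conf -- ship auth events off-host immediately
    # "@@" means TCP; a single "@" would be UDP. Add TLS in production.
    auth,authpriv.* @@loghost.example.com:514

Once an event has left the box, a rogue admin can scrub the local log all they want; the evidence is already somewhere they can’t reach.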

Limit booting from alternate media

As much as you can, remove the ability to boot the server or VM from alternate media. The reason for this is that if you can boot from alternate media, any Linux or Windows admin can easily boot the server, mount the boot drive, and edit any settings that are in their way.
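In practice this means disabling USB and DVD boot and password-protecting the firmware (BIOS/UEFI) setup, which is vendor-specific, and also password-protecting the bootloader so boot entries can’t be edited to bypass security. A sketch for the GRUB 2 portion:

    # RHEL/Fedora ship a helper that prompts for a password and writes a
    # protected user.cfg; regenerate the config afterward (the path
    # differs on UEFI systems)
    grub2-setpassword
    grub2-mkconfig -o /boot/grub2/grub.cfg

    # Debian-family equivalent: generate a hash with grub-mkpasswd-pbkdf2,
    # then add "set superusers" and "password_pbkdf2" lines to
    # /etc/grub.d/40_custom and run update-grub
    grub-mkpasswd-pbkdf2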

I’m focusing on things that a data protection person should do to help protect the data. You should also work with an information security professional, which is another discipline entirely, and outside the scope of this book.

Separation of powers

Once you make it the norm to log on as yourself when doing administrative work, the next hurdle to overcome is that too many people have access to all-powerful tools that can do as much harm as good. For example, if an employee has the ability to log in as administrator—even via sudo or the like—they can do a lot of damage before being stopped. If that same person also has access to the backup system, they can hamper or remove the organization’s ability to recover from their actions.

This is why there is a good argument to be made for separating such powers as much as possible. The backup and DR systems should be your last line of defense, and they should be managed by a completely different entity that does not also have the ability to damage the infrastructure they’re protecting.

Role-based administration

A manifestation of the concept of separation of powers can be seen in the role-based administration features of many data protection products. The idea is to have different parts of the data protection system managed by different people, each with different powers that are defined as roles.

One role might be day-to-day operations, which can execute only predefined tasks. For example, this role can monitor the success of backup policies and rerun a given backup policy if it fails; what it cannot do is change the definition of that backup in any way. Another role would have the ability to define backup policies but not the ability to run them. Separating backup configuration from backup operations minimizes what any one person can do.

A completely different role might be one having to do with restores. In a perfect world, such a role would allow data to be restored only to the place it was backed up from. Alternate-server and alternate-directory restores would require special authorization, to prevent this person from using the feature to exfiltrate data from the organization.

Backup products with role-based administration built in have defined these roles already; you simply need to assign them to different people. My point is simply to suggest that you think long and hard about how to do this. The easy thing to do would be to assign all roles to a single person, just as it’s easier to give everyone the root/admin password; the best thing from a security perspective, however, is to give each of several people a single role within the backup system. They would then need to work with another person in the organization to do anything outside their role. Although this does not completely remove the possibility of an inside job, it does significantly reduce the chances of one.
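What this looks like varies by product. As a purely hypothetical sketch (backupctl and these role names are invented for illustration, not taken from any real product), the assignments might read:

    # Hypothetical role assignments; real products define their own roles
    backupctl role grant operator alice   # run and monitor existing policies
    backupctl role grant designer bob     # edit policies, but cannot run them
    backupctl role grant restorer carol   # in-place restores only

The point is the split: no single account can both rewrite a policy and run it, or both restore data and redirect it somewhere new.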

Least privilege

Once you have enabled role-based administration, make sure that each person and each process has only the level of access required to do the job. One example: do not grant full admin access to the backup agent you are installing. Find out the lowest level of access it needs to do its job, and use that role or level of access. The same is true of any operators or administrators. They should have only the level of access necessary to accomplish their job—and nothing more.
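As one concrete illustration, on a Linux system you could run a backup agent as a dedicated service account with just the capability it needs to read files, rather than as full root. A sketch using systemd (the user name and binary path are placeholders):

    # /etc/systemd/system/backup-agent.service (excerpt)
    [Service]
    User=backupsvc                              # dedicated, non-root account
    AmbientCapabilities=CAP_DAC_READ_SEARCH     # read any file for backup...
    CapabilityBoundingSet=CAP_DAC_READ_SEARCH   # ...but no other root powers
    ExecStart=/usr/local/bin/backup-agent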

Multiperson authentication

As long as I’m on the idea of protecting your organization from insider threats, I’d like to introduce a concept not found in most backup products: multiperson authentication. It’s a take on multifactor authentication that requires two people to authenticate a particular activity. This is sometimes referred to as four-eyes authentication, because it requires two sets of eyes. (Although, as a person who must wear glasses, I’m not fond of that term.) This feature could be enabled on parts of the backup system where security is a concern.

If you’re going to do something nefarious, you might want to delete all previous backups for a given server or application, reduce the retention period for any further backups, and even delete the backup configuration altogether. In most backup environments, you could do all of that without setting off any alarms! The same thing is true of restores. If someone wants to steal your data, they can easily use the backup system to restore all of that data to a location from which they can easily remove it. This is why some products require two-person authentication for restores, changes in a backup policy, or a reduction in the retention period of an existing backup.

Just like everything else discussed in this chapter, two-person authentication is not foolproof. A hacker who has gained access to your communication system could easily intercept and circumvent additional requests for authentication. There is no IT without risks; the job of the data protection system is to reduce those risks as much as possible. That, my friends, is why we back up.

Mechanical or System Failure

When I entered the data protection industry in the early nineties, this was the number one reason we would actually use our backup system for its intended purpose. File systems and databases sat directly on physical hard drives, and when one of those hard drives decided to take a dive, it took its data along with it.

Things are very different today for a variety of reasons, the first of which is that most mission-critical data now sits on solid-state media of some sort. Almost all edge data is also stored on such media, including laptops, smartphones and tablets, and Internet of Things (IoT) devices. The result is that the average IT person today simply has not experienced the level of device failure that we experienced back in the day.

In addition to storage devices becoming more resilient, redundant storage systems such as RAID and erasure coding have become the norm in any datacenter that cares about its data. Disk drive manufacturers also appear to build integrity checking into their devices’ firmware that errs on the conservative side, to better prevent data loss due to a failed disk. This means that a restore is almost never conducted because of the failure of a hard drive, but that is not to say that such things do not happen.

Even a RAID or erasure coding array that can handle multiple simultaneous disk failures can itself fail. Power supplies can still go awry, or a firmware bug may cause failures on multiple drives at once. It is incredibly rare for simultaneous disk failures to take out a RAID array, but it is not unheard of. This is why RAID and/or erasure coding does not take away the need for backup. Restores due to such failures are rare, but they do happen.

Power Disruptions

As much as I hate to continue to use where I live as an example, we are currently experiencing what we call rolling blackouts. We are in fire season, and power companies use these planned outages to help reduce the possibility of fires.

This is something you can easily design for, and something you should be able to survive from a data protection perspective. Any datacenter of a decent size has redundant power and a large generator. You can weather a reasonably large power disruption, assuming you know it’s coming.

What might happen, however, is an unexpected power interruption. If all the power coming into your datacenter simply stops without notice, all your servers will stop working. Data will not be properly saved and some of it could become corrupted. Most structured data should be able to survive due to built-in data integrity features, and most unstructured data should survive, except a few files in the process of being written at the moment of the outage. In addition, whereas a database may survive due to its data integrity features, the media recovery process (covered in Chapter 7) may take longer than a full restore.

There are multiple shoulds in the previous paragraph, which is why we back up. Sometimes databases don’t come back up after a server crashes. Sometimes the file that was being written when the power went out was a really important file that people have been working on for multiple days. This is why we back up.

There Is No Cloud

As big a fan as I am of the public cloud, it’s really just packaging and marketing. There is no such thing as a cloud; there is only someone else’s computer. Yes, they have done all sorts of programming to make the cloud easy to provision and administer and to make it more resilient in general. But the cloud is not magic; it is just a bunch of computers providing you a service. Those computers can still fail, as can the drives within them.

It’s also important to realize that not all storage within the cloud is the same from a data protection standpoint. Although object storage is typically replicated to multiple locations and can therefore survive a number of calamities, most block storage is simply a logical unit number (LUN) on a virtual drive from a single storage array in a single datacenter. It offers no redundancy whatsoever and therefore must be backed up. Redundant block storage is available in the cloud, but most people do not use that option. In addition, as discussed earlier in this chapter, redundancy does not solve issues caused by humans.

System Failure

Whether we are talking about a single server with a single storage array, a metro cluster spanning multiple datacenters and geographic locations, or a service being used in the cloud, nothing is perfect. Programmers make mistakes and bad things happen. This is why we back up any data that actually matters.

In an irony to end all ironies, a programming error combined with user error caused me to lose the first version of this chapter completely. As mentioned in other parts of the book, I am writing this book using Dragon dictation software while walking on a treadmill in the middle of a pandemic. The default setting in Dragon is to save the audio of your dictation automatically, along with the document itself, which makes saving the document take much longer. Since I had never used the audio, this particular morning I decided to change the setting and tell it to ask me before saving the audio.

I dictated for roughly two hours as I walked on the treadmill, and then suddenly remembered that I had not been saving as I went along, as I normally do. I said, “Click File…Save,” which triggers the saving process. A dialog box popped up that asked me if I wanted to save the document, but for some reason I thought it was asking me if I wanted to save the audio. I responded, “No,” and it proceeded to close the document without saving it. There went two hours of dictation.

I should have paid closer attention when responding to the dialog box, and it shouldn’t have closed the file that I didn’t tell it to close. I simply told Dragon to save it and then told it not to save it; I never told it to close it. My point is that software and hardware might not do what you expect them to do, and this is why we back up. This is a bad example, because backup would not help in this case; the entire document was simply in RAM. Even a continuous data protection (CDP) product would not have helped the situation.

Note

Most of the advice given in this book should apply to any commercial, governmental, or nonprofit entity that uses computers to store data vital to the functioning of that organization. This is why, whenever possible, I will use the word organization when referring to the entity being protected. That could be a government, a nongovernmental organization (NGO), a for-profit private or public company, or a nonprofit company.

System and storage resiliency being what it is these days, data loss due to physical system failure shouldn’t happen too often. However, there is little your organization can do to stop the next threat to your data that we will discuss. If a natural disaster hits your organization, you’d better be ready.

Natural Disasters

Depending on where you live, and the luck your organization has, you may have already experienced one or more natural disasters that tried to take out your data. In fact, due to the incredible level of resiliency of today’s hardware, you are much more likely to experience a natural disaster than a mechanical or system failure.

Natural disasters are one big reason the 3-2-1 rule is so important, and why this will not be the last time I discuss it in this book. It is especially important if you’re going to use your backup system for DR.

The key to surviving a natural disaster is planning and designing your DR system around the types of disasters that may have an impact on your area and, therefore, your data. Let’s take a look at several types of natural disasters to see what I mean. I apologize in advance to my international audience; all my examples will be based in the United States. I’m assuming you have similar natural disasters where you live.

Floods

Floods can take out your datacenter and occur for a variety of reasons. Perhaps your building sprinkler system malfunctions and floods your building with water. Perhaps your roof leaks and there’s a solid downpour; there go your servers. Or perhaps you live in a floodplain, and a giant river overflows and takes your building with it. The result is the same: Computers and water do not mix.

I cut my first backup teeth in Delaware, right next to the Delaware River. We weren’t too concerned about hurricanes, except that one coming up the coast could cause a large storm, which in turn could flood the Delaware River. Our datacenters were on the ground floor, and that was a problem. Combine that with the fact that our off-site media storage facility was in a World War II bunker that was actually underground, and a single large flood could have taken out both our datacenters and our backups.

When it comes to floods, high ground is a beautiful thing. As with the other regional disasters mentioned earlier, the key is to make sure that your DR site is completely outside of wherever there might be a flood. Depending on your locale, this actually could be relatively close but on higher ground. Like everything else here, consult an expert on such things before making a decision.

Fires

We have four seasons in California: fire, flood, mud, and drought. It’s a desert climate, and it doesn’t take much to start a wildfire. Once out of control, those wildfires are their own living thing and are nearly impossible to stop. I have evacuated due to fire at least once, because a large wildfire was burning out of control and heading straight for my house. (If you looked at the map of that fire, it was shaped like a giant triangle. The point of that triangle was pointed directly at my house.) Fire got within a few miles of us, but we eventually got lucky.

It might not be a wildfire that takes out your datacenter, though. It could be something as simple as an electrical short in a single box. This is why we have breakers, but they don’t always work. Perhaps someone stores too many oily rags in the wrong place and you get spontaneous combustion. Perhaps someone throws a cigarette out the window and it creates a fire near your building before you stop it. Fires are incredibly damaging to a datacenter. Just as water and computers don’t mix, neither do computers and smoke.

There are a variety of methods outside the scope of backup and DR to survive a datacenter fire. But most likely, if you have one, you’ll be wanting a solid backup and DR plan. Fire is therefore yet another reason we back up.

Earthquakes

I live in southern California, so I am very familiar with earthquakes. For those of you who do not live here, please understand that most earthquakes are incredibly minor and merely feel like you’re sitting on a big vibrating bed for a few seconds. I can’t remember the last time an earthquake in my area was strong enough to knock anything off the shelf, let alone do any real damage. The Northridge quake in Los Angeles (100 miles from me) was the region’s most recent major earthquake, and it was in 1994.

The key to surviving an earthquake is preparation. Buildings are built to survive minor earthquakes, and building codes require things inside those buildings to be strapped down. Even within my home, for example, my water heater requires a strap. Datacenter racks are put on shock mounts that allow the rack to move around a little bit if the floor shakes. This probably sounds completely foreign if you don’t live here, but it is a very standard part of building and datacenter design in California.

You also have to think about how much damage an earthquake can do and make sure the DR copy of your data is outside the blast radius of that damage. This is actually not that hard to do, because earthquakes tend to be very localized. Consult an earthquake expert, and they will advise you in this regard.

Hurricanes, Typhoons, and Cyclones

Hurricanes, typhoons, and cyclones are deadly storms forming over water. (Hurricanes are called typhoons in the western North Pacific, and cyclones in the Indian Ocean and South Pacific Ocean.) I grew up in Florida, and I’ve spent a good amount of time on the Gulf Coast of Texas, so hurricanes are also something I know a little bit about. My family members and I have been in the midst of multiple hurricanes and on the fringes of many more. Unlike a typical earthquake, the great thing about a hurricane is that you do get a little bit of advance warning. You also have a solid understanding of the types of damage a hurricane can do, depending on where you live. You may be dealing with storm surge, which causes flooding. You may be dealing with roof or building damage. You simply need to design around these issues with your DR plan.

The real key to surviving a hurricane is to make sure that your DR plan is based on using a system that is completely outside of the path of any potential hurricane. This means not putting your DR site anywhere along the southeast coast of the United States or anywhere along the Gulf Coast. Hurricanes can be incredibly unpredictable and can go anywhere along the coast in those regions. As we will cover in the DR chapter, I think the best option to survive a hurricane is a cloud-based DR system, because it would allow you to have your DR site anywhere else in the country—even somewhere else in the world.

Tornadoes

Tornadoes are deadly swirling windstorms of extremely concentrated power. In the United States, we have something called tornado alley, where tornadoes are a frequent occurrence. It includes parts of nine states, starting with northern Texas and stretching up to South Dakota. For those unfamiliar with a tornado, they are incredibly concentrated events that can completely remove all evidence of one building while leaving the building next door completely untouched. They combine the worst powers of a hurricane with the unpredictability of an earthquake, because a tornado can actually touch down out of nowhere in a matter of seconds.

Like hurricanes, the key to surviving tornadoes is a DR site that is located nowhere in tornado alley. You can perhaps argue that tornadoes are such concentrated events that a DR site in, say, South Dakota can protect a datacenter in, say, Kansas. I just don’t know why, given the choice, you would do that. My personal opinion would be to have it somewhere very far west or east of where you are.

Sinkholes

I have a lot of arguments with my parents over which state has the worst natural disasters: Florida or California. I mention to them that every single year there are hurricanes that hit Florida, but major earthquakes are incredibly rare in California. They counter with the fact that hurricanes give you some advance notice and allow you to prepare, whereas earthquakes just strike without warning. The trump card is sinkholes. Not only do they strike with little to no warning, they can do an incredible amount of damage in a very short period. They combine the surgical-strike nature of tornadoes with the zero-warning aspect of earthquakes. Tornado survivors tell you it sounds like a freight train as it is happening; sinkholes are completely silent—except for the massive damage they cause. Talk about scary!

For those who are unfamiliar with this phenomenon, which is quite common in Florida and not unheard of in other parts of the world, it comes from having a foundation of limestone sitting on top of giant underground reservoirs. These underground rivers and lakes run throughout Florida and are often tapped as a source of freshwater. If you drain all the freshwater out of a particular reservoir, you create a weak point in the limestone. It’s a house of cards that can instantly take out something as small as a bedroom or as big as several city blocks.

I watched a documentary that talked about people’s beds being sucked into sinkholes while they were sleeping. It also mentioned the most infamous sinkhole that happened 15 minutes from where I lived: the Winter Park sinkhole of 1981. One afternoon, several city blocks just sank into the earth hundreds of feet below, never to be seen again, taking a house, the community swimming pool, and a Porsche dealership along with it.

At the risk of repeating myself, make sure your DR site is nowhere near your primary site. Sinkholes are relatively rare from a sinkhole-per-square-mile perspective, but they happen all the time. It’s yet another reason to make sure that you follow the 3-2-1 rule, which I’ll spell out in Chapter 3.

Takeaways

There are countless reasons to back up our important data and ensure that it is protected from damage. The first reason is that what has not been backed up cannot be restored. Remember, they don’t care if you can back up; they only care if you can restore.

The world has not been kind to those who do not back up. Natural disasters, terrorists, hackers, and simple accidents by our own staff are all on the list. In addition, as reliable as compute and storage have become over the past few decades, neither reliability nor resilience stops a hurricane or a hacker. All that resilient hardware protects you from is hardware failure, not things that attack the data itself or blow up the entire datacenter.

In short, backup, recovery, and DR are now more important and more complex than ever before. This is why it’s more important than ever to get our requirements right before designing a data protection system, and that’s why the next chapter is about doing just that.
