Chapter 4. Backup and Recovery Basics

Now that I’ve defined what backup and archive are and how to keep them safe, we need to drill down further into some basic backup and recovery concepts. I’ll start by discussing the all-important concept of recovery testing, followed by the concept of backup levels. I then look at many backup system metrics, especially the concepts of RTO and RPO and how they (more than anything else) determine backup design. I then talk about image-level versus file-level backups and how the contents of backups are selected. The first, and possibly most important, basic backup concept, however, is that all backups must be tested.

Recovery Testing

There is no more basic concept of backup and recovery than to understand that the only reason we back up things is to be able to restore them. And the only way you’ll know whether you can restore the things you’re protecting is to test your ability to do so. Regular recovery testing should be a fundamental part of your backup system.

Besides testing the validity of the backup system and its documentation, regular testing also helps train your personnel. If the first time they’re executing a big restore is when they’re doing it in production, such a restore will be a much more stressful situation and more likely to be error prone. If they’ve done such a restore many times, they should just be able to follow their usual procedure.

You should regularly test the recovery of anything and everything you are responsible for. This includes small things and very large things. The frequency of testing of each thing should be related to how often a restore of such a thing happens. A few times a year might be appropriate for a big DR test, but you should be restoring individual files and VMs at least once a week per person.

The cloud has made all of this much easier, because you don’t have to fight for the resources to use for recovery. You just have to configure the appropriate resources in the cloud and then restore to those resources. This is especially true of large DR resources; it should be very easy to configure everything you need to do a full DR test in the cloud. And doing this on a regular basis will make doing so in production much easier. Tests should also include restoring common things in SaaS services, such as users, folders, and individual files or emails.

Note

A backup isn’t a backup until it’s been tested!

—Ben Patridge

Backup Levels

There are essentially two very broad categories of what the backup industry calls backup levels; you are either backing up everything (i.e., full backup) or you are backing up only what has changed (i.e., incremental backup). Each of these broad types has variations that behave slightly differently. Most of the backup levels are throwbacks to a bygone era of tape, but it’s worth going through their definitions anyway. Then I’ll explain the levels that are still relevant in “Do Backup Levels Matter?”.

Traditional Full Backup

A traditional full backup copies everything from the system being backed up (except anything you specifically told it to exclude) to the backup server. This means all files in a filesystem (unstructured data) or all records in a database (structured data).

It requires a significant amount of input/output (I/O), which can create a significant performance impact on your application. This is especially true if you are pretending that your VMs are physical machines, and you happen to be performing multiple, simultaneous, full traditional backups on several VMs on the same hypervisor node.

Figure 4-1 shows a typical weekly full backup setup, with three types of incremental backups that I will discuss next.

Traditional Incremental Backup

A traditional incremental backup will back up all filesystem files or database records that have changed since a previous backup. There are different types of incremental backups, and different products use different terminology for the different types. What follows is my best attempt to summarize the different types.

Unless otherwise specified, incremental backups are full-file incremental backups, meaning the system will back up a file if the modification time has changed or its archive bit has been set in Windows. Even if the user changed only one block in the file, the complete (i.e., full) file will be backed up. Block-level incremental and source-side deduplication backups (both of which are discussed later in this chapter) are the only incremental backups that do not behave this way.
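To make the full-file behavior concrete, here is a minimal sketch of how a backup client might pick files by modification time. The function name and the use of a plain timestamp for the last backup are assumptions for illustration; real products also track things like the archive bit and deleted files.

```python
import os

def files_changed_since(root, last_backup_time):
    """Return paths under root whose modification time is newer than last_backup_time.

    last_backup_time is seconds since the epoch, as returned by time.time().
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # The whole file is selected even if only one block inside it changed.
            if os.stat(path).st_mtime > last_backup_time:
                changed.append(path)
    return sorted(changed)
```

Every path returned would be copied in its entirety, which is exactly the cost that block-level approaches try to avoid.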

Typical incremental backup

A typical incremental backup will back up all data that has changed since the previous backup, whatever type of backup it happened to be. Whether the previous backup was a full backup or another incremental backup, the next incremental backup will back up only data that has changed since the last backup. This is the most common type of incremental backup. You can see this behavior in Figure 4-1.

Cumulative incremental backup

A cumulative incremental backup backs up all data that has changed since the last full backup. This requires more I/O of the backup client than a typical incremental, and requires more bandwidth to transmit and more storage to store (assuming you’re not using deduplication). The advantage of this type of backup is that you only need to restore from the full backup and the latest cumulative incremental backup. Compare this with the typical incremental backup, when you need to restore from the full backup and each subsequent incremental backup. However, the advantage of this type of incremental really goes by the wayside if you are using disk as your backup target.

In Figure 4-1, you can see a cumulative incremental backup being run on Saturday night. It backs up anything that has changed since the full backup on Sunday. This would happen regardless of which night it is run.

This type of backup is often called a differential, but I prefer not to use that term, because some backup software products use that term to mean something very different. Therefore, I use the term cumulative incremental.

Incremental backup with levels

This type of incremental backup uses the concept of levels, each specified by a number, where 0 represents a full backup, and 1–9 represents other incremental backup levels. An incremental backup at a given level will back up everything that has changed since the most recent backup at any lower level. For example, if you run a level 2 backup after a level 1 backup, it will back up everything that has changed since that level 1 backup. You can mix and match these levels for various results.

For example, you might do a level 0 (i.e., full backup) on Sunday and then a level 1 every day. Each level 1 backup would contain all data that has changed since the level 0 on Sunday. You could also do a level 0 backup on the first day of the month, a level 1 every Sunday, and a series of backups with increasing levels the rest of the week (e.g., 2, 3, 4, 5, 6, 7). Each Sunday’s backup would be a cumulative incremental (i.e., all data changed since the level 0), and the rest of the backups would behave as typical incremental backups, just like in the top half of Figure 4-1.

An interesting idea that uses levels is called the Tower of Hanoi (TOH) backup plan, which is illustrated in the bottom half of Figure 4-1. It’s based on an ancient mathematical progression puzzle of the same name. If you’re still backing up to tape and are worried about a single piece of media ruining a restore, TOH can help with that.

The game consists of three pegs and a number of different-sized rings inserted on those pegs. A ring may not be placed on top of a ring with a smaller radius. The goal of the game is to move all the rings from the first peg to the third peg, using the second peg for temporary storage when needed.¹

One of the goals of most backup schedules is to get changed files on more than one volume while reducing total volume usage. The TOH accomplishes this better than any other schedule. If you use a TOH progression for your backup levels, most changed files will be backed up twice—but only twice. Here are two versions of the progression. (They’re related to the number of rings on the three pegs, by the way.)

0 3 2 5 4 7 6 9 8 9

0 3 2 4 3 5 4 6 5 7 6 8 7 9 8

These mathematical progressions are actually pretty easy. Each consists of two interleaved series of numbers (e.g., 3 4 5 6 7 8 9 interleaved with 2 3 4 5 6 7 8). Please refer to Table 4-1 to see how this would work.

Table 4-1. Basic Tower of Hanoi schedule
Sunday	Monday	Tuesday	Wednesday	Thursday	Friday	Saturday
0	3	2	5	4	7	6

As you can see in Table 4-1, it starts with a level 0 (full) on Sunday. Suppose that a file is changed on Monday. The level 3 on Monday would back up everything since the level 0, so that changed file would be included on Monday’s backup. Suppose that on Tuesday we change another file. Then on Tuesday night, the level 2 backup must look for a level that is lower, right? The level 3 on Monday is not lower, so it will reference the level 0 also. So the file that was changed on Monday, as well as the file that was changed on Tuesday, gets backed up again. On Wednesday, the level 5 will back up just what changed that day, since it will reference the level 2 on Tuesday. But on Thursday, the level 4 will not reference the level 5 on Wednesday; it will reference the level 2 on Tuesday.
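The lookup rule described above (find the most recent earlier backup with a lower level) can be sketched in a few lines. The function and variable names are made up for illustration; the schedule is the one from Table 4-1.

```python
def reference_backup(schedule, index):
    """Return the index of the backup that the backup at `index` is based on.

    schedule is a list of (day, level) tuples in the order the backups ran.
    A level N backup references the most recent earlier backup with a lower level.
    Returns None for a level 0 (full) backup.
    """
    level = schedule[index][1]
    for earlier in range(index - 1, -1, -1):
        if schedule[earlier][1] < level:
            return earlier
    return None

# The basic Tower of Hanoi week from Table 4-1.
week = [("Sun", 0), ("Mon", 3), ("Tue", 2), ("Wed", 5),
        ("Thu", 4), ("Fri", 7), ("Sat", 6)]

for i, (day, level) in enumerate(week):
    ref = reference_backup(week, i)
    based_on = "nothing (full backup)" if ref is None else week[ref][0]
    print(f"{day} level {level}: changes since {based_on}")
```

For the days walked through above, it reports Monday and Tuesday referencing Sunday, and Wednesday and Thursday referencing Tuesday.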

Note that the file that changed on Tuesday was backed up only once. To get around this problem, we use a modified TOH progression, dropping down to a level 1 backup each week, as shown in Table 4-2.

Table 4-2. Monthly Tower of Hanoi schedule

Day	Week 1	Week 2	Week 3	Week 4
Su	0	1	1	1
Mo	3	3	3	3
Tu	2	2	2	2
We	5	5	5	5
Th	4	4	4	4
Fr	7	7	7	7
Sa	6	6	6	6

If it doesn’t confuse you and your backup methodology,² the schedule depicted in Table 4-2 can be very helpful. Each Sunday, you will get a complete incremental backup of everything that has changed since the monthly full backup. During the rest of the week, every changed file will be backed up twice, except for Wednesday’s files. This protects you from media failure better than any of the schedules mentioned previously. You will need more than one volume to do a full restore, of course, but this is not a problem if you have a sophisticated backup utility with volume management.

Block-level incremental backup

A block-level incremental only backs up bytes or blocks that have changed since the last backup. In this context, a block is any contiguous section of bytes that is less than a file. The key differentiator here is that something is tracking which bytes or blocks have changed, and that tracking mechanism will determine which of these blocks, bytes, or segments of bytes are sent in an incremental backup.

This requires significantly less I/O and bandwidth than the full-file incremental approach. It has become much more popular with the advent of disks in the backup system, because it creates many smaller backups, all of which have to be read in a restore. This would be very problematic in a tape world, but it’s no big deal if your backups are on disk.

The most common place where block-level incremental backup occurs today is in backing up hypervisors. The hypervisor and its VMs maintain a bitmap that records which blocks have changed since a given point in time. Backup software can simply ask for all blocks that have changed since the specified date, and the hypervisor will respond with the results after it queries the bitmap.
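As a rough illustration of the tracking idea, here is a toy model of a changed-block bitmap. The class and method names are hypothetical; real hypervisors expose this through their own APIs, which this sketch does not attempt to model.

```python
class ChangedBlockTracker:
    """Toy model of a per-disk bitmap that records which blocks were written."""

    def __init__(self, block_count):
        self.dirty = [False] * block_count

    def record_write(self, block_number):
        # Called on every write; marks the block as changed.
        self.dirty[block_number] = True

    def changed_blocks(self):
        # What the backup software asks for at backup time.
        return [n for n, is_dirty in enumerate(self.dirty) if is_dirty]

    def reset(self):
        # Called after a successful backup so the next one starts clean.
        self.dirty = [False] * len(self.dirty)


tracker = ChangedBlockTracker(block_count=8)
tracker.record_write(2)
tracker.record_write(5)
tracker.record_write(2)  # rewriting a block does not add a second entry
to_send = tracker.changed_blocks()  # [2, 5]
tracker.reset()
```

The backup only has to read and send the blocks in to_send, rather than scanning every file for changes.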

Source-side deduplication

Source-side deduplication (or just source deduplication, to differentiate it from target dedupe) will be covered in more detail in Chapter 5, but it is technically a type of incremental. Specifically, it is an extension of the block-level incremental backup approach, except additional processing is applied to the new or changed blocks before they are sent to the backup server. The source dedupe process tries to identify whether the “new” blocks have been seen before by the backup system. If, for example, a new block has already been backed up somewhere else, it won’t need to be backed up again. This might happen if you are backing up a file shared among many people, or if you back up the operating system that shares a lot of files with other systems. This saves time and bandwidth even more than block-level incremental backup does.
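The "have I seen this block before?" check can be sketched as a hash lookup. This is a simplified illustration only: the choice of SHA-256, the in-memory set standing in for the backup system's index, and the function name are all assumptions, and real products differ in how they chunk data and where the index lives.

```python
import hashlib

def blocks_to_send(changed_blocks, known_hashes):
    """Return only the blocks whose content the backup system has not seen.

    changed_blocks is a list of bytes objects; known_hashes is a set of hex
    digests already stored by the (hypothetical) backup server.
    """
    unseen = []
    for block in changed_blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in known_hashes:
            known_hashes.add(digest)  # remember it so repeats are skipped
            unseen.append(block)
    return unseen

server_index = {hashlib.sha256(b"shared OS library").hexdigest()}
changed = [b"shared OS library", b"new report", b"new report"]
sent = blocks_to_send(changed, server_index)  # only one copy of b"new report"
```

Three blocks changed on the client, but only one crosses the network, which is where the extra time and bandwidth savings come from.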

Synthetic full backups

The traditional reason for periodic full backups is to make a typical restore faster. If you only performed one full backup (with a traditional backup product), followed by incrementals forever, a restore would take a very long time. Traditional backup software would restore all data found on the full backup, even if some of the data on that tape had been replaced by newer versions found on incremental backups. The restore process would then begin restoring new or updated files from the various incremental backups in the order that they were created.

This process of performing multiple restores, some of which are restoring data that will be overwritten, is inefficient to say the least. Since traditional restores were coming from tape, you also had to add the time required to insert and load each tape, seek the appropriate place on the tape, and eject the tape once it was no longer needed. This process can take over five minutes per tape.

This means that with this type of configuration, the more frequent your full backups are, the faster your restores will be because they are wasting less time. (From a restore perspective only, full backups every night would be ideal.) This is why it was very common to perform a full backup once a week on all systems. As systems got more automated, some practitioners moved to monthly or quarterly full backups.

However, performing a full backup on an active server or VM creates a significant load on that server. This gives an incentive for a backup administrator to decrease the frequency of full backups as much as possible, even if it results in restores that take longer. This push and pull between backup and restore efficiency is the main reason that synthetic full backups came to be. A synthetic full backup is a backup that behaves as a full backup during restores, but it is not produced via a typical full backup. There are three main methods of creating a synthetic full backup.

Synthetic full by copying

The first and most common method of creating a synthetic full backup is to create one by copying available backups from one device to another. The backup system keeps a catalog of all data it finds during each backup, so at any given point, it knows all the files or blocks—and which versions of those files or blocks—that would be on a full backup if it were to create one in the traditional way. It simply copies each of those files from one medium to another. This method will work with tape or disk as long as multiple devices are available.
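Here is a minimal sketch of how a catalog could be used to work out what belongs on the synthetic full. The catalog layout (a dictionary per backup, with None marking a deleted file) and the backup IDs are assumptions for illustration, not any product's actual format.

```python
def synthetic_full(backups):
    """Build the file list of a synthetic full from a full plus incrementals.

    backups is a list of catalogs ordered oldest to newest. Each catalog maps
    a file path to the (hypothetical) backup ID holding that version, or to
    None if the file was deleted by the time of that backup.
    """
    latest = {}
    for catalog in backups:
        for path, location in catalog.items():
            if location is None:
                latest.pop(path, None)   # deleted files are left out
            else:
                latest[path] = location  # newer version replaces older one
    return latest

full = {"/etc/hosts": "full-sun", "/home/a.txt": "full-sun", "/tmp/x": "full-sun"}
incr_mon = {"/home/a.txt": "incr-mon"}
incr_tue = {"/tmp/x": None, "/home/b.txt": "incr-tue"}

to_copy = synthetic_full([full, incr_mon, incr_tue])
```

The resulting map tells the backup server which version of each file to copy, and from which existing backup, without touching the client at all.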

The big advantage of this method of creating a synthetic full backup is that this process can be run any time of day without any impact to the backup clients, because the servers or VMs for which you are creating the synthetic full backup are completely uninvolved. When complete, the resulting backup usually looks identical to a traditional full backup, and subsequent incremental backups can be based on that full backup.

There are two downsides to this method, the first of which is that the process of copying data can take quite a bit of time but, as already mentioned, you can do it anytime, even in the middle of the day. The other downside is that it can also create quite an I/O load on disk systems being used as a source and target for this backup. This wasn’t so much of a problem in the tape world, because the source and target devices were obviously separate devices. But if you have a single target deduplication appliance, a synthetic full backup created with this method is the I/O equivalent of a full restore and a full backup at the same time. How much this affects your appliance will depend on the appliance.

Virtual synthetic full

There is another approach to synthetic full backups that is only possible with target deduplication systems (explained in more detail in Chapters 5 and 12). In a target deduplication system, all backups are broken into small chunks to identify which chunks are redundant.³ Each chunk is then stored as a separate object in the target dedupe appliance’s storage, resulting in each changed file or block being represented by many small chunks stored in the target deduplication system. This means that it is possible for this appliance to pretend to create a full backup by creating a new backup that simply points to blocks from other backups.

This method does require integration with the backup product. Although the dedupe system may indeed be able to create a full backup without the backup product, the backup product wouldn’t know about it and wouldn’t be able to use it for restores or to base incremental backups on. So the backup product tells the target deduplication system to create a virtual synthetic full backup, after which it creates one pretty much instantaneously. There is no data movement, so this method is very efficient, but it may be limited to certain backup types, such as VMs, filesystem backups, and certain supported databases.

Incremental forever

The idea of a synthetic full backup is to use various ways to create something that behaves like a full backup without actually having to do another full backup. Newer backup systems have been created from the ground up to never again need another full backup, synthetic or otherwise. Although early implementations of this idea did occur in the tape world, the idea of incremental forever (also called forever incremental) backups really took off in the world of disk backups.

A true incremental forever is only feasible when using disk as your primary target, because the backup system will need to access all backups at the same time for it to work. Another change is that backups cannot be stored inside an opaque container (e.g., tar or a proprietary backup format), which is how most backup products store them. (Please do not confuse this term container with Docker containers. I just don’t have a better word.) Instead, the backup system will store each changed item from the latest incremental backup as a separate object, typically in an object-storage system.

This will work whether your incremental forever backup software product backs up entire files, parts of files, or blocks of data (as discussed in “Block-level incremental backup”). Your backup software would store each object separately—even the smallest object (e.g., file, subfile, block, or chunk)—allowing it to access all backups as one big collection.

During each incremental backup, the backup system will also see the current status of each server, VM, or application it backs up, and it knows where all the blocks are that represent its current state (i.e., the full backup). It doesn’t need to do anything other than hold on to that information. When it’s time for a restore, it just needs to know where all the objects that represent a full backup are and deliver them to the restore process. This means that all backups will be incremental backups, but every backup will behave as a full backup from a restore perspective, without having to do any data movement to create that full backup.

This backup method creates a full backup every day without any of the downsides of doing that, or doing it with synthetic full backups. The only real downside to this approach is that it needs to be built into the backup system from the beginning. It only works if the backup system is built from scratch to never again look for a full backup, synthetic or otherwise.

Do Backup Levels Matter?

Backup levels are really a throwback to a bygone era, and they matter much less than they used to. When I first started my backup job in the early ’90s, backup levels mattered a lot. You wanted to do a full backup every week and a cumulative incremental (all changes since the full) every day if you could get away with it. Doing backups that way meant you needed two tapes to do a restore. It also meant that much of the changed data was on multiple tapes, since each cumulative incremental was often backing up many of the files that were on the previous cumulative incremental. This method was popular when I was literally hand-swapping tapes into a drive when I needed to do a restore, so you really wanted to minimize the number of tapes you had to grab from the tape drawer (or bring back from Iron Mountain), because you had to swap them in and out of the drive until the restore was done. Who wanted to do that with 30 tapes (what you would need to do if you did monthly full backups)?

Move forward just a few years, and commercial backup software and robotic tape libraries really took over. I didn’t have to swap tapes for a restore, but there was one downside to a restore that needed a lot of tapes. If the robot had to swap 30 tapes in and out for a restore, it would add about 45 minutes to the process. This was because it took 90 seconds on average to load the tape and get to the first byte of data. I modified my typical setup to use a monthly full backup, daily typical incremental backups, and a weekly cumulative incremental backup. This meant a worst-case restore would need eight tapes, which would add about 12 minutes instead of 45. And that’s how things were for a really long time.
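The arithmetic behind those two numbers is simple enough to write down. The 90-second figure comes from the text above; the function name and the breakdown of the eight tapes (one full, one weekly cumulative, up to six dailies) are my reading of the schedule described.

```python
LOAD_SECONDS_PER_TAPE = 90  # average load-and-position time quoted in the text

def tape_overhead_minutes(tape_count, seconds_per_tape=LOAD_SECONDS_PER_TAPE):
    """Minutes a robot spends loading tapes during a restore."""
    return tape_count * seconds_per_tape / 60

# Monthly full plus daily incrementals: up to 30 tapes.
monthly_full_only = tape_overhead_minutes(30)

# Monthly full + latest weekly cumulative + up to six daily incrementals.
with_weekly_cumulative = tape_overhead_minutes(1 + 1 + 6)

print(monthly_full_only, with_weekly_cumulative)  # 45.0 12.0
```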

For those who have moved on from tape as a target for backups, most of the reasons we did various levels of backups no longer apply. Even doing a full backup every day doesn’t waste storage if you’re using a good target deduplication system. There is also no loading of 30 incremental backup tapes when all your backups are on disk. Finally, there are newer backup systems that really only do one backup level: block-level incremental. This is all to say that the more you are using disk and other modern technologies, the less the previous section should matter to you.

What Is the Archive Bit in Windows?

The archive bit is a flag set on files in Windows. If the “ready for archiving” bit is set on a file in Windows, it indicates that a file is new or changed and that it should be backed up in an incremental backup. Once this happens, the archive bit is cleared.

The first problem I have with the archive bit is that it should be called the backup bit, because, as I mentioned in Chapter 3, backups are not archives. But the real issue I have is that the first backup program to back up the directory will clear the archive bit, and the next program will not back up the same file. If a regular user uses some third-party backup tool to back up their own files, it will clear the archive bit, and the corporate backup system in charge of backing up those files will not back them up. They don’t appear to be in need of backup, because the archive bit is not set. So any user can defeat the purpose of the entire backup system.

I’ve never been a fan of the archive bit. The good news is that it’s pretty much a nonfactor in most datacenter backups, because backups are running at a VM level. The archive bit is not being used to decide what gets backed up. #Winning

Metrics

You need to determine and monitor a number of metrics when designing and maintaining a data protection system. They determine everything from how you design the system to how you tell whether the system is doing what it’s designed to do. Metrics also determine how much compute and storage capacity you are using and how much you have left before you have to buy additional capacity.

Recovery Metrics

There are no more important metrics than those having to do with recovery. No one cares how long it takes you to back up; they only care how long it takes to restore. There are really only two metrics that determine whether your backup system is doing its job: how fast you can restore and how much data you lose when you do restore. This section explains these metrics and how they are determined and measured.

Recovery time objective (RTO)

The recovery time objective (RTO) is the amount of time, agreed to by all parties, that a restore should take after some kind of incident requiring a restore. The length of an acceptable RTO for any given organization is typically driven by the amount of money it will lose when systems are down.

If a company determines it will lose millions of dollars of sales per hour during downtime, it will typically want a very tight RTO. Companies such as financial trading firms, for example, seek to have an RTO as close to zero as possible. Organizations that can tolerate longer periods of computer downtime might have an RTO measured in weeks. The important thing is that the RTO must match the needs of the organization.

Calculating an RTO for a governmental organization, or a nonprofit company, can be a bit more problematic. They will most likely not lose revenue if they are down for a period of time. One thing they might need to calculate, however, is the amount of overtime they may have to pay to catch up if a prolonged outage occurs.

There is no need to have a single RTO across the entire organization. It is perfectly normal and reasonable to have a tighter RTO for more critical applications and a more relaxed RTO for the rest of the datacenter.

It’s important when calculating an RTO to understand that the clock starts when the incident happens, and stops when the application is completely online and business has returned to normal. Too many people focused on backup think the RTO is the amount of time they have to restore data, but this is definitely not the case. The process of copying data from backups to the recovered system is only a small part of the activities that have to take place to recover from something that would take out an application. A hardware order might have to be placed, or some other contract or logistical issue might have to be resolved before you can actually begin a restore. Other things might also have to happen after you perform your restore before the application is ready for the public. So remember when determining your RTO that it’s much more than just the restore that you have to make time for.

Remember that RTO is the objective. Whether you can meet that objective is a different issue explained in “Recovery time actual and recovery point actual”. But first we need to talk about RPO, or recovery point objective.

Recovery point objective (RPO)

RPO is the amount of acceptable data loss after a large incident, measured in time. For example, if we agree we can lose one hour’s worth of data, we have agreed to a one-hour RPO. Like the RTO, it is perfectly normal to have multiple RPOs throughout the organization, depending on the criticality of different datasets.

Most organizations, however, settle on values that are much higher than an hour, such as 24 hours or more. This is primarily because the smaller your RPO, the more frequently you must run your backup system. There’s not much point in agreeing to a one-hour RPO and then only running backups once a day. The best you will be able to do with such a system is a 24-hour RPO, and that’s being optimistic.

Negotiating your RPO and RTO

Many organizations might want a very tight RTO and RPO. In fact, almost every RTO and RPO conversation I have participated in started with a question of what the organization wanted for these values. The response was almost always an RTO and RPO of zero. This means that if a disaster occurred, the business/operational unit wants you to resume operations with no downtime and no loss of data. Not only is that not technically possible even with the best of systems, it would be incredibly expensive to do.

Therefore, a response to such a request should be the proposed cost of the system required to meet that request. If the organization can justify an RTO and RPO of 0—or anything close to those values—they should be able to back it up with a spreadsheet showing the potential cost of an outage to the organization.

There can then be a negotiation between what is technically feasible and affordable (as determined by the organization) and what is currently happening with the disaster recovery system. Please don’t just take whatever RTO and RPO you are given and then just ignore it. Please also don’t make up an RTO and RPO without consulting the business/operational units because you think they’re going to ask for something unreasonable. This conversation is an important one and it needs to happen, which is why Chapter 2 is dedicated to it.

Recovery time actual and recovery point actual

The recovery point actual (RPA) and recovery time actual (RTA) metrics are measured only if a recovery occurs, whether real or via a test. The RTO and RPO are objectives; the RPA and RTA measure the degree to which you met those objectives after a restore. It is important to measure this and compare it against the RTO and RPO to evaluate whether you need to consider a redesign of your backup-and-recovery system.

The reality is that most organizations’ RTA and RPA are nowhere near the agreed-upon RTO and RPO for the organization. What’s important is to bring this reality to light and acknowledge it. Either we adjust the RTO and RPO, or we redesign the backup system. There is no point in having a tight RTO or RPO if the RTA and RPA are nowhere near it.
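One way to picture the comparison is to compute both actuals from three timestamps recorded during a real or test recovery. This is a sketch under stated assumptions: the function name, the example dates, and the 24-hour RPO and 4-hour RTO are all made up for illustration.

```python
from datetime import datetime, timedelta

def recovery_actuals(last_good_backup, incident, back_in_service):
    """Return (RPA, RTA) as timedeltas for one real or test recovery."""
    rpa = incident - last_good_backup   # data written in this window is lost
    rta = back_in_service - incident    # clock runs from incident to business as usual
    return rpa, rta

rpo, rto = timedelta(hours=24), timedelta(hours=4)  # hypothetical objectives

rpa, rta = recovery_actuals(
    last_good_backup=datetime(2024, 3, 1, 22, 0),
    incident=datetime(2024, 3, 2, 15, 0),
    back_in_service=datetime(2024, 3, 2, 21, 30),
)

met_rpo = rpa <= rpo  # 17 hours of lost data fits within 24 hours
met_rto = rta <= rto  # 6.5 hours of downtime does not fit within 4 hours
```

In this made-up case the RPO was met but the RTO was not, which is the kind of gap that should trigger either a renegotiated objective or a redesign.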

Tip

Consecutive backup failures happen a lot in the real world, and that can affect your RPA. One rule of thumb I use to ensure backup failures don’t impact my RPA is to determine backup frequency by dividing the RPO by three. For example, a three-day RPO would require a backup frequency of one day. That way, you can have up to two consecutive backup failures without missing your RPO. Of course, if daily backups are the most frequent backups your system can perform, your RPA would be three days if you had two consecutive backup failures that were not addressed. Since consecutive backup failures often happen in the real world, the typical RPO of 24 hours will rarely be met unless you are able to back up more often than once a day.

—Stuart Liddle
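The rule of thumb in the tip can be expressed as two small formulas. The function names are hypothetical, and the worst case assumed here is that the incident happens just before the next backup would have succeeded.

```python
from datetime import timedelta

def backup_interval_for(rpo, tolerated_failures=2):
    """Interval that still meets the RPO after N consecutive failed backups."""
    return rpo / (tolerated_failures + 1)

def worst_case_rpa(interval, consecutive_failures):
    """Age of the newest good backup just before the next backup finally succeeds."""
    return interval * (consecutive_failures + 1)

interval = backup_interval_for(timedelta(days=3))          # one day
rpa_after_two_failures = worst_case_rpa(interval, 2)       # three days
daily_with_24h_rpo = worst_case_rpa(timedelta(days=1), 1)  # two days: misses a 24-hour RPO
```

The last line shows why a 24-hour RPO is fragile with daily backups: a single failed backup is enough to miss it.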

Testing recoveries

This is a perfect time to point out that you must test recoveries. The reason for this is that most organizations rarely get to fire their backup system in anger; therefore, they must pretend to do so on a regular basis. You will have no idea how your backup system actually performs if you don’t test it.

You won’t know how reliable your backup system is if you don’t test it with recoveries. You won’t know what kind of resources a large-scale recovery uses and how much it tasks the rest of the environment. You won’t have any idea what your RTA and RPA are if you don’t occasionally perform a large restore and see how long it takes and how much data you lose.

I will be talking about success metrics in a few pages. Backup success is an important metric, but there will always be failed backups. If your system is doing what it is supposed to be doing, frequent restores will show a restore success of 100% (or at least close to it). Advertising this metric will help build confidence in your recovery system.

Having participated in a few large-scale recoveries where I didn’t know the system capabilities very well, I can tell you the first question you are going to get is, “How long will this take?” If you haven’t been doing regular test restores, you will not be able to answer that question. That means you’ll be sitting there with knots in your stomach the whole time, not knowing what to tell senior management.

Be as familiar with restores as you are with backups. The only way to do that is testing.

Capacity Metrics

Whether you use an on-premises or cloud-based system, you need to monitor the amount of storage, compute, and network available to you and adjust your design and settings as necessary. As you will see in these sections, this is one area where a cloud-based system can really excel.

License/workload usage

Your backup product or service has a set number of licenses for each thing you are backing up. You should track your utilization of those licenses so you know when you will run out of them.

Closely related to this is simply tracking the number of workloads you are backing up. Although this may not result in a license issue, it is another metric that can show you growth of your backup system.

Storage capacity and usage

Let’s start with a very basic metric: Does your backup system have enough storage capacity to meet your current and future backup and recovery needs? Does your DR system have sufficient storage and compute capacity to take over from the primary datacenter in case of disaster? Can it do this while also conducting backups? Whether you are talking about a tape library or a storage array, your storage system has a finite amount of capacity, and you need to monitor that capacity and the percentage of it that you’re using over time.

Failing to monitor storage usage and capacity can result in you being forced to make emergency decisions that might go against your organization’s policies. For example, the only way to create additional capacity without purchasing more is to delete older backups. It would be a shame if failure to monitor the capacity of your storage system resulted in the inability to meet the retention requirements your organization has set.
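
A simple way to turn capacity monitoring into an early warning is to project when you will run out. The sketch below uses a straight line between the oldest and newest samples, which is an assumption made for illustration; the sample values are hypothetical, and the same idea applies to license counts.

```python
def days_until_full(samples, capacity_tb):
    """samples: list of (day_number, used_tb) tuples, oldest first."""
    (first_day, first_used), (last_day, last_used) = samples[0], samples[-1]
    if last_day <= first_day:
        raise ValueError("samples must span at least one day")
    growth_per_day = (last_used - first_used) / (last_day - first_day)
    if growth_per_day <= 0:
        return None  # usage is flat or shrinking; no projected fill date
    return (capacity_tb - last_used) / growth_per_day

# Hypothetical usage readings taken 30 days apart.
samples = [(0, 60.0), (30, 67.5), (60, 75.0)]
print(days_until_full(samples, capacity_tb=100.0))  # 100.0
```

Knowing you have roughly 100 days left is what lets you buy capacity on your own schedule instead of deleting older backups in an emergency.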

This is a lot easier if the storage you are using in the cloud is object storage. Both object and block storage have virtually unlimited capacity in the cloud, but only object storage automatically grows to meet your needs. If your backup system requires you to create block-based volumes in the cloud, you will still have to monitor capacity, because you will need to create and grow virtual volumes to handle the growth of your data. This is not a requirement if you are using object storage.

Besides the downsides of having to create, manage, monitor, and grow these volumes, there is also the difference in how they are charged. Cloud block volumes are priced based on provisioned capacity, not used capacity. Object storage, on the other hand, only charges you for the number of gigabytes you store in a given month.
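
The billing difference is easy to see with a little arithmetic. In this sketch the price is hypothetical and deliberately the same for both storage types, so that the only difference is provisioned versus used capacity; real block and object prices differ.

```python
def monthly_cost_cents(billable_gb, cents_per_gb_month):
    """Monthly charge in cents for the given billable capacity."""
    return billable_gb * cents_per_gb_month

provisioned_gb = 10_000  # size of the block volume you created
used_gb = 6_000          # data actually stored
price = 5                # hypothetical cents per GB per month

block = monthly_cost_cents(provisioned_gb, price)  # billed on provisioned size
obj = monthly_cost_cents(used_gb, price)           # billed on stored data
print(block / 100, obj / 100)  # 500.0 300.0
```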

Throughput capacity and usage

Typical backup systems have the ability to accept a certain volume of backups per day, usually measured in megabytes per second or terabytes per hour. You should be aware of this number and make sure you monitor your backup system’s usage of it. Failure to do so can result in backups taking longer and stretching into the work day. As with storage capacity utilization, failure to monitor this metric may force you to make emergency decisions that might go against your organization’s policies.

Monitoring the throughput capacity and usage of tape is particularly important. As I will discuss in more detail in “Tape Drives”, it is very important for the throughput of your backups to match the throughput of your tape drive’s ability to transfer data. Specifically, the throughput that you supply to your tape drive should be more than the tape drive’s minimum speed. Failure to do so will result in device failure and backup failure. Consult the documentation for the drive and the vendor’s support system to find out what the minimum acceptable speed is, and try to get as close to that as possible. It is unlikely that you’ll approach the maximum speed of the tape drive, but you should also monitor for that.
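
Both checks described above can be expressed as simple calculations. This is a minimal sketch using decimal units; the drive speeds in the example are hypothetical, so take the real minimum and maximum from your drive’s documentation.

```python
def backup_hours(data_tb, throughput_mb_per_s):
    """Hours needed to move data_tb at a sustained throughput."""
    megabytes = data_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return megabytes / throughput_mb_per_s / 3600

def tape_feed_status(supplied_mb_per_s, drive_min_mb_per_s, drive_max_mb_per_s):
    """Compare the rate you feed a tape drive with its supported speed range."""
    if supplied_mb_per_s < drive_min_mb_per_s:
        return "below minimum"
    if supplied_mb_per_s >= drive_max_mb_per_s:
        return "at maximum"
    return "ok"

print(backup_hours(18, 500))           # 10.0
print(tape_feed_status(40, 60, 300))   # below minimum (hypothetical drive speeds)
```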

This is one area where the cloud offers both an upside and a downside. The upside is that the throughput of the cloud is virtually unlimited (like cloud-based object storage), assuming the cloud-based product or service you are using can scale its use of the bandwidth available to it. If, for example, their design uses standard backup software running in a VM in the cloud, you will be limited to the throughput of that VM, and you will need to upgrade the type of VM (because different VM types get different levels of bandwidth). At some point, though, you will reach the limits of what your backup software can do with one VM and will need to add additional VMs to add additional bandwidth. Some systems can automatically add bandwidth as your needs grow.

One downside of using the cloud as your backup destination is that your site’s bandwidth is not unlimited, nor is the number of hours in a day. Even with byte-level replication or source-side deduplication (explained later in this chapter), you might still exceed your site’s upload bandwidth. If that happens, you will need to upgrade bandwidth or, potentially, change designs or vendors. (The bandwidth required by different vendors is not the same.)
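
You can check whether a given daily change rate fits through your uplink before committing to a design. The sketch below assumes decimal units and a hypothetical usable_fraction parameter for the share of the link that backups are allowed to use.

```python
def upload_hours(changed_gb, uplink_mbps, usable_fraction=1.0):
    """Hours needed to upload changed_gb over a link rated in megabits per second."""
    megabits = changed_gb * 8 * 1000  # 1 GB = 8,000 megabits (decimal units)
    return megabits / (uplink_mbps * usable_fraction) / 3600

# Hypothetical site: 450 GB of new data per day after deduplication, 100 Mbps uplink.
hours = upload_hours(450, 100)
print(hours)  # 10.0
```

If only half the link can be used for backups, the same upload takes twice as long, which may no longer fit in a day’s backup schedule.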

Compute capacity and usage

The capability of your backup system is also driven by the ability of the compute system behind it. If the processing capability of the backup servers or the database behind the backup system is unable to keep up, it can also slow down your backups and result in them slipping into the work day. You should also monitor the performance of your backup system to see the degree to which this is happening.

Once again, this is another area where the cloud can help—if your backup system is designed for the cloud. If so, it can automatically scale the amount of compute necessary to get the job done. Some can even scale the compute up and down throughout the day, lowering the overall cost of ownership by lowering the number of VMs, containers, or serverless processes that are used.

Unfortunately, many backup systems and services running in the cloud are using the same software that you would run in your datacenter. Because the concept of automatically adding additional compute was really born in the cloud, backup software products written for the datacenter do not have this concept. That means that if you run out of compute on the backend, you will need to add additional compute manually, and the licenses that go with it, yourself.

Backup Window

A traditional backup system has a significant impact on the performance of your primary systems during backup. Traditional backup systems perform a series of full and incremental backups, each of which can take quite a toll on the systems being backed up. Of course, the full backups will take a toll because they back up everything. Incremental backups are also a challenge if they are what are called full-file incrementals, meaning the system backs up the entire file even if only one byte has changed. Changing a file updates its modification time in Linux or sets its archive bit in Windows, and the whole file gets backed up. Since the typical backup can really affect the performance of the backed-up systems, you should agree in advance on the time you are allowed to run backups, referred to as the backup window.

Back in the day, a typical backup window for me was 6 p.m. to 6 a.m. Monday to Thursday, and from 6 p.m. Friday to 6 a.m. Monday. This was for a typical work environment where most people were not working on the weekends, and fewer people were using the systems at night.

If you have a backup window, you need to monitor how much you are filling it up. If you are coming close to filling up the entire window with backups, it’s time either to reevaluate the window or redesign the backup system.
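
Monitoring how full the window is can be as simple as comparing the time your backups took with the length of the window. This is a sketch only: the 80% warning threshold is an arbitrary assumption, and the times are hypothetical.

```python
from datetime import datetime

def window_utilization(window_start, window_end, backups_start, backups_end):
    """Fraction of the agreed window consumed by the night's backups."""
    window = window_end - window_start
    if window.total_seconds() <= 0:
        raise ValueError("window_end must be after window_start")
    return (backups_end - backups_start) / window

def window_status(utilization, warn_at=0.8):
    """Classify utilization; warn_at is an arbitrary example threshold."""
    if utilization > 1.0:
        return "overran the window"
    if utilization >= warn_at:
        return "close to full"
    return "ok"

# A 6 p.m. to 6 a.m. window, with backups finishing at 4:48 a.m.
start = datetime(2024, 1, 8, 18, 0)
end = datetime(2024, 1, 9, 6, 0)
used = window_utilization(start, end, start, datetime(2024, 1, 9, 4, 48))
print(f"{used:.0%} {window_status(used)}")  # 90% close to full
```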

I also feel that you should assign a window to your backup product and just let it do the scheduling within that window. Some people try to overengineer their backup system, scheduling thousands of individual backups with an external scheduler. I have never found a situation in which an external scheduler can be as efficient with backup resources as the included scheduler. I think letting the backup product schedule itself is more efficient and a whole lot less tedious. But I’m sure someone reading this will feel completely differently.

Organizations that use backup techniques that fall into the incremental forever category (e.g., continuous data protection [CDP], near-CDP, block-level incremental backups, or source-side deduplication backups, all of which are explained elsewhere in this book) don’t typically have to worry about a backup window. This is because these backups typically run for very short periods (i.e., a few minutes) and transfer a small amount of data (i.e., a few megabytes). These methods usually have a very low performance impact on primary systems or at least less of a performance impact than the typical setup of occasional full backups and daily full-file incremental backups. This is why customers using such systems typically perform backups throughout the day, as often as once an hour or even every five minutes. A true CDP system actually runs continuously, transferring each new byte as it’s written. (Technically, if it isn’t running completely continuously, it’s not a continuous data protection system. Just saying.)

Backup and Recovery Success and Failure

Someone should be keeping track of how many backups and recoveries you perform and what percentage of them is successful. Although you should shoot for 100% of both, that is highly unlikely, especially for backup. You should definitely monitor this metric, though, and look at it over time. Although the backup system will rarely be perfect, looking at it over time can tell you whether things are getting better or worse.

It’s also important for any backup or recovery failures to be addressed. If a backup or restore is successfully rerun, you can at least cross the failed one off your list of concerns. However, it’s still important to keep track of those failures for trending purposes.
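
Tracking this over time does not need much machinery. The sketch below groups job results by week and computes a success rate for each; the week labels and results are hypothetical, and the same function works for restores if you feed it restore results.

```python
from collections import defaultdict

def success_rate_by_week(jobs):
    """jobs: iterable of (week_label, succeeded) pairs, one per backup or restore."""
    totals = defaultdict(lambda: [0, 0])  # week -> [jobs run, jobs succeeded]
    for week, succeeded in jobs:
        totals[week][0] += 1
        totals[week][1] += int(succeeded)
    return {week: ok / total for week, (total, ok) in sorted(totals.items())}

# Hypothetical job log. Failed jobs stay in the log even if a rerun succeeds.
jobs = [
    ("2024-W01", True), ("2024-W01", False),
    ("2024-W02", True), ("2024-W02", True),
]
print(success_rate_by_week(jobs))  # {'2024-W01': 0.5, '2024-W02': 1.0}
```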

Retention

Although not technically a metric, retention is one of the things that are monitored in your backup and archive system. You should ensure that the defined retention policies are being adhered to.

Like RTO and RPO, the retention settings of your data protection system should be determined by the organization. No one in IT should be determining how long backups or archives should be kept; this should be determined by legal requirements, organizational requirements, and regulatory requirements.

Another thing to mention when discussing retention is how long things should be kept on each tier of storage. Gone are the days when all backups and archives were kept on tape. In fact, most of today’s backups are not kept on tape; they are kept on disk. Disk has a variety of classes from a performance and cost perspective. Retention should also specify how long things should be kept on each tier.

Once you determine your retention policies, you should also review whether your data protection system is adhering to the policies that have been determined. (I’m referring here to the organization’s policies, not the settings of the backup system.) Determine what your retention policies are and document them. Then periodically review your backup and archive systems to make sure that the retention policies that have been set within these systems match the policies that you determined for your organization. Adjust and report as necessary.
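
The periodic review can be partly automated by comparing the documented policy with what is configured in the backup system. This sketch assumes you can export both as simple mappings of dataset name to retention in days; the names and values are hypothetical.

```python
def retention_mismatches(policy_days, configured_days):
    """Return {dataset: (required_days, configured_days)} where they differ.

    A dataset missing from the backup system shows up with None.
    """
    issues = {}
    for dataset, required in policy_days.items():
        actual = configured_days.get(dataset)
        if actual != required:
            issues[dataset] = (required, actual)
    return issues

# Hypothetical organizational policy versus backup system settings.
policy = {"email": 2555, "file-shares": 540, "hr-database": 2555}
configured = {"email": 2555, "file-shares": 365}
print(retention_mismatches(policy, configured))
# {'file-shares': (540, 365), 'hr-database': (2555, None)}
```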

The Right to Be Forgotten

Closely related to retention is the idea of the right to be forgotten (formally, the right to erasure), which was popularized by the EU’s GDPR and the CCPA. The concept is relatively simple: Your personal information is yours, and you should have the right to say who can have it. If you withdraw your consent from a company, or if it never had it to begin with, it is supposed to erase all of your personal information from its systems.

Here’s food for thought: How do you address that in a backup system? Backup systems are made to remember, and you’re asking it to forget. I brought this issue up when the GDPR first went into effect, but I’ve never seen guidance on what this means. Are backups excluded from the right to erasure? How, exactly, would you go about deleting such data from backups? If you can’t delete such data from backups, how do you address restores? What happens if you accidentally restore a database that has deleted people in it?

I don’t yet have answers to most of these questions. At this point, I just think the questions need to be asked. It would be nice if the regulators behind the GDPR and others requiring erasure would clarify this issue.

Using Metrics

One of the ways you can increase the confidence in your backup system is to document and publish all the metrics mentioned here. Let your management know the degree to which your backup system is performing as designed. Let them know how many backups and recoveries you perform, how well they perform, and how long it will be before they need to buy additional capacity. Above all, make sure that they are aware of your backup and recovery system’s ability to meet your agreed-upon RTO and RPO. Hiding your RTA and RPA will do no one any good if there is an outage.

I remember knowing what our company’s RTO and RPO were and knowing our RTA and RPA were nowhere near those numbers. We had a four-hour RTO and I could barely get tapes back from our vaulting vendor in that time. The restore itself for a full server was usually longer than four hours, and I knew that the restore couldn’t start until we replaced the hardware that had been damaged. I remember that we all just laughed at these metrics. Don’t do that. Figure 4-2 spells this out perfectly.

Figure 4-2. Curtis knew he couldn't meet his RTO or RPO, but said nothing because he was afraid to lose his job.

Be the person in the room bold enough to raise your hand in a meeting and point out that the RTO and RPO are nowhere near the RTA and RPA. Push for a change in the objectives or the system. Either way, you win.

Speaking of meetings, I’d like to dispel a few myths about backup and archive that you’re likely to hear in them. I hope these will allow you to respond to them in real time.

Backup and Archive Myths

There are many myths in the backup and archive space. I have argued every one of the myths in this section countless times with people in person and even more people on the internet. People are convinced of a particular idea and, often, no number of facts seems to change their mind, but this is my response to these myths:

You don’t need to back up RAID.

Redundant disk systems do not obviate the need for backup. You would think that this wouldn’t have to be debated, but it comes up way more often than you might think. RAID in all of its forms and levels, as well as RAID-like technologies like erasure coding, only protects against physical device failure. Different levels of RAID protect against different types of device failures, but in the end, RAID was designed to provide redundancy in the hardware itself. It was never designed to replace backup for one very important reason: RAID protects the volume, not the filesystem on top of the volume. If you delete a file, get ransomware and have the file encrypted, or drop a table in a database that you didn’t mean to drop, RAID can do nothing to help you. This is why you back up data on a RAID array, no matter what level.

Note

A friend of mine was using NT4 Workstation on a RAID1 array, but didn’t have any backups, because it was RAID1, and therefore “safe”. He was a database administrator (DBA), and a semi-professional photographer, with many thousands of photos on the same partition as the OS. A patch for the OS corrupted his filesystem due to a driver incompatibility, but his drives were fine. He lost all of his photos to this myth. I tried to help him recover, but the tools I had available were of no use.

—Kurt Buff

You don’t need to back up replicated data.

The answer to this is really the same as the section on RAID. Replication, no matter how many times you do it, replicates everything, not just the good things. If bad things happen to the data sitting on your replicated volume, the replication will simply copy those things over to another location. In fact, I often make the joke that replication doesn’t fix your error or stop the virus; it just makes your error or the virus more efficient. It will replicate your mistake or virus anywhere you told it to. Where this comes up these days is in multinode, sharded databases like Cassandra and MongoDB. Some DBAs of these products mentioned that every shard is replicated to at least three nodes, so it should be able to survive multiple node failures. That is true, but what happens if you drop a table that you didn’t mean to drop? All the replication in the world will not fix that. This is why you must back up replicated data.

You don’t need to back up IaaS and PaaS.

I don’t typically get into too many arguments with people who think they don’t need to back up their public cloud infrastructure. IaaS and PaaS vendors often provide facilities for you to be able to do your own backup, but I don’t know any that back up on your behalf. Cloud resources are wonderful and infinitely scalable. Many of them also have high-availability features built in. But just like replication and RAID, high availability has nothing to do with what happens when you make a mistake or are attacked, or the datacenter becomes a crater from an explosion.

The main point I want to make here is that it is really important to get your cloud backups out of your cloud account and the region where they were created. Yes, I know it’s harder. Yes, I know it might even cost a little bit more to do it this way. But leaving your backups in the same account that created them and storing them in the same region as the resources you are protecting doesn’t follow the 3-2-1 rule. Read the sidebar “There’s a New One” to learn what can happen when you don’t do this.

You don’t need to back up SaaS

I argue this myth two or three times a week at this point. Everyone seems to think that backups either should be or already are included as part of the service when you contract a SaaS vendor like Microsoft 365 or Google Workspace. But here’s a really important fact: such services are almost never included by the major SaaS vendors. If you doubt me, try to find the words backup, recovery, or restore in your service contract. Try also to find the word backup in the documentation for the product. I have searched for all of these, and I find nothing that meets the basic definition of backup, which is the 3-2-1 rule. At best, these products offer convenience restore features that use versioning and recycle bins and things like that, which are not protected against a catastrophic failure or attack. This topic is covered in more detail in “Software-as-a-Service (SaaS)”.

Backups should be stored for many years

Chapter 3 gives a solid definition of backup and archive. Backup products are not usually designed to do the job of archive. If your backup product requires you to know the hostname, application name, directory name, table name, and a single date to initiate a restore, then that is a traditional backup product and not one designed to do retrievals. If you can search for information by a different context, such as who wrote the email, what words were in the email or the subject line, and a range of dates (and you don’t need to know the server it came from), then your backup product can do retrievals, and you can go read a different myth.

But most of you are dealing with a backup product that only knows how to do backups and restores, which means it doesn’t know how to do archives and retrievals. (Again, if you don’t know the difference, you really need to read Chapter 3, which goes into this topic in detail.) If your backup product doesn’t know how to do retrievals and you’re storing backups for several years, you are asking for trouble. Because if the data is accessible to you and you get an e-discovery request, you will legally be required to satisfy that request. If what you have is a backup product and not an archive product, you are looking at a potentially multimillion-dollar process to satisfy a single e-discovery request. If you think I’m exaggerating, read the sidebar “Backups Make Really Expensive Archives”.

Most restores are from the past 24 hours. I’ve done an awful lot of restores in my career, and very few of them have been from any time except the past few days, and even fewer were older than the past few weeks. I personally like to set the retention of the backup system to 18 months, which accounts for a file that you only use once a year and didn’t realize it was deleted or corrupted last year.

After that, things get a lot more complicated. Server names change, application names change, and you don’t even know where the file is anymore. The file might also be incompatible with the current version of software you’re running. This is especially true in database backups.

If you have an organizational need to keep data for many years, you need a system that is capable of that. If you are one of the rare organizations using a backup system that is truly capable of both backup and archive, go in peace. But if you are one of the many organizations using your backup system to hold data for seven years or—God forbid—forever, please seriously reconsider that policy. You are asking for trouble.

Tape is dead

I haven’t personally used a tape drive to make a backup in several years. I haven’t designed a new backup system to use tape drives as its initial backup target in at least 10 years, probably more. With very few exceptions, tape is pretty much dead to me as an initial target for backups for all the reasons discussed in “Tape Drives”. (I will consider using it as a target for a copy of a backup, which I go into in Chapter 9.)

I don’t have this opinion because I think tape is unreliable. As I discuss in the aforementioned “Tape Drives” section, I think tape is fundamentally incompatible with how we do backups today. Tapes want to go much faster than typical backups run, and the incompatibility between these two processes creates the unreliability that some people believe tape drives have.

The irony is that tape drives are actually better at writing ones and zeros than disk is; they’re also better than disk at holding on to ones and zeros for longer periods. But trying to make a tape drive that wants to go 1 GB a second happy with a backup that is running a few dozen megabytes per second is simply impossible.

And yet, more tape is sold today than ever before. Giant tape libraries are being sold left and right, and those tape libraries are storing something. As I said elsewhere, the worst-kept secret in cloud computing is that these big cloud vendors buy a lot of tape libraries. So what is all this tape being used for?

I think the perfect use for tape is long-term archives. Don’t try to send an incremental backup directly to tape; you are asking for trouble. But if you happen to create a large archive of a few terabytes of data and you have it stored on disk right next to that tape drive, you should have no problem streaming that archive directly to tape and keeping that tape drive happy. The tape drive will reliably write that data to tape and hold on to that data for a really long time. You can create three or more copies for next to nothing and distribute those copies around the world as well.

Tape is an incredibly inexpensive medium. Not only are the media and the drives that write them incredibly inexpensive, but the power and cooling of a tape library also cost much less than the power and cooling of a disk system. In fact, a disk system would cost more even if the disks themselves were free. (Over time, the power and cooling costs of the disk drives will outweigh the power, cooling, and purchase costs of the tape drives.)

You are probably using more tape than you think if you are using very inexpensive object storage in the cloud. I don’t know this for a fact, but the behavior of many of these systems sounds an awful lot like tape. So although tape may be retiring from the backup and recovery business, it has a long life in the long-term storage business.

I have one final thought on the subject. I had a conversation recently with a fellow IT person. He was explaining to me how they had datacenters on a Caribbean island that didn’t have good internet, so they did not feel that cloud-based backup was a way to safeguard their data, because they just don’t have the bandwidth for it. Thus, they use disk backups to create an on-premises backup that is then replicated to an off-premises array, which is then copied to tape and sent to Iron Mountain. I said the chances that they would ever actually use that tape are next to none, and then he reminded me of a hurricane that just took out that entire island not that long ago. Tapes were all they had. Like I said, tapes are not dead.

Now that those myths are out of the way, let’s continue our journey through the basics of backup and recovery. The next thing to think about is the unit you will be backing up. Will you back up individual items (e.g., files) or entire images?

Backups Make Really Expensive Archives

I worked with a customer many years ago who received a single electronic discovery request for emails that matched a particular set of criteria during a period of three years. This customer didn’t have an email archive system, but they did have a weekly full backup of Exchange for the past three years.

If they’d had an email archive, they could’ve done a single request that said something like this: show me all the emails for the past three years written by Curtis that say the phrase “3-2-1 rule.” They would soon be presented with a downloadable PST file that they could hand to the lawyers.

But they didn’t have an email archive system; they had a backup system. So here’s what they had to do:

1. Set the restore time frame to 156 weeks ago (three years).
2. Perform an alternate-server Microsoft Exchange restore (a very complicated task indeed) of the entire Exchange server from 156 weeks ago.
3. Search Curtis’s Sent Items folder for any emails written in the past week with the phrase “3-2-1 rule.”
4. Subtract 1 from the current number of weeks (156 – 1 = 155).
5. Repeat steps 2–4 155 more times.

We were able to do three restores at a time. Each restore took a long time and had to be executed by someone who really knew what they were doing. (Alternate server Exchange restores are no joke.) A team of consultants, hired specifically for the task, worked 24 hours a day for several months to accomplish it. It cost the client $2 million in consulting fees.

Like I said, backups make really expensive archives.

Item- Versus Image-Level Backups

There are two very different ways to back up a server: item-level backup and image-level backup. Item level is usually called file level, although you are not always backing up files. (It might be objects.) Image level is currently most popular when backing up virtualized environments. They both come with their own advantages and disadvantages.

Item-Level Backup

An item-level backup backs up discrete collections of information that are addressed as individual items, and the most common type of item is a file. In fact, if this section were being written several years ago, this would most likely be called file-level backup.

The other type of item that might be included in an item-level backup is an object in an object-storage system. For many environments, objects are similar to files in that most organizations using object storage are simply using it to hold on to what would otherwise be files, but since they are being stored in an object storage system, they are not files, because files are stored in a filesystem. The contents are often the same, but they get a different name because they are stored differently.

You typically perform item-level backup if you are running a backup agent inside the server or VM itself. The backup agent is deciding which files to back up by first looking at the filesystem, such as C:\Users or /Users. If you are performing a full backup, it will back up all the files in the filesystem. If you are performing an incremental backup, it will be backing up files that have changed since the last backup. You are also performing an item-level backup if you are backing up your object-storage system, such as Amazon S3, Azure Blob, or Google Cloud Storage. The idea of whether to back up object storage is covered in “Object storage in the cloud”.

The advantage of an item-level backup is that it is very easy to understand. Install a backup agent in the appropriate place, and it will examine your file or object-storage system, find all the items, and back them up at the appropriate time.
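
To make the incremental selection concrete, here is a minimal sketch of how an agent might decide which files to include. It assumes the decision is based only on each file’s modification time, which is a simplification; real agents may use other signals, and the function name is made up for this example.

```python
import os
import tempfile

def files_changed_since(root, last_backup_epoch):
    """Return paths under root modified after the last backup (full-file incremental)."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_mtime > last_backup_epoch:
                    changed.append(path)
            except OSError:
                continue  # file vanished or is unreadable; a real agent would log this
    return sorted(changed)

# Demo with two files whose modification times straddle the last backup.
with tempfile.TemporaryDirectory() as root:
    for name, mtime in (("old.txt", 1_000), ("new.txt", 3_000)):
        path = os.path.join(root, name)
        with open(path, "w") as handle:
            handle.write(name)
        os.utime(path, (mtime, mtime))  # set access and modification times
    changed = files_changed_since(root, last_backup_epoch=2_000)
    print([os.path.basename(p) for p in changed])  # ['new.txt']
```

Passing 0 as the last backup time selects every file, which is effectively a full backup.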

Image-Level Backups

An image-level backup is the result of backing up either a physical or virtual device at the block level, creating an image of the entire drive. This is why, depending on your frame of reference, image-level backups are also referred to as drive-level, volume-level, or VM-level backups. The device could be storing a variety of information types, including a standard filesystem, block storage for a database, or even the boot volume for a physical or virtual machine. Within an image-level backup, you’re backing up the building blocks of the filesystem rather than backing up the files themselves.

Prior to the advent of virtualization, image-level backups were rare because backing up the physical drive was a lot harder and required unmounting the filesystem while you backed up the blocks. Otherwise, you risked a contaminated backup, where some of the blocks would be from one point in time and some of the blocks would be from another point in time. Virtual snapshot technology, such as that found in the Windows Volume Shadow Copy Service (VSS) or VMware snapshots, solved this underlying problem.

Backing up at the volume level became much more popular once VMs came on the scene. Image-level backups allow you to perform a backup of a VM at the hypervisor level, where your backup software runs outside the VM and sees the VM as one or more images (e.g., VMDK files in VMware).

Backing up at the image level has a number of advantages. First, it provides faster backups and much faster restores. Image-level backups avoid the overhead of the file- or object-storage system and go directly to the underlying storage. Image-level restores can be much faster because file-level backups require restoring each file individually, which requires creating a file in the filesystem, a process that comes with quite a bit of overhead. This problem really rears its ugly head when restoring very dense filesystems with millions of files, when the process of creating the files during the restore actually takes longer than the process of transferring the data into the files. Image-level restores do not have this problem because they are writing the data straight to the drive at the block level.

Once the changing block issue was addressed with snapshots, backup systems were presented with the second biggest challenge of image-level backups: incremental backups. When you are backing up at the drive, volume, or image level, every backup of a changed image is a full backup. For example, consider a VM represented by a virtual machine disk (VMDK) file. If that VM is running and a single block in the VM changes, the modification time on that image will show that it has changed. A subsequent backup will then back up the entire VMDK file, even though only a few blocks of data might have changed.

This challenge has also been solved in the VM world via changed-block tracking (CBT), a mechanism that keeps track of when the previous backup was created and which blocks have changed since that backup. This allows an image-level backup to perform a block-level incremental backup by asking CBT which blocks have changed and then copying only those blocks.
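
The idea behind changed-block tracking can be illustrated with a toy model. This is not VMware’s CBT interface; the class, its methods, and the tiny block size are all invented for this sketch.

```python
BLOCK_SIZE = 4  # bytes per block; tiny so the example is easy to read

class TrackedDisk:
    """Toy virtual disk that remembers which blocks changed since the last backup."""

    def __init__(self, block_count):
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(block_count)]
        self.changed = set()

    def write(self, index, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        self.blocks[index] = data
        self.changed.add(index)  # record the change for the next incremental

    def full_backup(self):
        self.changed.clear()
        return list(self.blocks)

    def incremental_backup(self):
        """Copy only the blocks written since the previous backup."""
        delta = {i: self.blocks[i] for i in sorted(self.changed)}
        self.changed.clear()
        return delta

def apply_incremental(image, delta):
    """Rebuild a newer image from an older image plus a block-level incremental."""
    restored = list(image)
    for index, data in delta.items():
        restored[index] = data
    return restored

disk = TrackedDisk(block_count=8)
disk.write(0, b"boot")
base = disk.full_backup()
disk.write(5, b"data")
delta = disk.incremental_backup()
print(sorted(delta))  # [5]
```

Only one of the eight blocks is copied in the incremental, and applying it to the earlier full backup reproduces the current state of the disk.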

File-Level Recovery from an Image-Level Backup

This leaves us with one final disadvantage of backing up at the image level, and that is the lack of item-level recovery. Customers do not typically want to restore an entire VM; they want to restore a file or two within that VM. How do you restore a single file from a VM when you backed up the entire VM as a single image? This is also a problem that has been solved by many backup software companies. For example, in the case of a VMware VM, they understand the format of VMDK files, which allows them to do a number of things.

One option some backup products offer is to mount the original VMDK files as a virtual volume that can be browsed with the file explorer on the backup server or on any client where the backup software runs. The customer can then drag and drop whatever file(s) they are looking for from that image and tell the backup software to unmount it. In this case, the image is usually mounted read-only, facilitating these drag-and-drop type restores. (Mounting the VM image read-write and actually running the VM from that image is called instant recovery and is covered in Chapter 9.)

Other backup software products can index the images ahead of time, so they know which files use which blocks within the image. This allows these products to support regular file-level restores from these images without requiring the customer to mount the image and manually grab the files. The customer would use the same workflow they always use to select files for restore, and the backup system would do whatever it needs to do in the background to restore the files in question.
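The indexing approach can be sketched as a simple lookup from file path to block numbers. This is only an illustration under simplifying assumptions: a real index also has to track file lengths, permissions, and other metadata, and the paths, block contents, and restore_file function here are hypothetical.

```python
# Toy image: block number -> block contents (4-byte blocks for readability).
image_blocks = {0: b"boot", 1: b"Hell", 2: b"o, w", 3: b"orld", 4: b"logs"}

# Hypothetical index built at backup time: file path -> blocks that hold the file.
file_index = {
    "/home/alice/hello.txt": [1, 2, 3],
    "/var/log/app.log": [4],
}


def restore_file(path, index, image):
    """Reassemble one file by reading only the blocks the index lists for it."""
    return b"".join(image[block] for block in index[path])


restored = restore_file("/home/alice/hello.txt", file_index, image_blocks)
print(restored)
```

Because the index already knows where each file lives, the restore reads three blocks rather than mounting and browsing the whole image.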

Combining Image- and File-Level Backups

Most customers are performing image-level backups of their VMs while still retaining the ability to perform both incremental backups and item-level restores. They also want block-level incremental backups, which are actually much more efficient than item-level incremental backups.

Backing up at the VM level (i.e., image level) also comes with the potential to restore the VM easily as a single image. This makes what we used to call bare-metal recovery so much easier than it was. You get all the bare-metal recovery capabilities you need without having to jump through hoops to address the changing block issues historically found in image-level backups.

We even have image-level backups of physical Windows servers, since most people are using Windows VSS to create a snapshot of each filesystem prior to backing it up. This allows the backup software product to back up at the image level without risking data corruption.

Once you’ve decided what you’re backing up, you need to know how the things to be backed up are selected by the backup product. The following section is very important, because picking the wrong method can create significant gaps in your backup system.

Backup Selection Methods

Understanding how systems, directories, and databases are included in the backup system is the key to making sure that the files you think are being backed up are indeed being backed up. No one wants to find out the system or database they thought was protected wasn’t protected at all.

One caveat, though. Your backup selection methods only work once your backup system knows about the systems being backed up. That means the first step toward this goal is making sure servers and services you want backed up are registered with your backup and recovery system.

For example, if you start using a new SaaS such as Salesforce, the backup system will not automatically notice that addition and start backing it up for you. If you are fully virtualized on VMware and your backup system is connected to vCenter, the backup system will automatically notice if you add a new node to the configuration. But if you start using Hyper-V or Kernel-based Virtual Machine (KVM), the backup system will not automatically notice there is a new hypervisor in the datacenter and start backing it up. And of course your backup system will not notice you installed a new physical server. So these selection methods come with this caveat.
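One way to catch this kind of gap is to compare what you know you run against what the backup system has been told about. The sketch below assumes you can obtain both lists somehow (for example from an asset inventory and from a backup product's report); the workload names and the unregistered function are hypothetical.

```python
# Hypothetical inventory of everything the organization runs.
known_workloads = {"vcenter-prod", "salesforce", "hyperv-cluster-1", "phys-db-07"}

# Hypothetical list of sources registered with the backup system.
registered_sources = {"vcenter-prod", "phys-db-07"}


def unregistered(inventory, registered):
    """Return workloads that exist but that the backup system does not know about."""
    return sorted(inventory - registered)


gaps = unregistered(known_workloads, registered_sources)
print(gaps)
```

Anything in the resulting list is invisible to the backup system, so no selection method can protect it until it is registered.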

Selective Inclusion Versus Selective Exclusion

There are two very broad categories of how items can be included in a backup system: selective inclusion and selective exclusion.

In selective inclusion, the administrator individually specifies which filesystems, databases, or objects the backup system will back up. For example, if an administrator says they want to back up just the D:\ drive, or just the Apollo database, they are practicing selective inclusion.

In selective exclusion, AKA automatic inclusion, an administrator specifies backing up everything on the server except what is specifically excluded. For example, an administrator might select all filesystems except for /tmp on a Linux system or a user’s iTunes or Movies directories on a Windows laptop.

It’s very common for administrators to think they administer systems in such a way that there is no point in backing up the operating system. They know they want to back up C:\Users on a Windows laptop, /Users on a MacBook, or something like /data or /home on a Linux system. They see no point in backing up the operating system or applications, so they manually select just the filesystems they want to back up. The same is true of databases. They might not want to back up test databases, so they selectively include which databases to back up.

The problem with selective inclusion is configuration changes. Every time a new database or filesystem with data is added to a system, someone needs to change the backup configuration; otherwise, the new resource will never be backed up. This is why selective inclusion is much less safe than selective exclusion.

With selective exclusion, the worst possible side effect is that you might back up some worthless data (assuming you forgot to exclude it). Compare this to the worst possible side effect of selective inclusion, in which important data is completely excluded from the backup system. There is simply no comparison between the two. Selective inclusion may appear to save money because less data is stored, but it’s much riskier.
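The difference in failure modes is easy to see in a small sketch. The filesystem names and the two functions below are hypothetical; the point is only what happens when a new filesystem appears and nobody touches the backup configuration.

```python
def selective_inclusion(filesystems, include):
    """Back up only what the administrator explicitly listed."""
    return [fs for fs in filesystems if fs in include]


def selective_exclusion(filesystems, exclude):
    """Back up everything except what the administrator explicitly excluded."""
    return [fs for fs in filesystems if fs not in exclude]


include = {"/home", "/data"}
exclude = {"/tmp"}

before = ["/", "/home", "/data", "/tmp"]
after = before + ["/data2"]  # a new filesystem is added; the backup config is unchanged

included = selective_inclusion(after, include)
everything_else = selective_exclusion(after, exclude)
print("inclusion:", included)         # /data2 is silently missed
print("exclusion:", everything_else)  # /data2 is picked up automatically
```

With inclusion, /data2 is never backed up until someone remembers to add it. With exclusion, the only cost is that / is backed up too, which may be data you did not care about.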

It is easy to exclude data that you know to be worthless, such as /tmp or /temp on a Linux system. If you see no reason to back up the operating system, you might also exclude /, /user, /usr, /var, and /opt, although I would back up those directories. On a Windows system, you could exclude C:\Windows and C:\Program Files if you really don’t want to back up the OS and application programs.

One thing to consider, though, is the effect deduplication might have on this decision. It’s one thing to know you are backing up hundreds or thousands of filesystems that have no value, and wasting valuable storage space on your disk array or tape library. But what if the operating system that you are spending so much time excluding is actually only stored once?

Deduplication would ensure that only one copy of the Windows or Linux operating system is actually stored in your backup system. Considering this, perhaps you could just leave the backup system at its default configuration and not worry about excluding the operating system, because the cost to your backup system will be very small.
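Here is a rough sketch of why deduplication changes the math. It assumes a simple content-addressed store keyed by SHA-256 hash and pretends the operating system is three identical chunks on every server; real products chunk and index data in their own ways, and the DedupStore class is hypothetical.

```python
import hashlib


class DedupStore:
    """A toy chunk store that keeps each unique chunk only once."""

    def __init__(self):
        self.chunks = {}  # SHA-256 hex digest -> chunk bytes

    def put(self, chunk):
        digest = hashlib.sha256(chunk).hexdigest()
        self.chunks.setdefault(digest, chunk)  # store only if not seen before
        return digest

    def stored_bytes(self):
        return sum(len(chunk) for chunk in self.chunks.values())


# Stand-ins for operating system data that is identical on every server.
os_chunks = [b"kernel" * 100, b"libc" * 100, b"shell" * 100]

store = DedupStore()
logical_bytes = 0
for server in range(50):
    unique_data = f"data for server {server}".encode()
    for chunk in os_chunks + [unique_data]:
        store.put(chunk)
        logical_bytes += len(chunk)

print("bytes sent to backup:", logical_bytes)
print("bytes actually stored:", store.stored_bytes())
```

The operating system chunks are stored once no matter how many servers send them, so the stored size grows only with each server's unique data.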

Tag-Based and Folder-Based Inclusion

Another way to add backup data to the backup system automatically is by tag-based and folder-based inclusion. This has become popular in the virtualization world, where each new VM or database that is created can be given one or more tags (or placed in a particular folder) that can be used to classify the type of VM or database.

For example, all new database servers might be given the database tag or put in the database folder, indicating to multiple other processes that it is a database-related VM. This might tell certain monitoring systems to monitor whether the database is available. It might also automatically apply certain security rules and firewalls to that VM. And in many backup systems, it can also automatically apply a database-centric backup policy to that VM as well.

One important thing to note when using this method: You need a default backup policy. You should create a backup policy that your backup software automatically uses if no appropriate tags are found, or if a VM is not in a particular folder. Then make sure to monitor that default policy for any new systems that show up, because it means that the data on those systems might not be getting properly backed up. If your backup software product does not support a default backup policy when used with this method, it might be best not to use this functionality, because it comes with the risk of new VMs or databases not being backed up.
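A minimal sketch of tag-based policy assignment with a default policy might look like the following. The tag names, policy names, and assign_policy function are all hypothetical and do not correspond to any particular backup product.

```python
# Hypothetical mapping of tags to backup policies.
TAG_POLICIES = {"database": "db-policy", "web": "web-policy"}
DEFAULT_POLICY = "default-policy"


def assign_policy(vm):
    """Return the policy for the first recognized tag, or the default policy."""
    for tag in vm["tags"]:
        if tag in TAG_POLICIES:
            return TAG_POLICIES[tag]
    return DEFAULT_POLICY


vms = [
    {"name": "db01", "tags": ["database"]},
    {"name": "web01", "tags": ["web", "prod"]},
    {"name": "new-vm", "tags": []},  # created without any tags
]

assignments = {vm["name"]: assign_policy(vm) for vm in vms}

# Anything that landed on the default policy deserves a look from a human.
needs_review = [name for name, policy in assignments.items() if policy == DEFAULT_POLICY]
print(assignments)
print("review:", needs_review)
```

The untagged VM still gets backed up, and it shows up in the review list so someone can decide whether the default policy is really the right one for it.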

Note

We have our server build process automated and VMs are backed up by an SLA added to the folder the VMs go in. For database backups, we need to add an agent and add them to our backup software and add an SLA. This is all done during the server build when DB is checked. That way we hopefully don’t miss a backup that we weren’t aware of.

—Julie Ulrich

Takeaways

If you’re designing or operating a backup system for which you do not have an agreed-upon SLA for RTO and RPO, you are asking for trouble. Take the time now to document one, and remember that it should come from the organization—not from IT. In addition, if you have a documented RTO and RPO and you know your RTA and RPA are nowhere near it, now is the time to speak up.

As to the backup myths, all sorts of them say you don’t need to back up X, Y, or Z. That is almost never true. If you have a system that truly does have backup and recovery as part of its feature set, make sure to get that in writing.

As to backup levels, I think most people don’t need that entire section. Most modern backup products are doing block-level incremental backups forever and really don’t use the backup levels from days gone by.

Default to selective exclusion (i.e., automatic inclusion). Spend your valuable time on other administrator activities, without having to worry whether your new database is being backed up. Backup priorities should always be safety and protection first, cost second. No one ever got fired because their backup system backed up too much data, but plenty of people have been fired for not backing up enough data.

The next chapter is dedicated to the way most people back up today: disk as their primary target. I’ll cover the different ways people use disk in their backup systems and follow that up with different ways people recover.

¹ For a complete history of the game and a URL where you can play it on the web, see http://www.math.toronto.edu/mathnet/games/towers.html.

² This is always the case for any recommendation in this book. If it confuses you or your backup methodology, it’s not good! If your backups confuse you, you don’t even want to try to restore! Always Keep It Simple SA . . . (K.I.S.S.).

³ A chunk is a collection of bytes. Most people use the term chunk rather than block, because blocks tend to be a fixed size, and chunks can be any size. Some dedupe systems even use variable-sized chunks.
