Incremental backups

Most backup software claims to be able to perform incremental or differential backups. Almost universally, they do it wrong. This paper attempts to explain the problem, why it is more difficult than it seems, and some possible solutions; including the solution taken by Adump.

The problem

From a backup software point of view, incremental and differential backups are the same problem. What name they should be given is more a matter of policy, rather than algorithm. In either case, the problem is to reduce the size of a particular backup based on the assumption that certain earlier backups are available.

For sake of terminology, we will define the following terms for discussing backups:

Prior backup
A backup that was performed earlier in time. Although this prior backup may not be a full backup, we assume that it solves the incremental backup problem correctly, therefore when restored, the filesystem is in the same state it was in at the time of the backup.
Current backup
The backup we are performing right now. It attempts to capture the state of the filesystem at this point in time. For the sake of this paper, we assume that is it possible for a user to restore a filesystem to the state of the prior backup

To sum this up, we need to store the differences between the prior backup and the current backup.

Create or modify

Before we get into the ways serious programs can mess up backups, let's talk a little bit about some real naive ways to backup.

Every file on a Unix-type filesystem contains three timestamps: ctime, mtime, and atime. The atime changes every time the file is accessed. This isn't real significant to backup software, and is actually a problem for us, since backing up the file will change this time. However, trying to put back the access time makes things much worse.

The other two times together indicate two different things. One is a timestamp that is good for representing the last modification time of a file. This is used by programs such as make and cook to determine when files have changed.

The ctime is present for the simple reason that the user can set the mtime. When a file is extracted from a tar archive, tar sets the mtime of the file to the same time it was when it was archived away. Correct utilities should only set the mtime back when the file data is restored to the same as it was at that time. Whether this really happens or not is not of concern to us.

So what is the ctime for? The OS sets the ctime to the current time whenever the mtime changes. (Not quite, there are a few other things that can change it). It is obviously set to the current time when the file is created (hence the origin of the name), but it is updated when the mtime changes.

Backup software must then look at both ctime and mtime to determine the date of the file. If the ctime is newer, it may mean that the file was restored from an archive. Although the user wishes to see the file as being old, the backup software should back it up, since it has been changed. A simple way for the backup software to do this is to use the newer of ctime and mtime as the timestamp for the file.

How to do it wrong

Many backup utilities include an innocent sounding option stating that only files newer than a given date should be backed up. This will certainly exclude many files from the backup that haven't changed. But, it also can leave out many files from the backup that have changed.

Keep in mind that we are discussing the technique of determine how to backup the files. Difficulties in determining this date/time are a matter of policy, and are not usually difficult to solve.

What about deleted files?

The most glaring omission by this strategy is that it does nothing about files that have been deleted. The restored backup will have lots of extra files that were not present in the filesystem at the time the backup was performed.

This obviously puts us in conflict with the problem we are trying to solve by backing up (restore the filesystem to the state it was in at the time of the backup). It can also cause more significant problems, such as there not being sufficient space upon restore because of all of the extra files.

I've been moved

There is an even deeper problem with this backup model. Immagine a directory layout something like the following:

  dir-1/
      sub-a/
            myfile
where myfile is an ordinary file two directories down. Between the prior backup and the current backup, the user renames dir-1 to dir-2. None of the time fields on myfile change at all here. However, the file is in a different place. The timestamp backup does not catch at all that this has changed.

Ok, how to do it right

There are several ways of getting these backups right. First, we will begin by looking at modern versions of GNU tar. Then we will look at two other ways to do it.

GNU tar

GNU tar now has an option -g for listed incremental backups. This backup format stores a text file for each backup run that captures snapshot data about the backup. Let's take a look at what is in this file.

The first line is a decimal representation of the timestamp of the backup. GNU tar uses this in many cases to make backup decisions.

Below this is a single line for every directory present in the backup. The pair of numbers for each is the device and inode number of that directory. This allows gnu tar to catch cases where dir-1 was renamed to dir-2, and then dir-3 was renamed to dir-1. If GNU tar encountes a directory that was not present in the prior backup, or was a different directory node, it ignores the dates on all files below that level, and always backs them up.

The listed incremental option also solves the problem of file deletions. Before each directory contents is output into the backup, GNU tar writes a special extended file that lists the contents of that directory. Upon restore of an incremental backup, GNU tar will delete any files that were not present at the time of backup. Although this would be dangerous to run if other files were present, it certainly works correctly when fully restoring backups.

A Database Approach

Another way to do the incremental backups is to store a database of the state of the filesystem at the time of each backup. This database can be compared with the current tree as the filesystem is scanned.

Adump currently uses this approach. It also allows it to generate a list of files and directories that can be removed.

This approach has a couple of advantages. First of all, we can generate the remove list, which we output before the archive containing all of the new file data. This prevents us from having to extend the backup file format. Second, it works (mostly) on filesystems that don't have a ctime field (largely promoted by a company in Washington).

There is a disadvantage, though, and that is the space to store the database. The database does end up being quite a bit smaller than the filesystem (only stores names and timestamps for every file). If backup software keeps online databases for quick recover of individual files, this information would need to be kept anyway. It is likely that Adump will continue to use this format.

An inode Perspective

A third way to do incremental backups is to use the approach used by the dump/restore programs from BSD. Instead of viewing the filesystem as a tree of names and data, it views the filesystem the way the OS does, as a collection of inodes. Each inode corresponds to a file. The directory structure gives a mapping between names and directories to the actual inodes. Dump backs up both this directory structure and the inodes themselves. Because it writes the inodes independently of their location, it handles the moved directory problem easily. Backups are also quite a bit faster since directories aren't traversed.

One difficulty with this backup method is how to get to the inode data. Without special OS extensions, you can't open files by inode, you have to go through the dir/file interface. This makes the backup order no longer correspond to the file layout, and will probably greatly reduce efficiency of the backup.

Another greater difficulty is that we have moved the complexity of figuring out directory changes to the restore utility. The database we thought we could get away from is now needed for the restore. My philosophy on this is that if complexity can be placed on backup or restore, put it on the backup. Any problems with the database code can be caught with the backup software, which is run frequently.

So what does Adump do

The first part of Adump that I wrote was the engine to process incremental backups. I consider this a very important part of backing up. I have too much data to do full backups every day (plus, I don't like buying tapes drives more often than I need to). Doing incremental backups wrong would be even worse though.

It turns out that this incremental engine was quite a bit of work. I had to come up with a database, as well as figure out all of the special cases that could happen with directories and files being added or removed. I have heavily tested this and have quite a bit of confidence in it.

After writing this code, I then moved to the actual backup. I've started with the CPIO archive format, because it is simple, and contains all that I need (tar fails on the simple part, and has weird problems with longer filenames). Later, I expect to move to my own archive format, but that will be the ramblings of another paper.


This page last modified 2001/09/29.