For more than two years now I have been furious about what standard backup programs deliver. Yes, they provide safety by making copies of your data at regular intervals, ensuring that you can copy it back when necessary. But when you take a closer look at what they really do, especially with regard to versioning, you are in for a surprise. I stumbled over this when I investigated the following scenario more thoroughly:
Scenario
For several years now I have administered a local network with some workstations and two servers. For historical reasons, one of the servers is a Linux box (LinS), which is mainly responsible for internet-related routing and similar tasks, whilst the Windows Server (WinS) acts as a simple file share server for the core application. Safety of the data is considered key in this setup, especially concerning the Windows Server (losing internet access for a day or two is not desirable either, but survivable). Both servers therefore feature a RAID1 array. Previously, the data on the Windows Server was backed up to DVD+RW with a simple commercial backup program. While the RAID1 array in the WinS already ensured that a failing disk did not cause irrevocable data loss, an accidental "push of the DEL key" could still wipe out all the data (which is also why relying on a RAID array alone as your "backup" is never a good idea). Consequently, the major task for this commercial software was to provide differential backups so that, if required, a previous state could be restored (and so that data garbled some days earlier could still be recovered).
The Archive Bit Algorithm
Formulating this requirement to the software was, however, the easy part. In real life, incremental or differential backups mostly focus on files that were changed, and this is meant literally. Many backup software implementations, especially in the incremental backup case, only refer to the archive bit in the directory entry of each file within the backup selection. Here is how it works (a small sketch follows the list):
- When a file is created, it automatically gets the archive bit set.
- When a file has been changed, the operating system (Windows, in particular) automatically sets the archive bit of that file if it is not already set.
- The incremental backup software checks which files have the archive bit set. Every such file is considered changed and is therefore stored in the current backup set. Once the backup has completed successfully, the archive bit of all files that were backed up is cleared.
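To make this concrete, here is a minimal Python sketch of such an archive-bit-driven incremental backup, assuming a Windows host. The function and path names are made up for illustration; the archive attribute is read via os.stat() and cleared via the Win32 SetFileAttributesW call.

```python
import ctypes
import os
import shutil
import stat

def incremental_backup(source_root, backup_root):
    """Copy every file whose archive bit is set, then clear the bit.

    Note what is missing: files deleted from source_root since the last
    run leave no trace whatsoever in the backup set.
    """
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            attrs = os.stat(src).st_file_attributes
            if not attrs & stat.FILE_ATTRIBUTE_ARCHIVE:
                continue  # considered unchanged since the last backup
            dst = os.path.join(backup_root, os.path.relpath(src, source_root))
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)
            # Clear the archive bit so the next run skips the file.
            ctypes.windll.kernel32.SetFileAttributesW(
                src, attrs & ~stat.FILE_ATTRIBUTE_ARCHIVE)
```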
Although at first glance this looks like a very efficient and appealing algorithm, it is not. There is one major drawback, which I found during a short simulated backup restore drill: what about deleted files? They simply do not have an archive bit!
Versioning and Backup Software
In my case it became evident that the backup software used exactly the algorithm stated above (which, by the way, was the reason it was so fast, one of the advertised "features" of the product). However, the non-existence of a file can carry semantics, too. In our use case, the application whose data was being backed up uses a dedicated file per object to indicate a distributed lock (not a very clever approach either, but that's another story). Restoring all the incremental backups therefore made all the old lock files "reappear". Fortunately, the application had a mechanism to detect stale locks, so no severe impact was observed beyond poor performance while the system cleaned up the mess. Shortly afterwards, some very fancy, inconsistent behaviour confirmed the earlier rumbling in my stomach: some time before, during the time frame covered by the incremental backups, the application had moved files from one directory to another. As usual, the move was implemented as a copy from the source to the destination followed by a clean-up at the source location. The archive bit was thus set properly, but only at the target position. Because the backup ignored deletions, the "old version" that existed at the time of the move "reappeared" after the restore. Whenever the application looked at the "previous location" (for example, for compatibility reasons), it "fell over" the old version, which did not fit the rest of the data it had read before. In short, this backup application is an excellent example of how the incompleteness of a feature can cause significant inconsistency that may only be detected "too late".
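The restore side makes the problem obvious. A naive restore replays the full backup followed by every incremental set, and because none of the sets records deletions, anything that was deleted or moved away in between simply comes back. A rough sketch (the function and variable names are again my own, not those of any real product):

```python
import os
import shutil

def naive_restore(full_set, incremental_sets, target):
    """Replay the full backup and then each incremental set in order.

    Later sets only ever add or overwrite files; nothing is ever removed
    from `target`, so deleted lock files and the pre-move copies of
    moved files all reappear.
    """
    for backup_set in [full_set] + list(incremental_sets):
        for dirpath, _dirnames, filenames in os.walk(backup_set):
            for name in filenames:
                src = os.path.join(dirpath, name)
                dst = os.path.join(target, os.path.relpath(src, backup_set))
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.copy2(src, dst)
```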
Versioning and Subversion
Some time ago I experienced similar consistency issues with another program (there it was not about files, but about directories): the Concurrent Versions System, better known by its acronym CVS. At the end of that story I decided to switch tools and move to the next-generation versioning tool called Subversion (svn). Subversion is an application built for exactly the purpose of versioning and uses the paradigm of database transactions to ensure ACID-like operations. This even allows it to version directories. It belongs to the group of revision control programs (more resources on this tool can be found in the additional resources section at the end of this post). When I saw the similarity of the two problems, I wondered whether it would be possible to reuse Subversion to provide the historical information for a backup strategy as well. From that experience I knew that svn did not suffer from the same "stale file" issue as the backup program above. Even better: Subversion has a very compelling diff algorithm that stores deltas between versions very compactly, even for binary files (admittedly, the application mentioned above mostly uses text files for data storage, so this was not a key differentiator). On top of that, it comes with an open-source license, which makes it quite cost-efficient, at least in terms of acquisition. Yet, there were some minor obstacles.
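To make the basic idea concrete before turning to those obstacles: the data is first mirrored into a Subversion working copy (for example with robocopy /MIR or rsync --delete) and then committed, so that additions and deletions both become part of the recorded history. The following Python sketch only illustrates that flow; it assumes the plain svn command-line client on the PATH, and the column-based parsing of the svn status output is a simplification.

```python
import subprocess

def commit_backup(working_copy, message):
    """Record the current state of a mirrored working copy as a new revision.

    New files are scheduled for addition, files missing from disk are
    scheduled for deletion -- so, unlike the archive-bit approach,
    deletions become part of the backup history.
    """
    status = subprocess.run(["svn", "status", working_copy],
                            capture_output=True, text=True, check=True).stdout
    for line in status.splitlines():
        flag, path = line[0], line[8:].strip()  # assumes the usual column layout
        if flag == "?":        # not yet under version control
            subprocess.run(["svn", "add", path], check=True)
        elif flag == "!":      # vanished from the mirror
            subprocess.run(["svn", "delete", path], check=True)
    subprocess.run(["svn", "commit", "-m", message, working_copy], check=True)
```

Checking out any revision of that repository then reproduces the data exactly as it looked at that point in time, including the absence of files that had been deleted.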