{"id":19,"date":"2010-01-21T19:37:10","date_gmt":"2010-01-21T19:37:10","guid":{"rendered":"http:\/\/blog.cts-schmoigl.com\/?p=19"},"modified":"2010-02-12T21:51:57","modified_gmt":"2010-02-12T21:51:57","slug":"using-subversionapache-webserver-as-backup-storage","status":"publish","type":"post","link":"http:\/\/blog.schmoigl-online.de\/?p=19","title":{"rendered":"Using Subversion+Apache Webserver as Backup Storage"},"content":{"rendered":"<p>For more than two years I have been frustrated by what standard backup programs deliver today. Yes, they provide safety by making copies of your data regularly, ensuring that you can copy it back when necessary. But when you take a closer look at what they really do, especially in the sense of versioning, you may be in for a surprise. I stumbled over this when I investigated the following scenario more thoroughly:<\/p>\n<h4>Scenario<\/h4>\n<p>For several years now I have administered a local network with some workstations and two servers. For historical reasons, one of the servers is a Linux box (LinS), which is mainly responsible for the internet-related routing and similar tasks, whilst the Windows Server (WinS) acts as a simple file share server for the core application. Safety of the data is considered key in this situation &#8211; especially concerning the Windows Server (losing internet access for a day or two is not desirable either, but survivable). Nevertheless, both servers feature a RAID1 system. Previously, the backup of the Windows Server&#8217;s data was done via DVD+RW with a simple commercial backup software. While the RAID1 array in the WinS already ensured that a disk failure did not cause an irrevocable data loss, simply &#8220;pushing the DEL key&#8221; on the keyboard could still cause a complete data loss (this is also the reason why &#8220;only backing up your data on a RAID array&#8221; is never a good idea on its own). 
Consequently, the major task for this commercial software in this situation was to provide differential backups such that &#8211; if required &#8211; it would be possible to restore a previous state (to also be immune against data that had been garbled some days before).<\/p>\n<p><!--more--><\/p>\n<h4>The Archive Bit Algorithm<\/h4>\n<p>Formulating this requirement, however, was easier said than done. In real life, <a href=\"http:\/\/en.wikipedia.org\/wiki\/Incremental_backup\">incremental or differential backups<\/a> in most cases focus only on files that were <strong>changed<\/strong> &#8211; and this is meant literally. Many backup software implementations &#8211; especially in the incremental backup case &#8211; only look at the archive bit in the directory entry of each file within the backup selection. Here is how it works:<\/p>\n<ul>\n<li>When a file is created, it automatically gets the archive bit set.<\/li>\n<li>When a file has been changed, the operating system (especially Windows) automatically sets the archive bit of that file, if it is not already set.<\/li>\n<li>The incremental backup software checks which files have the archive bit set. If it is set, the file is considered changed and is thus stored in the current backup set. Once the backup has been executed successfully, the archive bit of all files backed up is reset\/cleared.<\/li>\n<\/ul>\n<p>Although at first glance this looks like a very efficient and appealing algorithm, it is not. There is one major drawback, which I discovered during a short simulated backup restore drill: what about deleted files? They simply do not have archive bits!<\/p>\n<h4>Versioning and Backup Software<\/h4>\n<p>In my case it became evident that the backup software used exactly the algorithm stated above (and, by the way, this was the reason why it was so fast &#8211; one of the advertised &#8220;features&#8221; of this product). However, the non-existence of a file can have semantics, too. 
In our use case, the application whose data was backed up uses a dedicated file per object to indicate a distributed lock (not a very intelligent approach either, but that&#8217;s another story). Restoring all the incremental backups thus caused all old locks to &#8220;reappear&#8221;. Fortunately, there was an automatism inside the application to detect stale locks, so no severe impact (besides an initial bad performance while the system cleaned up the mess) was observed. Shortly afterwards, some very strange, inconsistent behaviour confirmed the earlier rumbling in my stomach: some time before (within the time frame covered by the incremental backups) the application had moved files from one directory to another. As usual, the move was implemented as a copy from the source to the destination, with a clean-up of the source location afterwards. Thus, the archive bit was set properly (at the target position). As the backup ignored deletions, the &#8220;old version&#8221; that existed at the point in time when the move was executed &#8220;reappeared&#8221;. Whenever the implementation of the application looked at the &#8220;previous location&#8221; (for example, for compatibility reasons), it &#8220;fell over&#8221; the old version, which did not fit the rest of the data it had read before. In short, this backup application is an excellent example of a situation where the incompleteness of a feature may cause significant inconsistency which happens to be detected &#8220;too late&#8221;.<\/p>\n<h4>Versioning and Subversion<\/h4>\n<p>Some time ago I experienced similar consistency issues with another program (there it was not about files, but about directories): the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Concurrent_Versions_System\">Concurrent Versions System<\/a>, better known by its acronym CVS. 
At the end of this story I decided to switch programs and move to the next-generation versioning tool <a href=\"http:\/\/subversion.tigris.org\/\">Subversion<\/a> (svn). Subversion is an application built exactly for the purpose of versioning; it uses the paradigm of database transactions to ensure <a href=\"http:\/\/en.wikipedia.org\/wiki\/ACID\">ACID<\/a>-like operations. By this it can even <a href=\"http:\/\/www.devx.com\/opensource\/Article\/27884\">deal with versioning of directories<\/a>. It belongs to the group of <a href=\"http:\/\/en.wikipedia.org\/wiki\/Revision_control\">revision control<\/a> programs (more resources on this tool can be found in the additional resources section at the end of this post). When I saw the similarity of the two problems, I wondered whether it would be possible to reuse Subversion for providing historical information in backup strategies as well. From that experience I knew that svn did not suffer from the same &#8220;stale file&#8221; issue as the backup program above. And even better: Subversion has a very compelling diff algorithm for storing deltas between versions in a very compact manner &#8212; even in the case of binary files (admittedly, the application mentioned above mostly uses text files for data storage, so this was not a key differentiator). On top of that, it comes with an open-source license, which makes it quite cost-efficient, especially in terms of acquisition. Yet, there were some minor obstacles.<br \/>\n<!--nextpage--><\/p>\n<h4>Added and Removed Files\/Directories<\/h4>\n<p>From experience I knew that it is necessary to &#8220;register&#8221; and &#8220;deregister&#8221; files with Subversion before it can do its &#8220;magic&#8221;. 
To be more precise, there are commands for<\/p>\n<ul>\n<li>adding files\/directories,<\/li>\n<li>removing files\/directories and<\/li>\n<li>committing the changed state to the server (changed content of files is detected automatically during this call).<\/li>\n<\/ul>\n<p>While the third bullet item solved the problem of changed files, it opened up the &#8220;can of worms&#8221; of both new and deleted files. At first glance this appears to be more of a step backwards than forwards. However, there is support for this situation: the command line tool <code>svn<\/code> can report details about the current state of the files. One may call <code>svn status<\/code> to query a list of changes made locally. For each file, its name and status are printed. For example, the value &#8220;?&#8221; indicates that a file (or directory) is available locally, but has not yet been added to the set of files to be sent to the server on the next commit. Similarly, &#8220;!&#8221; indicates that a file has been deleted locally, but the deletion has not yet been queued to be sent to the server.<br \/>\nThe resolution of this matter was a simple batch file which looks as follows (derived from <a href=\"http:\/\/gael-varoquaux.info\/computers\/svnautocommit\/index.html#using-the-autocommit-script-under-windows\">svnautocommit<\/a>):<\/p>\n<pre lang=\"dos\">for \/f \"tokens=2*\" %%i in ('svn status %1 ^| find \"?\"') do svn add \"%%i %%j\"\r\nfor \/f \"tokens=2*\" %%i in ('svn status %1 ^| find \"!\"') do svn delete \"%%i %%j\"<\/pre>\n<p>(where <code>%1<\/code> denotes the directory in which the working copy is stored). To say it right away: this isn&#8217;t the most efficient way of doing so, but in my case it did the job. 
Adding the command<\/p>\n<pre lang=\"dos\">svn commit -m \"Automatic Commit\" %1<\/pre>\n<p>afterwards enabled this batch file to be executed automatically with the help of the <a href=\"http:\/\/support.microsoft.com\/kb\/308569\">Windows Task Scheduler<\/a>.<\/p>\n<h4>Adaptation to the Use Case and Long-Running Commits<\/h4>\n<p>Having figured out this part of the theory, I sprang into action: it had always rankled me that I did not copy the data &#8220;out&#8221; of the WinS server. As stated before, the backup data was stored on a DVD+RW, but this medium always had to remain in the drive of the server system (ready for the next backup). A better approach is to make sure that the data leaves the system and gets copied to some foreign location (in case of fire, this would ideally be another room or building, which is not possible in my case; but even using another computer is at least better than nothing). Seeing a chance to change that, I installed an Apache webserver plus the Subversion server package on the LinS. On the WinS I installed the <a href=\"http:\/\/subversion.tigris.org\/getting.html#windows\">Subversion command line tool<\/a> and uploaded the data to be backed up from the WinS to the LinS (it would have been better to use the <code>svn import<\/code> command, though I used the <code>svn add<\/code> command). During this activity I hit another unexpected problem: when preparing to upload a large file to the HTTP repository during execution of the <code>svn commit<\/code> command, the client (here, the WinS) required some time to calculate the difference to be sent to the Subversion server via the Apache webserver. In my case, the KeepAlive option of the webserver was enabled in \/etc\/httpd\/httpd.conf (which is a good idea as such), but the KeepAliveTimeout parameter there was set to 15 seconds. 
If the client took more than this grace period to send the next file to the server, the webserver closed the connection (due to the KeepAlive timeout), which in turn caused the client to abort the upload with a cryptic error message. I got on the track of this explanation after reading through <a href=\"http:\/\/www.tty1.net\/blog\/2008-01-19-subversion-tuning_en.html\">this article<\/a>. In my case I changed the standard value for KeepAliveTimeout from 15 seconds to 150 seconds, which did the trick &#8211; I have never experienced stalled commits since. Please note that this might be a bad idea if your webserver also serves &#8220;real clients&#8221;, i.e. normal web pages &#8211; in my scenario the almost sole purpose of the webserver is to deal with Subversion requests, so I could make that change globally without risking anything.<br \/>\n<!--nextpage--><\/p>\n<h4>Tagging<\/h4>\n<p>To make my life easier and to be able to find out which version was valid at which point in time, I also implemented a <code>tags<\/code> subdirectory. After each backup commit to the Subversion server, I automatically create a tag directory of the current version by using the remote copy functionality of the Subversion client (for details refer to <code>svn copy --help<\/code>, use case &#8220;URL -&gt; URL&#8221; there), knowing that this is not strictly necessary with Subversion. This is only for my personal convenience, as I do not like to specify dates when checking out data from a versioning system like Subversion.<br \/>\nPlease note that using the local copy function (use case &#8220;WC -&gt; WC&#8221;) is a bad idea in this case, as executing the commit command will then also copy the entire data locally on your filesystem, which will eat up your entire disk space quite quickly over time. 
Using the remote variant makes the Subversion server create a reference copy (lazy copy), which thus only takes a few kilobytes for storing a reference in a transaction file, even if you have a large number of huge files stored in the repository. The downside is that this cannot be done together with the commit, but is executed as a separate transaction on the Subversion server. However, this should not pose a major issue, as the number of transactions is practically unlimited.<\/p>\n<h4>The Script<\/h4>\n<p>If you want to set up a similar scenario you may benefit from the following references: <\/p>\n<ol>\n<li>Download this file: <p><strong><a href=\"http:\/\/blog.schmoigl-online.de\/?dl_id=2\">Subversion Backup Scripts<\/a><\/strong> (1.3 KiB, 1,214 hits)<\/p> (<a href=\"?p=21\">License<\/a> information for these files). It contains my set of scripts and a collection of subdirectories which might be handy for you. In any case, make sure that you adjust the repository&#8217;s path in the file <code>backup_run.cmd<\/code>.<\/li>\n<li>To run these programs you will need the <a href=\"http:\/\/subversion.tigris.org\/getting.html#windows\">Windows Subversion command line tool<\/a> installed on the system.<\/li>\n<\/ol>\n<h4>Further Aspects<\/h4>\n<ul>\n<li>If you do not have a Linux box for backup at hand but another Windows system, you might find this blog post on <a href=\"http:\/\/www.reloadedpc.com\/uncategorized\/setup-wamp-svn-subversion-windows\/\">installing WAMP with a Subversion server<\/a> useful.<\/li>\n<li>Please note that you might now have one more problem which you did not have before: how do you get rid of old versions? I will elaborate on this topic in a separate posting.<\/li>\n<li>Wherever there is light, you will also find shadows. 
The dark side of this approach is that the entire data is stored twice on the system being backed up: once as the original (and possibly changed) data and once again in the corresponding .svn subdirectories. This is a known behaviour of the Subversion client, which enables it to efficiently determine what has been touched before committing only the data which is necessary. In our case, however, this implies that you double the required disk space.<\/li>\n<li>Besides the disk space required for those .svn subdirectories, there is another drawback: in some rare cases, the existence of this additional .svn subdirectory in each directory of your backup directory structure may pose problems or cause a different behaviour of the application. This may happen, for example, if the application wants to delete an otherwise empty directory, but cannot because the .svn directory is still there. Such applications need to be programmed intelligently enough to ignore this directory automatically.<\/li>\n<li>A very similar approach could also be achieved via the recent &#8220;Previous Versions&#8221; functionality of newer Windows versions. Microsoft refers to it as <a href=\"http:\/\/support.microsoft.com\/kb\/308569\">Volume Shadow Copy<\/a>. In my case it was not an option, because the operating system version of the Windows Server is too old. 
Thus, I have not played around with it yet.<\/li>\n<\/ul>\n<h4>Additional Resources<\/h4>\n<ul>\n<li>If you are interested in a quick tutorial on Subversion, you might want to read <a href=\"http:\/\/aymanh.com\/subversion-a-quick-tutorial\">http:\/\/aymanh.com\/subversion-a-quick-tutorial<\/a><\/li>\n<li>For a discussion of the drawbacks and impact of using Subversion as a backup server, you can also read <a href=\"http:\/\/stackoverflow.com\/questions\/61888\/using-subversion-for-general-purpose-backup\">http:\/\/stackoverflow.com\/questions\/61888\/using-subversion-for-general-purpose-backup<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Subversion servers can be used surprisingly well for implementing incremental backup strategies. Since Subversion incorporates a good difference-handling algorithm and provides magnificent support for investigating versions, it automatically provides features which are ahead of many standard backup programs. This post explains how I came to that view and describes the scenario in which it is in productive 
use.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[],"class_list":["post-19","post","type-post","status-publish","format-standard","hentry","category-subversion-svn"],"_links":{"self":[{"href":"http:\/\/blog.schmoigl-online.de\/index.php?rest_route=\/wp\/v2\/posts\/19","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/blog.schmoigl-online.de\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/blog.schmoigl-online.de\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/blog.schmoigl-online.de\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/blog.schmoigl-online.de\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=19"}],"version-history":[{"count":72,"href":"http:\/\/blog.schmoigl-online.de\/index.php?rest_route=\/wp\/v2\/posts\/19\/revisions"}],"predecessor-version":[{"id":195,"href":"http:\/\/blog.schmoigl-online.de\/index.php?rest_route=\/wp\/v2\/posts\/19\/revisions\/195"}],"wp:attachment":[{"href":"http:\/\/blog.schmoigl-online.de\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=19"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/blog.schmoigl-online.de\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=19"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/blog.schmoigl-online.de\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=19"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}