As mentioned in the "Copyright and Information Security" article in this issue, October is "National cyber security awareness month." In keeping with that theme, we decided to reprint this article from the July 2004 issue of Benchmarks Online.
Although the article is several years old, the information is still accurate. The numbers listed in the "Archive Systems" section are dynamic and therefore generally representative of how things are in 2007. For a technical discussion of backing up data, see this article Mr. Gustavus recently wrote for another venue. -- Ed.
By Duane Gustavus, Research Computing Support Manager
The promise of technology has always been double-edged. Implicit in the power to acquire and manipulate ever more information at the click of a mouse is the power to decimate a career's worth of data just as easily. In every computer-dominated discipline are tragic tales of the person who lost irreplaceable data, a freshly completed dissertation or critical spreadsheet due to "computer failure". Closer to home, most of us have lost a piece of E-mail or maybe a URL or address we now want. At times like these, the need for backups is not a bone of contention, but it is still often the case that a useful copy of the missing data is simply not available.
Given that situation, you might presume that a discussion of the value of backups is unnecessary; one must simply have them. While I agree, statements like this are rather like unfunded federal mandates; they assume the only relevant question is desirability, and thus avoid entirely the thorny issues of implementability which are, of course, at the root of the problem. Let's pretend we're not the federal government, and are therefore willing to examine some of those icky implementation details.
When you assume . . .
Perhaps you feel that implementation issues are only for technical folks; that you may assume your backup requirements are already adequately addressed because you once heard the phrase "you are backed up". I certainly have no intention of questioning the competence of your support group, but would ask you to consider who values your data most highly? Who actually owns your data (we are a State institution remember)? Where will the inconvenience (or worse) fall if your data are irrecoverable? Is it possible that something you don't know about backups might bring grief, more specifically avoidable grief? Stick with me a few paragraphs and see if you don't at least come up with some useful questions concerning the persistence of your notoriously ephemeral digital data.
A backup Q&A
The first question to ask should be "Is my system being backed up?". If the answer to this question is no, then I hope you have no reason to consider the system "your" computer. This is not a question of property ownership, but rather the observation that any computer you can access should serve you equally well because there should be no hint of personalization on the system. When the current machine dies, you can conveniently move to another one. If, on the other hand, you do have some reason to prefer one computer over others, perhaps it is time to consider how you will recover what makes it uniquely valuable when it breaks. If you don't have access to a departmental backup service (or perhaps, as I will strongly hint later, even if you do), you might consider accepting the ultimate responsibility for your own data. Forgive me if this proposal sounds a bit radical.
If you answered in the affirmative to the first question, the next useful question would be "What is on the backups?". For instance, is that vital piece of E-mail you just read on backups? Possibly, but more probably the answer is "Not yet." Any backup policy has a window of exposure to loss of new data that is at least the period of time it takes to make the backup. In most cases, the exposure window is considerably longer than that because the facility must service several users in rotation. In addition, backups read virtually the entire disk, and so thrash the system quite a bit while they run, degrading interactive performance noticeably and possibly chewing up significant chunks of precious network bandwidth. For these reasons, at UNT most backups are made during the wee hours of the morning when users are few, and there is less contention for network bandwidth. In other words, that vital piece of E-mail is probably not on backups until the next day. If you accidentally remove it before then (through no fault of your own of course), it is probably gone.
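The exposure window can be put in concrete terms with a short sketch. It assumes one backup per night starting at 2:00 AM; that hour is an illustrative stand-in for "the wee hours", not a published UNT schedule, and the function names are hypothetical.

```python
from datetime import datetime, time, timedelta

def last_backup_start(now, nightly_at=time(2, 0)):
    """Return the start of the most recent nightly backup at or before `now`.

    `nightly_at` is an assumed start time, not a real schedule.
    """
    todays_run = datetime.combine(now.date(), nightly_at)
    return todays_run if now >= todays_run else todays_run - timedelta(days=1)

def is_exposed(file_saved, now, nightly_at=time(2, 0)):
    """True if the file was saved after the most recent backup began,
    so no backup copy of it can exist yet."""
    return file_saved >= last_backup_start(now, nightly_at)

# It is 3:00 PM; one message arrived this morning, one yesterday afternoon.
now = datetime(2004, 7, 6, 15, 0)
this_morning = datetime(2004, 7, 6, 9, 30)
yesterday = datetime(2004, 7, 5, 16, 0)
```

Here `is_exposed(this_morning, now)` is true and `is_exposed(yesterday, now)` is false: the message from this morning exists in exactly one place until tonight's run. A real exposure window is longer still, since a backup takes time to reach your files once it starts.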
"Given these niggling details, however, my vital E-mail is on backups in the next day or so, right?" Perhaps, if the E-mail is stored in one of the areas on a system that is backed up. That's right; not all the disk space on every system is backed up. While you may not be aware of it, many systems are configured to have "scratch" areas on them which are intended for temporary storage of short-lived data; other areas are assumed to be static where data never changes. It is quite common to avoid backing these areas up in order to reduce the burden on the backup systems. While you are probably not greatly concerned about burdening the backup hardware, the reality is that the folks who manage it have to be.
This is a recipe for real tragedy, but an example of those implementation details referred to above. There is no argument that an optimal backup strategy would store every bit of data the instant it appears in perpetuity, but that is rather like saying world peace can be achieved if everybody would stop shooting and just be nice to each other. These "solutions" ignore the axiomatic principle of competition for limited resources. This does not render them less desirable as goals, but serves to focus our attention on the compromises required by our current context.
What does "backup" mean, anyway?
To further roil already muddy waters, the term backup is subject to interpretations which are rarely compared for commonality. To most users a backup system means if you lose a file, be it vital E-mail or the results of your latest three-month calculation, it can be restored to pristine condition rapidly and with a minimum of hassle. That is, after all, what backups are for isn't it? Well, kind of. To your system administrator, backups are what will be used to restore your system if there are problems. In other words, a snapshot of the disk drive which will be used to rebuild the entire system after catastrophic failures, which are generally rare occurrences. It may seem that while these goals are a little different, they both require restoring data from backups, so in reality the same solution fits both requirements equally well.
Consider that the files on your computer system will number in the thousands to tens of thousands, including files that are years old to the ones you made yesterday (the ones you made today will not be on backups remember). If your system is backed up nightly, there will have been hundreds of copies made of some of the files; if the file was static, all copies will be the same, but if you changed things, there might well be many versions of that file, all using the same name. To restore the single version you want, your support folks will have to know precisely where on the file system that file was located, exactly what its name was and the last date (to the day) that it was on the system before it disappeared. Remember, you are probably asking them to wade through hundreds of thousands of files, quite possibly millions (you share the backup system with other users whose files are on the same backup). Given the size of the job, it will not seem unreasonable to them to expect you to know exactly what you want.
Certainly, not all file restores are this complicated, but the point is the required resolution of information to retrieve the file you need may exceed your knowledge of it. Contrast this with the system administrator's use of backups when they need to rebuild your system. They only need to know the name of the system, and will recover everything from the latest snapshot they have. These two modes of use are often differentiated as backup systems and archive systems. Backup systems are optimized for recovery at the system level. Archive systems are optimized for recovery at the file level. Why aren't all systems designed for archive purposes since entire systems could be recovered a file at a time? As you might expect, archive systems are much more complicated and expensive (they must check the haystack a straw at a time to find your needle, and it's a really big haystack). Therefore, archive systems generally require special purpose hardware, high density removable storage media and specially trained personnel.
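To see why the exact path and date matter, here is a minimal sketch of a single-file restore lookup. The catalog layout, the paths and the function name are all hypothetical; real backup software keeps its own index in its own format.

```python
from datetime import date

def pick_version(catalog, path, last_seen):
    """Return the date of the newest backup copy of `path` made on or
    before `last_seen`, or None if no such copy exists.

    `catalog` is an assumed structure: (path, backup_date) pairs,
    one pair per nightly copy of each file.
    """
    candidates = [made for p, made in catalog if p == path and made <= last_seen]
    return max(candidates) if candidates else None

# Three nightly copies of one file, plus one copy of an unrelated file.
catalog = [
    ("/home/pat/thesis.doc", date(2004, 7, 1)),
    ("/home/pat/thesis.doc", date(2004, 7, 2)),
    ("/home/pat/thesis.doc", date(2004, 7, 3)),
    ("/home/pat/notes.txt", date(2004, 7, 2)),
]

good_copy = pick_version(catalog, "/home/pat/thesis.doc", date(2004, 7, 2))
wrong_name = pick_version(catalog, "/home/pat/Thesis.doc", date(2004, 7, 2))
```

With the right path and date the lookup lands on the July 2 copy. Get one letter of the name wrong and it finds nothing, even though the file is sitting there on tape.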
Take, for example, the CITC archive system. It comprises two tape silos which contain six tape drives each and robotic arms to move tape cartridges to and from the drives as needed. The library of tapes used in rotation for backups contains 300 volumes, and services 200 hosts nightly with an average data flow in the range of 5 terabytes weekly. If the required tape is in the library and a tape drive is available (i.e. not busy doing backups), this system can access a specific file from its backlog of two to three weeks' worth of files in a matter of minutes to a few tens of minutes. The hardware and software costs for such a system (not including the costs of a machine room environment in which it can operate) will approach $500,000.
Currently there are approximately 12,000 hosts on the UNT network (the number is quite dynamic). You won't need statistics to help you realize that a similar archive system for that many hosts would be difficult to fund. The College of Arts and Sciences Computing Support Services group takes a different approach to backups. They provide a "network drive" (a convenient way of mapping disk space from another machine into a drive letter on your Windows machine) to which you copy files you want backed up. They then take care of making tape backups of the server system that actually contains your copied data. [Go to http://www.cas.unt.edu/committees/cc/policies/backup/ to learn more about this implementation.]
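Copying a file to such a network drive is easy to script. The sketch below is one way it might be done, not the College's actual procedure: the function name is hypothetical, and so is the drive letter in the usage comment. It copies the file, then compares the copy byte for byte with the original.

```python
import filecmp
import shutil
from pathlib import Path

def copy_to_backup_area(source, backup_dir):
    """Copy `source` into `backup_dir` and confirm the copy matches.

    Returns the path of the new copy; raises OSError if the contents differ.
    """
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps where it can
    # shallow=False forces a comparison of contents, not just size and date
    if not filecmp.cmp(source, target, shallow=False):
        raise OSError(f"copy of {source} does not match the original")
    return target

# Hypothetical usage, assuming the network drive is mapped to H:
# copy_to_backup_area(r"C:\Documents\dissertation.doc", r"H:\backups")
```

The verification step costs a second read of the file, which seems a fair price for knowing the copy is good before you need it.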
Other areas no doubt have similar services optimized to their context. The vital point here is not which type of implementation is best, but rather that implementations differ, so you must query your network manager to understand the specifics of the available services.
What strategies can a single user deploy to "backup the backups"?
Given this volume of data, complexity of technology and division of labor between management domains, what strategies can a single user deploy to "backup the backups" and perhaps be able to sleep at night? In most cases, redundancy is your friend. When the library at Alexandria was burned to the ground by religious zealots, many important texts were irretrievably lost, now known only by their acknowledgement in texts that survived. The ones that survived were most often copies of the originals made painfully by amanuenses, a task even then unpalatable enough that it was generally accomplished only through an act of pious devotion. Copies are the thing.
If you are fortunate enough to have access to multiple networked computers, you could copy important files from one system to the other. This form of backup is especially useful for single files (that vital piece of E-mail). Your computer probably has (or certainly could have at moderate cost) a removable medium storage device. In olden times there was the floppy disk; you can amuse young people by descriptions of this precarious device, answering questions like "Why was it called a disc when it was square?" or "Why was it called a floppy when it was rigid?". Currently the writable CD-ROM is the backup medium of choice. You can store ~700 megabytes of data for up to a few years with reasonable success on these devices. The CD-R is cheap (currently about a quarter apiece in bulk quantities), ubiquitous, reasonably rugged and an international standard (ISO 9660) that can cross boundaries between operating systems and different vendor hardware.
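If your important files add up to more than one disc's worth, a little planning saves blank CDs. The sketch below groups files onto discs, largest first, placing each on the first disc with room for it. The 700 MB figure is treated as exactly 700 x 1024 x 1024 usable bytes, which is an assumption; real capacity varies by disc and burning software. The file names and sizes are made up.

```python
MB = 1024 * 1024
CD_CAPACITY = 700 * MB  # assumption: usable bytes per disc

def plan_discs(sizes, capacity=CD_CAPACITY):
    """Group files onto as few discs as this simple heuristic manages.

    `sizes` maps file name to size in bytes. Returns a list of discs,
    each a list of file names. Files are placed largest first, each on
    the first disc that still has room for it.
    """
    discs = []  # each entry is [bytes_remaining, [file names]]
    for name, size in sorted(sizes.items(), key=lambda item: item[1], reverse=True):
        if size > capacity:
            raise ValueError(f"{name} does not fit on a single disc")
        for disc in discs:
            if disc[0] >= size:
                disc[0] -= size
                disc[1].append(name)
                break
        else:
            # no existing disc had room, so start a new one
            discs.append([capacity - size, [name]])
    return [names for _, names in discs]

# Hypothetical files and sizes
plan = plan_discs({
    "dissertation": 500 * MB,
    "mail-archive": 300 * MB,
    "spreadsheets": 200 * MB,
    "raw-data": 400 * MB,
})
```

For these four files the plan is two full discs: one holding the dissertation and spreadsheets, the other the raw data and mail archive. Burning them in arrival order instead would take three.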
Most operating environments have applications that make "burning" a CD simple enough for anyone concerned about their data to employ successfully. If you are unwilling to expend the effort to learn how to use such an application, perhaps you have an assistant who will not find the chore unapproachable (the use of an amanuensis historically led to errors creeping into the text unbeknownst to the author, but them's the breaks). Despite the RIAA's insistence that the public is not trustworthy enough to have access to technology that can easily and cheaply make perfect copies of digital information, you will rest much easier if you can take a CD copy home with you to store off-site in a safe place. If the data is irreplaceable, make a few and spread them around in case the zealots storm your offices.
There are other media which can be used for backup purposes that may be better adapted to your situation. If a proprietary format is employed, provided by a limited number of suppliers, you should be aware of "bit rot". Large data management operations like NASA have rooms full of 6250 BPI round tapes full of data. If you don't know what one of these looks like, check out a 1960s-vintage science fiction movie where an evil computer is the villain represented as panels of blinking lights and a row of drives the size of your average refrigerator on which tapes spin malevolently from time to time. Much of the data gathered from the early satellite programs is stored on this type of media. While we expended huge amounts of effort to collect this data, some of it will doubtless be lost because the tapes are reaching end-of-life, and the tape devices have become difficult to come by. I have a bookshelf on one wall of my home occupied by 33 1/3 rpm albums which I can no longer play; you can probably still buy a cassette tape drive somewhere, but your eight track tapes are information storage detritus.
Maintaining access to your data will sooner or later depend on the ability to copy it to a newer storage medium. If you cannot do this cheaply and easily, and the "content providers" are trying to convince Congress that only "pirates" require this ability, your data will be exposed to the ravages of technological decay. The best bet is the most open, common format available; at all costs avoid coolness, avoid cutting edge, avoid new and improved, avoid proprietary. You need to be able to make lots of copies of the fruits of your labor cheaply and easily, and propagate those copies to the extent you feel the probability of loss is overwhelmed by your favorite end-of-the-world scenario.
Finally, I should mention that redundancy has a dark side. In point of fact, security and redundancy are at odds with each other. The more copies of anything to be secured, the bigger the job becomes. Often the increased security risk is considered acceptable to obtain the decreased probability of loss, but not always. You are the only one who can make this call. In addition, multiple copies of data which are not identical can lead to the problem of determining which is the "real" copy. Version control systems are beyond the scope of this discussion, but are of great importance if your methodology involves successive refinement (developing software or writing documents for instance).
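Short of a full version control system, you can at least find out which of your scattered copies still agree with one another. The sketch below groups files by a SHA-256 digest of their contents; the function name is hypothetical. It tells you which copies are identical, not which one is the "real" one. That judgment remains yours.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def group_identical_copies(paths):
    """Group files whose contents are byte-for-byte identical.

    Returns a list of groups, each a list of path strings. Files in
    different groups differ somewhere in their contents.
    """
    groups = defaultdict(list)
    for path in paths:
        # Reads each file whole; fine for documents, slow for huge data sets.
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        groups[digest].append(str(path))
    return list(groups.values())
```

If three copies of a manuscript come back as one group of two and one group of one, you know two still match and the third has drifted, which narrows the question considerably.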
Postscript: "What did he say? What did he say?"
My editorial review board opined that a succinct summary of this advice might be useful if not particularly engaging:
See "Safeguarding Research Data" in this issue of Benchmarks Online for further information on this topic.