IDE RAID Technology
By Duane Gustavus, UNIX Research Analyst
This paper documents an investigation of a new disk storage array strategy that offers an attractive price/storage point as compared to current commercial SAN or NAS products. The discussion includes a general description of the technology, practical details of implementation and an itemized price list.
One of the salient features of a great deal of contemporary scientific research is the reliance on significant amounts of computing power. The data that results from these computations must be stored for analysis and reference in later research. It is not surprising, then, that new terms like "disk farm" and "terabyte" are being added to the technical jargon of the day.
Current commercial answers to this vexing problem come in a variety of flavors, but share the general feature that significant amounts of storage come at significant prices. Because the definition of "significant" is at the fulcrum of the topic under consideration, I propose that one terabyte of disk space (1,000 gigabytes, or one million megabytes) and $10,000 are both significant values for individual research projects. If the definition of significant changes (say, to one petabyte of storage or a $1,000 investment), the options available for investigation can be expected to change as well.
Unfortunately the terabyte/10k price point is not a hotly contested slot in the current commercial market. Some of the national supercomputer centers and other research groups have published results of their attempts to provide some solution near that price point, and those investigations lent impetus to a similar exercise here at UNT. The results presented here indicate the terabyte/10k price point is not wishful thinking, but there are also compromises that should be understood before building your own.
Many computer users already have experience using disk arrays, often referred to as RAIDs (an acronym for Redundant Arrays of Inexpensive Disks). The RAID concept was a reaction to the problems and expense involved with building continually larger (in terms of storage capacity) disk devices. If one could make several smaller devices appear functionally as a single disk drive, the problems of bit density and physical size could be neatly sidestepped. This nut was cracked, and there are many varieties of RAID controllers on the market now, providing both increased reliability and performance.
RAIDs can be configured in a variety of ways which highlight the required compromise between data integrity and storage size. The trade-off is generally to increase data integrity by writing the data in more than one place (data redundancy), which in turn reduces the amount of storage available by a factor of the redundancy. The most obvious version of this trade-off is between RAID0 and RAID1. In RAID0, multiple disks are interleaved to appear as one disk which is the size of all disks combined (called striping); in RAID1, half the disks are used for data redundancy and are simply copies of the other half (called disk mirroring). RAID0 maximizes the amount of space available, but at the cost of reliability (the probability of disk failure increases with the number of disk drives, and any single drive failing breaks the entire array); RAID1 increases reliability (two disks would have to fail to lose data), but cuts the total available storage in half. Some attempts have been made to provide a compromise between the two extremes with features of both. RAID10 is the combination of RAID0 and RAID1 in which a set of striped disks is mirrored. This provides the redundancy of mirroring with the large file system sizes available to RAID0 configurations. Another approach to a compromise between size and redundancy is RAID5. In a RAID5 disk array, one drive can be considered logically as a parity drive (parity information allows lost data to be recomputed). When a single disk fails, the data can still be reconstructed from the parity information. This provides redundancy at a lower storage cost than mirroring, but at slower performance due to the work of constructing and reconstructing the parity information.
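The capacity side of these trade-offs is simple arithmetic. As a quick sketch, using the eight 80g drives from the configuration described later (shell arithmetic only; no RAID software is involved):

```shell
n=8    # number of drives in the array
s=80   # advertised size of each drive, in gigabytes
echo "RAID0:    $(( n * s )) GB usable"        # striping: no redundancy
echo "RAID1/10: $(( n * s / 2 )) GB usable"    # mirroring: half the space
echo "RAID5:    $(( (n - 1) * s )) GB usable"  # one drive's worth of parity
```

For this eight-drive array the advertised figures work out to 640, 320 and 560 gigabytes respectively, before any file system overhead.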
Up to this point, RAID drive arrays have primarily been specialty products aimed at data centers where big systems (and big budgets) are the rule. One cost factor involved with these RAID products has been the type of disk drive supported. Due to design limitations, the IDE disk drives commonly used in desktop computers were not useful for RAID configurations, so more expensive SCSI-interfaced hard disks were and are the norm. This is unfortunate from a cost perspective because the economy of scale involved in the huge desktop computer market constantly drives down the cost of IDE drives while the competition for a share in that market has continually increased the performance.
Several companies have recently introduced IDE RAID controllers (sometimes called storage switches) which are designed to use the newer ATA100 and ATA133 specification IDE drives. This approach offers considerable savings due to the lower cost of the hardware involved and the necessity of establishing a new price point in order to compete with established vendors already controlling NAS/SAN market share. Nonetheless, entire IDE-based RAID disk systems are already appearing in commercial form, so you may be able to buy something similar to the system described in this paper by the time you read this, thus avoiding the hassles (and savings) of building your own. Even if you decide to buy a complete system from a commercial integrator, these experiences might be useful in helping you understand the compromises you must make.
In order to employ one of the new IDE RAID controllers, a server class computer system will be needed. The controller selected for this project is a 64-bit PCI card, so nothing exotic is required; a system was specified which is generally representative of current high-end desktop technology. The exceptional part of the specification is that a case/power supply combination must be selected which can meet the requirement of running at least twelve disk drives. These drives generate heat and can be expected to generate a substantial power surge when the system is first turned on. To reduce costs further, existing hardware could be used or other components selected, but the ability to power and cool a dozen disk drives is not a minor issue and should not be discounted.
Items on the original system purchase order are shown in the following table. These were reasonable prices at the time of order, but will almost assuredly have changed since this investigation:
Some required items were not on this PO because they were available locally (CD-ROM drive, network card, video monitor, mouse and keyboard). The video subsystem is not particularly important on a file server, so this is a good area to look for savings. The components actually delivered substituted an 80g Maxtor drive due to delivery problems with the higher-density drive and the desire not to make the project contingent on delivery of that product (and at a price reduction to $229.99 each). With the 20% decrease in drive storage, the terabyte target slips to 800 gigabytes for this implementation, but the 100g drives seem to be available now (and possibly larger ATA-133 drives). The final component necessary for this investigation is the IDE RAID controller. The model selected was a product from the 3ware corporation named the Escalade Storage Switch model 7810. There are several others, and this study is not an endorsement of the 3ware product over any other.
This particular controller supports up to eight drives (which is why two were ordered) in various RAID configurations. Support is provided for IDE drives that meet the ATA/100 specification (or lower). The company also claims "hot swap" and "hot spare" capabilities, but these features have not been tested in this implementation. The controllers were purchased for a unit price of $385.00, and were the component (there's always one) slowest to arrive, delaying the project by several weeks.
The system was constructed while awaiting delivery of the RAID controllers, using one of the drives for system software with the motherboard's IDE controller. The software installed was the RedHat Linux 7.2 distribution, which comes standard with 3ware drivers. The primary installation issue involved the physical layout of the case selected. In order to make the IDE disk cables reach from the controller to the disk drive bay, it was necessary to machine two "windows" in the slide-out tray that holds the motherboard for cable routing. This modification was a fairly trivial machining job if you have a machine shop, but cable routing should be carefully planned in advance if you don't. Most of these cases will be designed with SCSI cabling in mind, which means one or two fairly long cables with multiple connectors; for this controller, each disk drive has a separate 80-conductor ribbon cable (supplied with the controller), which requires a little more forethought. In addition, the controller comes with a set of cables which "Y" the power connectors, but a case whose power supply already provides enough power leads is a good indication that the manufacturer designed the system to support several drives. When the controllers arrived, one was placed in the system and connected to eight drives. The instructions provided with the controller are minimal, but there was no difficulty in putting the controller and drives into service.
The total cost of the final configuration as implemented, including shipping and handling charges, was $4998.83. This does not include the keyboard, mouse, NIC, video monitor or miscellaneous hardware (disk drives do not come with mounting screws, etc.). The potential data storage capacity is 880g (one of the twelve drives is used for system software and therefore not placed on the RAID controller) for a price very close to $5,000.
The primary thrust of this implementation was low-cost, network accessible disk storage. To meet that system criterion, it was only necessary to see if file systems could be built on the array and then exported to the network. The initial tests were done with a configuration using one controller and eight disk drives. The drives are represented to the user as a single SCSI disk (i.e., the 3ware controller looks like a SCSI controller to Linux). In the first case, the entire capacity of all eight drives was used to build a single file system. From the user perspective, it looks like:
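(The screen capture did not survive conversion to this format. The following is a sketch of the steps implied and a representative df listing: the array appears as /dev/sda, the mke2fs and mount invocations are assumptions, and the block counts are rounded from the 615g/584g figures quoted in the text.)

```shell
mke2fs /dev/sda1          # build an ext2 file system on the array partition
mkdir -p /3W
mount /dev/sda1 /3W       # mount it locally at /3W
df -k /3W                 # reported sizes are in one-kilobyte blocks
# Filesystem   1k-blocks       Used  Available  Mounted on
# /dev/sda1    615000000        ...  584000000  /3W
```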
Notice that the file system named 3W, which shows up as device /dev/sda1, contains 615 gigabytes (the reported sizes are in one-kilobyte blocks). The eight 80g drives have a maximum advertised storage of 640g, but the file system requires some of that storage (about four percent in this case) for metadata. Under the Available column, this file system shows only 584g. That is the result of holding back five percent of the space in the file system which only root can utilize (a standard practice which can be modified with special directives). From the user perspective, then, we have a file system with just over half a terabyte; notice, however, that this is about 91% of the theoretical maximum storage available for this configuration (one controller, eight drives, 640g). In tests with different types of RAID, redundant storage will make 90% utilization of advertised space look very attractive indeed.
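The five-percent root reserve accounts for the gap between the 615g file system size and the 584g available to users, and it is tunable after the fact (tune2fs's -m flag sets the reserved percentage). A quick check of the arithmetic:

```shell
# 5% of a 615 GB file system is held back for root-only use:
echo "$(( 615 * 95 / 100 )) GB available to ordinary users"   # -> 584 GB
# The reserve can be reduced later, e.g.:  tune2fs -m 1 /dev/sda1
```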
The file system is now operational, but not very useful because there is only local access. This file system can be exported for network access using NFS (Network File System). The following screen capture shows the mount command used to make the new file system available from a different machine by using the path /3W (in other words, the same path but on a different computer):
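(This capture also did not survive conversion; the commands were presumably of the following form. The server hostname tbyte is taken from the benchmarks section, and the df block counts are rounded from the figures in the text.)

```shell
mount -t nfs tbyte:/3W /3W    # mount the exported file system from tbyte
df -k /3W
# Filesystem   1k-blocks       Used  Available  Mounted on
# tbyte:/3W    615000000        ...  584000000  /3W
```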
Comparing the two views of the file system will show that only the device name has changed, with the NFS mount showing the hostname from which the file system was shared. The configuration which defines which file systems are exported (that is, capable of being shared) also controls where they will be available (allowing access control by host). In other words, the file system may be shared with only specific hosts, or entire subnets, or any system on the network capable of performing an NFS mount. It is important to remember when using NFS that file sharing in this fashion presumes a common user base (i.e., the user named linus on the file server will have the same user ID on all hosts that share the server's exported file systems).
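A minimal sketch of the server-side configuration just described (the export path /3W is from the text; the subnet pattern and options are illustrative assumptions to be adjusted for your own network):

```shell
# On the server: an /etc/exports entry granting one subnet read/write access.
#   /3W   192.168.1.0/255.255.255.0(rw,sync)
exportfs -a            # (re)export everything listed in /etc/exports
showmount -e           # verify what the server is now exporting
```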
The exercise was to provide large amounts of inexpensive disk space which was accessible over the network. To this point a file system of 584 usable gigabytes has been provided for a cost of $5,000, or about 116 megabytes per dollar (remember this cost includes the computer and all hardware necessary to use the disks). Having established a "beachhead" on the storage/price criterion, some benchmarks are in order to characterize the performance of the result.
The benchmark numbers provided in this section are intended to be neither authoritative nor exhaustive. In order to provide more reproducible results, the tests would need to be run over a private network in a much more controlled environment. On the other hand, it is often quite difficult to reproduce the performance documented in this type of benchmark testing outside a controlled environment. These tests were run over the UNT Academic Computing Services subnet during "off" hours (which means early morning here). In addition to the described system, some commercial alternatives were tested, and the numbers are provided for perspective. The environment was not equal for all tests simply because some of the systems are not available on the public network. This additional information is still useful for getting the "flavor" of the compromises required.
To test disk performance, the publicly available bonnie benchmark was used. This program tests sequential read/write performance in both character and block mode, and also provides a random seek test. The file size used for these tests was 100 megabytes, with results averaged over five runs as kilobytes/second. These tests were run on the standard Linux file system type (ext2). The following table summarizes the test runs on tbyte:
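For reference, a run of this form can be reproduced with bonnie's -d (working directory) and -s (file size in megabytes) options; the five-iteration loop mirrors the averaging described above (the mount point /3W is from the earlier sections):

```shell
# Run bonnie five times with a 100 MB test file on the array's file system;
# the per-run figures are then averaged by hand (or with a small script).
for run in 1 2 3 4 5; do
    bonnie -d /3W -s 100
done
```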
The test labeled IDE was run against the IDE disk attached to the standard IDE controller, providing information about the performance capabilities of a single drive; the remainder were run on groups of disks attached to the RAID controller. The RAID0 configuration was a single stripe over eight disks, with RAID5 also using a full eight-disk array. The RAID10 and RAID10-NFS tests were both run with an array configuration using four drives (a two-drive stripe which is mirrored). The RAID10-NFS test was run on a remote host using the RAID10 array over an NFS mount point. The two NFS tests are probably the most useful in terms of predicting performance because they were run over the UNT network. Both of the NFS tests exhibit the performance decrease characteristic of network file accesses. In addition, the RAID5-NFS figures show the combined effects of network access and the read-before-write design (two network accesses) typical of RAID5 configurations. The trade-off between RAID1 (or RAID10) and RAID5 is more obvious when the data size available to users is considered for eight-disk arrays: 511g for RAID5 and 292g for RAID10. In other words, for redundant array types, RAID5 will provide more storage at lower performance.
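The usable sizes quoted above follow from the advertised capacities less roughly the same nine percent of overhead (file system metadata plus the root reserve) observed on the RAID0 file system. A rough check, again for eight 80g drives:

```shell
raid5_adv=$(( (8 - 1) * 80 ))    # RAID5 advertised capacity: 560 GB
raid10_adv=$(( 8 * 80 / 2 ))     # RAID10 advertised capacity: 320 GB
# Apply the ~91% net-to-user ratio seen earlier:
echo "RAID5:  ~$(( raid5_adv * 91 / 100 ))g usable"    # text reports 511g
echo "RAID10: ~$(( raid10_adv * 91 / 100 ))g usable"   # text reports 292g
```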
In order to get the flavor of the performance compromise between this approach and some commercial products, similar tests were run on two different "disk farms" employed by ACS in production mode. This is not an apples-to-apples comparison for several reasons. First, the network over which these tests were run is not the campus network (this should affect only the NFS test). In the case of the MetaStor NAS, the file system involved was a proprietary Veritas file system using RAID5. The Nstore is a SAN, so the "network" attachment is a fiber technology SCSI network and not the common campus network. Finally, the cost of the MetaStor with 375g was about $130,000, while the Nstore with 500g was about $50,000 with an additional $2k/node charge for the fiber interface card.
Both of these systems provided generally improved performance on these tests, but at significantly increased price. Part of the compromise here is due to the projected use of the storage. With the MetaStor, the physical configuration will support many more disk drives than are currently employed (growth capacity). While the cost of these drives will be high, the hardware described in this project is simply not capable of the storage capacities which are supported by the MetaStor. In addition, some of the costs involved in the MetaStor are for large buffers which will sustain these performance numbers under much greater load. The Nstore SAN is high-performance technology most comparable to a local disk drive because the data transport is not over a general-purpose network. The performance comes at the cost of specialized interface cards and fiber cable connects which are an additional expense beyond the normal NIC already part of the system.
The search for storage space will become an increasingly common venture as more data is churned out by computational research in every field. In some cases where the amount of space is prioritized over performance concerns, the IDE RAID technology can provide impressive amounts of storage per dollar and, perhaps more importantly, is within reach of modest departmental or even project level budgets. The implementation is not particularly difficult if local UNIX competence is available, but could be an issue if that is not the case. This technology is new enough that long-term stability should be a concern, though there was no sign of any instability in the process of running these tests. In addition, NFS is not an appropriate solution in all contexts. If, however, the requirements can be met by this approach, a very impressive price/storage point is possible with an IDE RAID system.
The decision about which type of technology to employ should not be generalized into a "one size fits all" methodology. The data collected in this project argues that at the current time there are several viable approaches, each with its own set of compromises. The premise of deploying a system with "room to grow" should be greeted with special skepticism because the "upfront" costs of this growth potential are high, while the rate of technological change can easily devalue the usefulness of growing a two-year old technology.