To decide how to get the most out of your devices you need to know what technologies are available and what their implications are. As always there are tradeoffs with respect to speed, reliability, power, flexibility, ease of use and complexity.
Many of the techniques described below can be stacked in a number of ways to maximise performance and reliability, though at the cost of added complexity.
RAID is a method of increasing reliability, speed or both by using multiple disks in parallel, thereby decreasing access time and increasing transfer speed. A checksum or mirroring system can be used to increase reliability. Large servers can take advantage of such a setup but it might be overkill for a single user system unless you already have a large number of disks available. See other documents and FAQs for more information.
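As a rough illustration of the checksum idea, here is a minimal sketch of XOR parity, the scheme commonly used by parity-based RAID levels. The block contents and the helper name `xor_blocks` are made up for the example; real RAID systems work on much larger blocks and rotate or dedicate parity in ways not shown here.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data "disks" holding one small block each (made-up contents).
data = [b"disk", b"farm", b"raid"]

# The parity block is the XOR of all data blocks.
parity = xor_blocks(data)

# Simulate losing the second disk and rebuilding it from the survivors
# plus the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```

Because XOR is its own inverse, any single missing block can be recovered this way, but two simultaneous failures cannot.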
For Linux one can set up a RAID system using either software (the md module in the kernel), a Linux compatible controller card (PCI-to-SCSI) or a SCSI-to-SCSI controller. Check the documentation for what controllers can be used. A hardware solution is usually faster, and perhaps also safer, but comes at a significant cost.
A summary of available hardware RAID solutions for Linux is available at Linux Consulting.
SCSI-to-SCSI controllers are usually implemented as complete cabinets with drives and a controller that connects to the computer with a second SCSI bus. This makes the entire cabinet of drives look like a single large, fast SCSI drive and requires no special RAID driver. The disadvantage is that the SCSI bus connecting the cabinet to the computer becomes a bottleneck.
A significant disadvantage for people with large disk farms is that there is a limit to how many SCSI entries there can be in the /dev directory. In these cases using SCSI-to-SCSI will conserve entries.
Usually they are configured via the front panel or with a terminal connected to their on-board serial interface.
Some manufacturers of such systems are CMD and Syred whose web pages describe several systems.
PCI-to-SCSI controllers are, as the name suggests, connected to the high speed PCI bus and therefore do not suffer from the same bottleneck as the SCSI-to-SCSI controllers. These controllers require special drivers but you also get the means of controlling the RAID configuration over the network, which simplifies management.
Currently only a few families of PCI-to-SCSI host adapters are supported under Linux.
The oldest and most mature is a range of controllers from DPT including the SmartCache I/III/IV and SmartRAID I/III/IV controller families. These controllers are supported by the EATA-DMA driver in the standard kernel. This company also has an informative home page which describes various general aspects of RAID and SCSI in addition to the product related information.
More information from the author of the DPT controller drivers (EATA* drivers) can be found at his pages on SCSI and DPT.
These are not the fastest but have a good track record of proven reliability.
Note that the maintenance tools for DPT controllers currently run under DOS/Win only so you will need a small DOS/Win partition for some of the software. This also means you have to boot the system into Windows in order to maintain your RAID system.
A very recent addition is a range of controllers from ICP-Vortex featuring up to 5 independent channels and very fast hardware based on the i960 chip. The Linux driver was written by the company itself which shows they support Linux.
As ICP-Vortex supplies the maintenance software for Linux it is not necessary to reboot into another operating system for the setup and maintenance of your RAID system. This also saves you extra downtime.
This is one of the latest entries which is out in early beta. More information as well as drivers are available at Dandelion Digital's Linux DAC960 Page.
Another very recent entry and currently in beta release is the Smart-2 driver.
IBM has released their driver as GPL.
A number of operating systems offer software RAID using ordinary disks and controllers. Cost is low and performance for raw disk IO can be very high. As this can be very CPU intensive it increases the load noticeably, so if the machine is CPU bound rather than IO bound you might be better off with a hardware PCI-to-RAID controller.
Real cost, performance and especially reliability of software vs. hardware RAID is a very controversial topic. Reliability on Linux systems has been very good so far.
The current software RAID project on Linux is the md system (multiple devices) which offers much more than RAID, so it is described in more detail later.
RAID comes in many levels and flavours, of which I will give only a brief overview here. Much has been written about it and the interested reader is recommended to read more about this in the Software RAID HOWTO.
There are also hybrids available based on RAID 0 or 1 and one other level. Many combinations are possible but I have only seen a few referred to. These are more complex than the above mentioned RAID levels.
RAID 0/1 combines striping with duplication which gives very high transfers combined with fast seeks as well as redundancy. The disadvantage is high disk consumption as well as the above mentioned complexity.
RAID 1/5 combines the speed and redundancy benefits of RAID 5 with the fast seek of RAID 1. Redundancy is improved compared to RAID 0/1 but disk consumption is still substantial. Implementing such a system would typically involve more than 6 drives, perhaps even several controllers or SCSI channels.
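To get a feel for the disk consumption mentioned above, the following sketch estimates usable capacity for a few levels. It assumes all drives are the same size, that RAID 1 mirrors every drive in the set, and that RAID 0/1 splits the drives into two mirrored halves; the function name and the drive sizes are made up for illustration, and actual figures depend on the implementation.

```python
def usable_capacity(level, drives, size_gb):
    """Estimate usable space for equally sized drives (illustrative only)."""
    if level == "0":
        # Striping: all space is usable, no redundancy.
        return drives * size_gb
    if level == "1":
        # Mirroring: every drive holds the same data.
        return size_gb
    if level == "5":
        # One drive's worth of space goes to distributed parity.
        if drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return (drives - 1) * size_gb
    if level == "0/1":
        # Half of the drives mirror the other, striped half.
        if drives < 4 or drives % 2:
            raise ValueError("RAID 0/1 needs an even number of drives, 4 or more")
        return (drives // 2) * size_gb
    raise ValueError("unknown level: " + level)


for level in ("0", "1", "5", "0/1"):
    print(level, usable_capacity(level, 4, 9))
```

With four 9 GB drives, RAID 5 leaves three drives' worth of space while RAID 0/1 leaves only two, which is the price paid for the faster seeks.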
Volume management is a way of overcoming the constraints of fixed sized partitions and disks while still keeping control of where the various parts of the file space reside. With such a system you can add new disks to your system and add space from these drives to the parts of the file space where it is needed, as well as migrate data from a disk that is developing faults to other drives before catastrophic failure occurs.
The system developed by Veritas has become the de facto standard for logical volume management.
Volume management is for the time being an area where Linux is lacking.
One effort is the virtual partition system project VPS that will reimplement many of the volume management functions found in IBM's AIX system. Unfortunately this project is currently on hold.
Another project is the Logical Volume Manager project that is similar to a project by HP.
md Kernel Patch
The Linux Multi Disk (md) provides a number of block level features in various stages of development.
RAID 0 (striping) and concatenation are very solid and of production quality, and RAID 4 and 5 are also quite mature.
It is also possible to stack some levels, for instance mirroring (RAID 1) two pairs of drives, each pair set up as striped disks (RAID 0), which offers the speed of RAID 0 combined with the reliability of RAID 1.
In addition to RAID this system offers (in alpha stage) block level volume management and soon also translucent file space. Since this is done on the block level it can be used in combination with any file system, even for fat using Wine.
Think very carefully about which drives you combine so you can operate all drives in parallel, which gives you better performance and less wear. Read more about this in the documentation that comes with md.
Unfortunately the Linux software RAID has split into two trees: the old stable versions 0.35 and 0.42, which are documented in the official Software-RAID HOWTO, and the newer, less stable 0.90 series, which is documented in the unofficial Software RAID HOWTO, a work in progress.
A patch for online growth of ext2fs is available in early stages and related work is taking place at the ext2fs resize project at Sourceforge.
Hint: if you cannot get it to work properly you have forgotten to set the persistent-block flag. Your best documentation is currently the source code.
Disk compression versus file compression is a hotly debated topic, especially regarding the added danger of file corruption. Nevertheless there are several options available for adventurous administrators. These take many forms, from kernel modules and patches to extra libraries, but note that most suffer from various limitations such as being read-only. As development takes place at breakneck speed the specs have undoubtedly changed by the time you read this. As always: check the latest updates yourself. Here only a few references are given.
e2compr is a package that extends ext2fs with compression capabilities. It is still under testing and will therefore mainly be of interest for kernel hackers but should soon gain stability for wider use. Check the e2compr homepage (http://e2compr.memalpha.cx/e2compr/) for more information. I have reports of speed and good stability which is why it is mentioned here.
Access Control List (ACL) offers finer control over file access on a user by user basis, rather than the traditional owner, group and others, as seen in directory listings (drwxr-xr-x). This is currently not available in Linux but is expected in kernel 2.3 as hooks are already in place in ext2fs.
cachefs
This uses part of a hard disk to cache slower media such as CD-ROM. It is available under SunOS but not yet for Linux.
This is a copy-on-write system where writes go to a different system than the original source while making it look like an ordinary file space. Thus the file space inherits the original data and the translucent write back buffer can be private to each user.
There is a number of applications:
SunOS offers this feature and this is under development for Linux. There was an old project called the Inheriting File Systems (ifs) but this project has stopped. One current project is part of the md system and offers block level translucence so it can be applied to any file system.
Sun has an informative page on translucent file system.
It should be noted that Clearcase (now owned by Rational) pioneered and popularized translucent filesystems for software configuration management by writing their own UNIX filesystem.
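The copy-on-write behaviour of a translucent file space can be sketched in a few lines. This is a toy model only: the class name `TranslucentStore`, the use of dictionaries in place of real file systems, and the example paths are all assumptions made for illustration and do not reflect how SunOS, ifs or md implement it.

```python
class TranslucentStore:
    """Toy copy-on-write view: reads fall through, writes stay private."""

    def __init__(self, base):
        self._base = base    # shared original data, never modified here
        self._overlay = {}   # private write-back buffer for this user

    def read(self, path):
        # Prefer the private copy, otherwise fall through to the original.
        if path in self._overlay:
            return self._overlay[path]
        if path in self._base:
            return self._base[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Writes never touch the shared base, only the overlay.
        self._overlay[path] = data


# A shared, read-only source tree (contents are made up).
source = {"/src/main.c": "original", "/src/util.c": "helpers"}

alice = TranslucentStore(source)
bob = TranslucentStore(source)

alice.write("/src/main.c", "alice's patch")

print(alice.read("/src/main.c"))  # alice sees her own change
print(bob.read("/src/main.c"))    # bob still sees the original
```

Each user gets what looks like a complete, writable file space while the original data stays untouched.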
This trick used to be very important when drives were slow and small, and some file systems used to take the varying characteristics into account when placing files. Higher overall speed, on-board drive and controller caches, and drive intelligence have since reduced the effect of this.
Nevertheless there is still a little to be gained even today. As we know, "world dominance" is soon within reach but to achieve this "fast" we need to employ all the tricks we can use.
To understand the strategy we need to recall this near ancient piece of knowledge and the properties of the various track locations. This is based on the fact that transfer speeds generally increase for tracks further away from the spindle, as well as the fact that it is faster to seek to or from the central tracks than to or from the inner or outer tracks.
Most drives use disks running at constant angular velocity but use (fairly) constant data density across all tracks. This means that you will get much higher transfer rates on the outer tracks than on the inner tracks; a characteristic which fits the requirements for large libraries well.
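The effect of constant angular velocity can be shown with a small calculation: at a fixed rotational speed, the raw transfer rate scales with the number of sectors that pass under the head per revolution. The sector counts and the function name below are invented for illustration and do not describe any particular drive; overheads such as head switches are ignored.

```python
def raw_rate_mb_s(sectors_per_track, rpm, sector_bytes=512):
    """Raw media transfer rate in MB/s for one track (no overheads)."""
    revolutions_per_second = rpm / 60
    return sectors_per_track * sector_bytes * revolutions_per_second / 1e6


# Hypothetical drive: outer tracks hold twice as many sectors as inner ones.
inner = raw_rate_mb_s(100, 5400)
outer = raw_rate_mb_s(200, 5400)

print(round(inner, 3), round(outer, 3), round(outer / inner, 1))
```

Under these assumptions the outer track delivers twice the data per revolution, and therefore twice the transfer rate, of the inner one.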
Newer disks use a logical geometry mapping which differs from the actual physical mapping which is transparently mapped by the drive itself. This makes the estimation of the "middle" tracks a little harder.
In most cases track 0 is at the outermost track and this is the general assumption most people use. Still, it should be kept in mind that there are no guarantees this is so.
Inner tracks are usually slow in transfer, and lying at one end of the seeking range they are also slow to seek to.
This makes them more suitable for the low end directories such as DOS, root and print spools.
Middle tracks are on average faster with respect to transfers than inner tracks and, being in the middle, also on average faster to seek to.
This characteristic is ideal for the most demanding parts such as swap, /tmp and /var/tmp.
Outer tracks have on average even faster transfer characteristics but, like the inner tracks, are at the end of the seek range so statistically they are as slow to seek to as the inner tracks.
Large files such as libraries would benefit from a place here.
Hence seek time reduction can be achieved by positioning frequently accessed tracks in the middle so that the average seek distance and therefore the seek time is short. This can be done either by using fdisk or cfdisk to make a partition on the middle tracks or by first making a file (using dd) equal to half the size of the entire disk before creating the files that are frequently accessed, after which the dummy file can be deleted. Both cases assume starting from an empty disk.
The latter trick is suitable for news spools where the empty directory structure can be placed in the middle before putting in the data files. This also helps reduce fragmentation a little.
This little trick can be used both on ordinary drives as well as RAID systems. In the latter case the calculation for centring the tracks will be different, if possible. Consult the latest RAID manual.
The speed difference this makes depends on the drives, but a 50 percent improvement is a typical value.
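For the partitioning approach, working out where a centred partition should start is simple arithmetic. The sketch below assumes the cylinder numbers reported by the drive map evenly onto physical tracks, which, as noted earlier, is not guaranteed on drives with logical geometry mapping; the function name and the numbers are illustrative only.

```python
def centred_range(total_cylinders, partition_cylinders):
    """Return (first, last) cylinder of a partition centred on the disk."""
    if not 0 < partition_cylinders <= total_cylinders:
        raise ValueError("partition must fit on the disk")
    # Leave equal unused space (to within one cylinder) on either side.
    first = (total_cylinders - partition_cylinders) // 2
    last = first + partition_cylinders - 1
    return first, last


# Hypothetical 1024 cylinder disk with a 200 cylinder partition for swap.
print(centred_range(1024, 200))
```

The resulting start and end values are what you would then enter by hand in fdisk or cfdisk.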
The same mechanical head disk assembly (HDA) is often available with a number of interfaces (IDE, SCSI etc) and the mechanical parameters are therefore often comparable. The mechanics are today often the limiting factor but development is improving things steadily. There are two main parameters, usually quoted in milliseconds (ms):
After voice coils replaced stepper motors for the head movement the improvements seem to have levelled off and more energy is now spent (literally) at improving rotational speed. This has the secondary benefit of also improving transfer rates.
Some typical values:
Drive type
Access time (ms)    |  Fast   Typical   Old
---------------------------------------------
Track-to-track          <1       2        8
Average seek            10      15       30
End-to-end              10      30       70
This shows that the very high end drives offer only marginally better access times than the average drives but that the old drives based on stepper motors are significantly worse.
Rotational speed (RPM) | 3600 | 4500 | 4800 | 5400 | 7200 | 10000
-------------------------------------------------------------------
Latency          (ms)  |   17 |   13 | 12.5 | 11.1 |  8.3 |   6.0
As the latency quoted here is the time for one full revolution, that is the longest wait for a given sector (the average wait is half of this), the formula is quite simply
latency (ms) = 60000 / speed (RPM)
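The formula is easy to check against the table. The sketch below computes the time for one full revolution and, as an addition not taken from the table, half of that as the average wait for a randomly placed sector; the function name is made up for the example.

```python
def rotation_time_ms(rpm):
    """Time for one full revolution in milliseconds."""
    return 60000 / rpm


for rpm in (3600, 4500, 4800, 5400, 7200, 10000):
    full = rotation_time_ms(rpm)
    # On average the wanted sector is half a revolution away.
    print(rpm, round(full, 1), round(full / 2, 1))
```

Going from 7200 to 10000 RPM shaves only a couple of milliseconds off a full revolution, which is small next to the seek times listed earlier.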
Clearly this too is an example of diminishing returns for the efforts put into development. However, what really takes off here is the power consumption, heat and noise.
There is also a Linux Yoke Driver available in beta which is intended to do hot-swappable transparent binding of one Linux block device to another. This means that if you bind two block devices together, say /dev/hda and /dev/loop0, writing to one device will mean also writing to the other and reading from either will yield the same result.
One of the advantages of a layered design of an operating system is that you have the flexibility to put the pieces together in a number of ways. For instance you can cache a CD-ROM with cachefs that is a volume striped over 2 drives. This in turn can be set up translucently with a volume that is NFS mounted from another machine. RAID can be stacked in several layers to offer very fast seek and transfer in such a way that it will work if even 3 drives fail. The choices are many, limited only by imagination and, probably more importantly, money.
There is a near infinite number of combinations available but my recommendation is to start off with a simple setup without any fancy add-ons. Get a feel for what is needed, where the maximum performance is required, whether it is access time or transfer speed that is the bottleneck, and so on. Then phase in each component in turn. As you can stack quite freely you should be able to retrofit most components as time goes by with relatively few difficulties.
RAID is usually a good idea but make sure you have a thorough grasp of the technology and a solid backup system.