Since a filesystem tries to work as asynchronously as possible in order to avoid the hard-disk bottleneck, a sudden interruption of its work could result in a loss of data. As an example, consider the following scenario: what happens if your machine crashes while you are working on a document residing on a standard Linux ext2 filesystem?
There are several answers:
The standard Linux filesystem (ext2fs) attempts to prevent and recover from metadata corruption by performing an extensive filesystem analysis (fsck) during bootup. Since ext2fs incorporates redundant copies of critical metadata, it is extremely unlikely for that data to be completely lost. The system figures out where the corrupt metadata is, and then either repairs the damage by copying from the redundant version or simply deletes the file or files whose metadata is affected.
Obviously, the larger the filesystem, the longer the check takes. On a partition of several gigabytes, checking the metadata during bootup may take a great deal of time.
As Linux begins to take on more complex applications, on larger servers, and with less tolerance for downtime, there is a need for more sophisticated filesystems that do an even better job of protecting data and metadata.
The journalling filesystems available for Linux are the answer to this need.
Most modern filesystems use journalling techniques borrowed from the database world to improve crash recovery. Disk transactions are written sequentially to an area of the disk called the journal or log before being written to their final locations within the filesystem.
Implementations vary in terms of what data is written to the log. Some implementations write only the filesystem metadata, while others record all writes to the journal.
Now, if a crash happens before the journal entry is committed, the original data is still on the disk and you have lost only your new changes. If the crash happens during the actual disk update (i.e. after the journal entry was committed), the journal entry shows what was supposed to have happened. So when the system reboots, it can simply replay the journal entries and complete the update that was interrupted.
In either case, you have valid data and not a trashed partition. And since the recovery time associated with this log-based approach is much shorter, the system is back online in a few seconds.
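The replay rule described above can be illustrated with a toy model. This is only a sketch: the journal format (one "write" or "commit" record per line) and the function name replay_journal are invented for the illustration and do not correspond to the on-disk format of any real filesystem.

```shell
#!/bin/sh
# Toy model of journal replay. Hypothetical record format, one per line:
#   write <txid> <block> <value>   a pending update to a block
#   commit <txid>                  transaction <txid> is complete

# replay_journal JOURNAL_FILE
# Prints "<block> <value>" for every write that belongs to a committed
# transaction. Writes of uncommitted transactions are discarded, which
# models "you have lost only your new changes".
replay_journal() {
    awk '
        $1 == "write"  { n++; tx[n] = $2; blk[n] = $3; val[n] = $4 }
        $1 == "commit" { committed[$2] = 1 }
        END {
            for (i = 1; i <= n; i++)
                if (tx[i] in committed)
                    print blk[i], val[i]
        }
    ' "$1"
}
```

Given a journal in which transaction 1 was committed and transaction 2 was interrupted by the crash, only the writes of transaction 1 are replayed; the blocks touched by transaction 2 keep their old contents.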
It is also important to note that using a journalling filesystem does not entirely obsolete the use of filesystem checking programs (fsck). Hardware and software errors that corrupt random blocks in the filesystem are not generally recoverable with the transaction log.
The first journalling filesystem considered here is ext3. Developed by Stephen Tweedie, a leading Linux kernel developer, ext3 adds journalling to ext2. It is available in alpha form at ftp.linux.org.uk/pub/linux/sct/fs/jfs/.
Namesys has a journalling filesystem under development called ReiserFS. It is available at www.namesys.com.
SGI released version 1.0 of its XFS filesystem for Linux on May 1, 2001. You can find it at oss.sgi.com/projects/xfs/.
In this article these three solutions are tested and benchmarked using two different programs.
The ext3 filesystem is directly derived from its ancestor, ext2. It has the valuable characteristic of being fully backward compatible with ext2, since it is just an ext2 filesystem with journalling. The obvious drawback is that ext3 doesn't implement any of the modern filesystem features which increase data manipulation speed and packing.
ext3 comes as a patch against the 2.2.19 kernel, so first of all, get a linux-2.2.19 kernel from ftp.kernel.org or from one of its mirrors. The patch is available at ftp.linux.org.uk/pub/linux/sct/fs/jfs or ftp.kernel.org/pub/linux/kernel/people/sct/ext3, or from one of the mirrors of that site.
From one of these sites you need to get the following files:
    mv linux linux-old
    tar -Ixvf linux-2.2.19.tar.bz2
    tar -Ixvf ext3-0.0.7a.tar.bz2
    cd linux
    cat ../ext3-0.0.7a/linux-2.2.19.kdb.diff | patch -sp1
    cat ../ext3-0.0.7a/linux-2.2.19.ext3.diff | patch -sp1

The first diff is a copy of SGI's kdb kernel debugger patches. The second one is the ext3 filesystem.
After the kernel is compiled and installed, build and install e2fsprogs:

    tar -Ixvf e2fsprogs-1.21-WIP-0601.tar.bz2
    cd e2fsprogs-1.21
    ./configure
    make
    make check
    make install

That's all. The next step is to make an ext3 filesystem on a partition. Reboot with the new kernel. Now you have two options: make a new journalling filesystem or add a journal to an existing one.
To make a new ext3 filesystem, run:

    mke2fs -j /dev/xxx

where /dev/xxx is the device on which you want to create the ext3 filesystem. The "-j" flag tells mke2fs to create an ext3 filesystem with a hidden journal. You can control the size of the journal with the optional flag -Jsize=<n> (n is the preferred size of the journal in MB).
To add a journal to an existing ext2 filesystem, run:

    tune2fs -j /dev/xxx

You can do this on either a mounted or an unmounted filesystem. If the filesystem is mounted, a file .journal is created in the top-level directory of the filesystem; if it is unmounted, a hidden system inode is used for the journal. Either way, all the data in the filesystem is preserved.
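One way to confirm that a journal was added is to look at the feature list printed by "tune2fs -l". The helper below is a minimal sketch: the function name has_journal is made up for this example, and it assumes that the installed e2fsprogs prints a "Filesystem features:" line listing has_journal for a journalled filesystem, as current versions do. It reads the tune2fs output on standard input, so it never touches a device itself.

```shell
#!/bin/sh
# has_journal: reads the output of "tune2fs -l" on stdin and succeeds
# (exit status 0) if the "Filesystem features:" line lists has_journal.
# Assumption: the feature name and the line label match your e2fsprogs.
has_journal() {
    awk -F: '
        /^Filesystem features:/ {
            n = split($2, f, " ")
            for (i = 1; i <= n; i++)
                if (f[i] == "has_journal") found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Usage (as root; /dev/xxx is the same placeholder used in the text):
#   tune2fs -l /dev/xxx | has_journal && echo "journal present"
```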
Mount the filesystem as ext3 with:

    mount -t ext3 /dev/xxx /mount_dir

Since ext3 is basically ext2 with journalling, a cleanly unmounted ext3 filesystem can be mounted again as ext2 without any other commands.
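Because the same partition can be mounted either as ext3 or as ext2, it is worth checking which type is actually in use. The sketch below parses lines in the /proc/mounts format (device, mount point, type, options, and two numeric fields); the function name fs_type_of is hypothetical.

```shell
#!/bin/sh
# fs_type_of MOUNT_DIR: reads /proc/mounts-style lines on stdin and prints
# the filesystem type of the last entry mounted on MOUNT_DIR.
# Prints nothing if MOUNT_DIR is not a mount point in the input.
fs_type_of() {
    awk -v dir="$1" '
        $2 == dir { type = $3 }
        END { if (type != "") print type }
    '
}

# Usage:
#   fs_type_of /mount_dir < /proc/mounts
```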
XFS is a journalling filesystem for Linux available from SGI. It is a mature technology that has been proven on IRIX systems as the default filesystem for all SGI customers. XFS is licensed under GPL.
XFS Linux 1.0 is released for the Linux 2.4 kernel, and I tried the 2.4.2 patch. So the first step is to get a linux-2.4.2 kernel from a mirror of kernel.org.
The patches are at oss.sgi.com/projects/xfs/download/Release-1.0/patches. From this directory download:
    mv linux linux-old
    tar -Ixf linux-2.4.2.tar.bz2

Copy each patch into the top directory of your Linux source tree (i.e. /usr/src/linux) and apply them:
    zcat patchfile.gz | patch -p1

Then configure the kernel, enabling the options "XFS filesystem support" (CONFIG_XFS_FS) and "Page Buffer support" (CONFIG_PAGE_BUF) in the filesystem section. Note that you will also need to upgrade the following system utilities to these versions or later:
    tar -zxf xfsprogs-1.2.0.src.tar.gz
    cd xfsprogs-1.2.0
    make configure
    make
    make install

After installing these utilities, you can create a new XFS filesystem with the command:
    mkfs -t xfs /dev/xxx

One important option that you may need is "-f", which forces the creation of a new filesystem even if a filesystem already exists on that partition. Again, note that this will destroy all data currently on that partition:
    mkfs -t xfs -f /dev/xxx

You can then mount the new filesystem with the command:

    mount -t xfs /dev/xxx /mount_dir
ReiserFS has been in the official Linux kernel since 2.4.1-pre4. You always need to get the utils (e.g. mkreiserfs to create ReiserFS on an empty partition, the resizer, etc.).
The up-to-date ReiserFS version is available as a patch against either 2.2.x or 2.4.x kernels. I tested the patch against the 2.2.19 Linux kernel.
The first step, as usual, is to get a standard linux-2.2.19.tar.bz2 kernel from a mirror of kernel.org. Then get the reiserfs 2.2.19 patch. At the time of writing, the latest patch is 3.5.33.
Please note that, if you choose to get the patch against a 2.4.x kernel, you should also get the utils tarball reiserfsprogs-3.x.0j.tar.gz.
Now unpack the kernel and the patch. Copy the tarballs into /usr/src and move the linux directory to linux-old; then run the commands:
    tar -Ixf linux-2.2.19.tar.bz2
    bzcat linux-2.2.19-reiserfs-3.5.33-patch.bz2 | patch -p0

Compile the Linux kernel with ReiserFS support enabled in the filesystem section. Then build and install the utilities:

    cd /usr/src/linux/fs/reiserfs/utils
    make
    make install

Install the new kernel and reboot. Now you can create a new ReiserFS filesystem with the command:
    mkreiserfs /dev/xxxx

and mount it:

    mount -t reiserfs /dev/xxx /mount_dir
The next step is a benchmark analysis using the bonnie++ program, available at www.coker.com.au/bonnie++. The program tests database-type access to a single file, and it tests the creation, reading and deletion of small files, which can simulate the usage of programs such as Squid, INN, or Maildir-format programs (qmail).
The benchmark command was:
    bonnie++ -d/work1 -s10 -r4 -u0

which runs the test using a 10 MB file (-s10) on the filesystem mounted on the /work1 directory. So, before launching the benchmark, you must create the filesystem to be tested on a partition and mount it on /work1. The other flags specify the amount of RAM in MB (-r4) and the user (-u0, i.e. run as root).
The results are shown in the following table.
Filesystem | Size | Sequential Output: Per Char (K/sec) | % CPU | Sequential Output: Block (K/sec) | % CPU | Sequential Output: Rewrite (K/sec) | % CPU | Sequential Input: Per Char (K/sec) | % CPU | Sequential Input: Block (K/sec) | % CPU | Random Seeks (/sec) | % CPU |
ext2 | 10M | 1471 | 97 | 14813 | 67 | 1309 | 14 | 1506 | 94 | 4889 | 15 | 309.8 | 10 |
ext3 | 10M | 1366 | 98 | 2361 | 38 | 1824 | 22 | 1482 | 94 | 4935 | 14 | 317.8 | 10 |
xfs | 10M | 1206 | 94 | 9512 | 77 | 1351 | 33 | 1299 | 98 | 4779 | 80 | 229.1 | 11 |
reiserfs | 10M | 1455 | 99 | 4253 | 31 | 2340 | 26 | 1477 | 93 | 5593 | 26 | 174.3 | 5 |

Filesystem | Num Files | Sequential Create: Create (/sec) | % CPU | Sequential Create: Read (/sec) | % CPU | Sequential Create: Delete (/sec) | % CPU | Random Create: Create (/sec) | % CPU | Random Create: Read (/sec) | % CPU | Random Create: Delete (/sec) | % CPU |
ext2 | 16 | 94 | 99 | 278 | 99 | 492 | 97 | 95 | 99 | 284 | 100 | 93 | 41 |
ext3 | 16 | 89 | 98 | 274 | 100 | 458 | 96 | 93 | 99 | 288 | 99 | 97 | 45 |
xfs | 16 | 92 | 99 | 251 | 96 | 436 | 98 | 91 | 99 | 311 | 99 | 90 | 41 |
reiserfs | 16 | 1307 | 100 | 8963 | 100 | 1914 | 99 | 1245 | 99 | 9316 | 100 | 1725 | 100 |
Two values are shown for each test: the speed of the filesystem (in K/sec, or in operations per second for the seek and file tests) and the CPU usage (in %). The higher the speed, the better the filesystem; the opposite is true for the CPU usage.
As you can see, ReiserFS wins hands down in managing files (sections Sequential Create and Random Create), beating its opponents by a factor higher than 10 in every test except the Delete test of Sequential Create, where the factor is about 4. In addition, it is almost as good as the other filesystems in the Sequential Output and Sequential Input tests. There isn't any significant difference among the other filesystems: XFS speed is similar to that of ext2, and ext3 is, as expected, a little slower than ext2 (it is basically the same thing, and it spends some time in the journalling calls).
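The speed-up factors can be checked directly from the table. The following is a small sketch; the helper name ratio is invented for the example, and the figures passed to it are the reiserfs and ext2 operations per second reported in the Sequential Create section.

```shell
#!/bin/sh
# ratio A B: print A / B rounded to one decimal place.
# Fails (non-zero exit status) if B is zero.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (b + 0 == 0) exit 1
        printf "%.1f\n", a / b
    }'
}

ratio 1307 94     # Sequential Create, Create: reiserfs vs ext2 -> 13.9
ratio 8963 278    # Sequential Create, Read:   reiserfs vs ext2 -> 32.2
ratio 1914 492    # Sequential Create, Delete: reiserfs vs ext2 -> 3.9
```

The first two ratios are well above 10, while the sequential delete ratio is closer to 4.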
As a last test, I got the mongo benchmark program, available on the ReiserFS benchmark page at www.namesys.com, and modified it in order to test the three journalling filesystems. I added to the mongo.pl Perl script the commands to format and mount the xfs and ext3 filesystems. Then I started a benchmark analysis.
The script formats the partition /dev/xxxx, mounts it and runs the given number of processes during each phase: Create, Copy, Symlinks, Read, Stats, Rename and Delete. The program also calculates fragmentation after the Create and Copy phases:
    Fragm = number_of_fragments / number_of_files

You can find the results in the results directory, in the following files:
    log - raw results
    log.tbl - results for compare program
    log_table - results in table form

The tests were executed as in the following example:
    mongo.pl ext3 /dev/hda3 /work1 logext3 1

where ext3 must be replaced by reiserfs or xfs in order to test the other filesystems. The other arguments are the device on which the filesystem to test is located, the mount directory, the name of the file where the results are stored and the number of processes to start.
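The fragmentation figure that mongo reports (Fragm = number_of_fragments / number_of_files) is a plain ratio, so a value of 1.00 means one fragment per file. The sketch below reproduces the arithmetic only; the function name fragm and the counts in the usage example are hypothetical and are not taken from the mongo script or its logs.

```shell
#!/bin/sh
# fragm NUMBER_OF_FRAGMENTS NUMBER_OF_FILES
# Prints the fragmentation ratio with two decimals, as in the tables.
# Fails (non-zero exit status) if the number of files is zero.
fragm() {
    awk -v frags="$1" -v files="$2" 'BEGIN {
        if (files + 0 == 0) exit 1
        printf "%.2f\n", frags / files
    }'
}

# Hypothetical example: 132 fragments spread over 100 files.
fragm 132 100    # prints 1.32
```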
The following tables report the results of this analysis. The data reported is time (in seconds): the lower the value, the better the filesystem. In the first table the median size of the files managed is 100 bytes, in the second one it is 1000 bytes and in the last one 10000 bytes.
Phase | ext3 files=68952 size=100 bytes dirs=242 | XFS files=68952 size=100 bytes dirs=241 | ReiserFS files=68952 size=100 bytes dirs=241 |
Create | 90.07 | 267.86 | 53.05 |
Fragm. | 1.32 | 1.02 | 1.00 |
Copy | 239.02 | 744.51 | 126.97 |
Fragm. | 1.32 | 1.03 | 1.80 |
Slinks | 0 | 203.54 | 105.71 |
Read | 782.75 | 1543.93 | 562.53 |
Stats | 108.65 | 262.25 | 225.32 |
Rename | 67.26 | 205.18 | 70.72 |
Delete | 23.80 | 389.79 | 85.51 |
Phase | ext3 files=11248 size=1000 bytes dirs=44 | XFS files=11616 size=1000 bytes dirs=43 | ReiserFS files=11616 size=1000 bytes dirs=43 |
Create | 30.68 | 57.94 | 36.38 |
Fragm. | 1.38 | 1.01 | 1.03 |
Copy | 75.21 | 149.49 | 84.02 |
Fragm. | 1.38 | 1.01 | 1.43 |
Slinks | 16.68 | 29.59 | 19.29 |
Read | 225.74 | 348.99 | 409.45 |
Stats | 25.60 | 46.41 | 89.23 |
Rename | 16.11 | 33.57 | 20.69 |
Delete | 6.04 | 64.90 | 18.21 |
Phase | ext3 files=2274 size=10000 bytes dirs=32 | XFS files=2292 size=10000 bytes dirs=31 | ReiserFS files=2292 size=10000 bytes dirs=31 |
Create | 27.13 | 25.99 | 22.27 |
Fragm. | 1.44 | 1.02 | 1.05 |
Copy | 55.27 | 55.73 | 43.24 |
Fragm. | 1.44 | 1.02 | 1.12 |
Slinks | 1.33 | 2.51 | 1.43 |
Read | 40.51 | 50.20 | 56.34 |
Stats | 2.34 | 1.99 | 3.52 |
Rename | 0.99 | 1.10 | 1.25 |
Delete | 3.40 | 8.99 | 1.84 |
From these tables you can see that ext3 is usually faster in Stats, Delete and Rename, while ReiserFS wins in Create and Copy. Also note that the performance of ReiserFS is better in the first case (small files), as expected from its technical documentation.
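To see at a glance which filesystem is fastest in each phase, the timing rows can be fed to a small filter. This is a sketch under two assumptions: the rows use the pipe-separated layout of the tables with the columns in the order ext3, XFS, ReiserFS, and a time of 0 (as in the Slinks entry for ext3 in the first table) means that no measurement is available. The function name fastest is made up for the example.

```shell
#!/bin/sh
# fastest: reads rows of the form "Phase | ext3 | XFS | ReiserFS |" on
# stdin and prints "Phase winner" for each timing row, where the winner
# is the filesystem with the lowest time in seconds.
# "Fragm." rows are ratios, not times, so they are skipped.
# Assumption: a time of 0 means "not measured" and is ignored.
fastest() {
    awk -F'|' '
        BEGIN { name[2] = "ext3"; name[3] = "XFS"; name[4] = "ReiserFS" }
        {
            gsub(/ /, "", $1)
            if ($1 == "" || $1 == "Fragm.") next
            best = 0
            for (i = 2; i <= 4; i++) {
                t = $i + 0
                if (t > 0 && (best == 0 || t < min)) { best = i; min = t }
            }
            if (best) print $1, name[best]
        }
    '
}

# Usage, with three rows from the 100-byte table:
#   printf '%s\n' 'Create | 90.07 | 267.86 | 53.05 |' \
#                 'Fragm. | 1.32 | 1.02 | 1.00 |' \
#                 'Delete | 23.80 | 389.79 | 85.51 |' | fastest
```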
Considering the benchmark results my advice is to install a reiserFS filesystem in the future (I'll surely do it).