GrabDuck

Performance - Linux Raid Wiki

:

I have made some testing of performance of different types of RAIDs, with 2 disks involved. I have used my own home grown testing methods, which are quite simple, to test sequential and random reading and writing of 200 files of 40 MB. The tests were meant to see what performance I could get out of a system mostly oriented towards file serving, such as a mirror site.

My configuration was

   1800 MHz AMD Sempron(tm) Processor 3100+
   1500 MB RAM
   nVidia Corporation CK804 Serial ATA Controller
   2 x  Hitachi Ultrastar A7K 1000 SCSI-II 1 TB.
   Linux version 2.6.12-26mdk
   Tester: Keld Simonsen, keld@dkuug.dk

Figures are in MB/s, and the file system was ext3. The chunk size was 256 kiB. Times were measured with iostat, and an estimate for steady performance was taken. The times varied quite a lot over the different 10 second intervals, for example the estimate 155 MB/s ranged from 135 MB/s to 163 MB/s. I then looked at the average over the period when a test was running in full scale (for example all processes started, and none stopped).

   RAID type      sequential read     random read    sequential write   random write
   Ordinary disk       82                 34                 67                56
   RAID0              155                 80                 97                80
   RAID1               80                 35                 72                55
   RAID10,n2           79                 56                 69                48
   RAID10,f2          150                 79                 70                55

Random read for RAID1 and RAID10,n2 were quite unbalanced, almost only coming out of one of the disks.

The results are quite as expected:

RAID0 and RAID10,f2 reads are double speed compared to ordinary file system for sequential reads (155 vs 82) and more than double for random reads (80 vs 35).

Writes (both sequential and random) are roughly the same for ordinary disk, RAID1, RAID10 and RAID10,f2, around 70 MB/s for sequential, and 55 MB/s for random.

Sequential reads are about the same (80 MB/s) for ordinary partition, RAID1 and RAID10.

Random reads for ordinary partition and RAID1 is about the same (35 MB/s) and about 50 % higher for RAID10. I am puzzled why RAID10 is faster than RAID1 here.

All in all RAID10,f2 is the fastest mirrored RAID for both sequential and random reading for this test, while it is about equal with the other mirrored RAIDs when writing.

My kernel did not allow me to test RAID10,o2 as this is only supported from kernel 2.6.18.

Remark from keld: The tests reported by Mathias B below is actually carried out in an environment with almost 100 % CPU utilization, so I am not sure how enlightening the numbers are.

Mathias B posted some benchmarks to the mailing list with this setup:

   Motherboard: Zotac ION Synergy DDR2 (Atom 330 overclocked to 2GHz, 667 FSB)
   RAM: 4GB DDR2 PC5300
   SATA controller: 05:00.0 SCSI storage controller: HighPoint Technologies, Inc. RocketRAID 230x 4 Port SATA-II             Controller (rev 02)
   SATA controller: nVidia Corporation MCP79 AHCI Controller (rev b1)
   Hard drives:
   Model=WDC WD20EARS-00MVWB0, FwRev=51.0AB51
   Model=WDC WD20EARS-00MVWB0, FwRev=50.0AB50
   Model=WDC WD20EARS-00MVWB0, FwRev=50.0AB50
   Model=SAMSUNG HD204UI, FwRev=1AQ10003
   Model=WDC WD20EARS-00MVWB0, FwRev=51.0AB51
   Model=SAMSUNG HD204UI, FwRev=1AQ10003

3 of these are connected to the PCI-E (1.0) SATA HBA and thereby bottle necked. OS is Archlinux 64-bit, kernel 2.6.37.1 & mdadm 3.1.4.

md details:

   /dev/md0:
   Version : 1.2
   Creation Time : Tue Oct 19 08:58:41 2010
   Raid Level : raid5
   Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
   Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
   Raid Devices : 6
   [...]
   Layout : left-symmetric
   Chunk Size : 64K

ext4 details:

   RAID stride:              16
   RAID stripe width:        80
   rw,noatime,barrier=1,stripe=80,data=writeback

block details:

   md0 block readahead 65536
   lvm lv block readahead 16384
   /sys/block/md0/md/stripe_cache_size 16384
   /sys/block/md0/queue/read_ahead_kb 65536

The 6 HDDs are in a RAID5 array, with LVM on top and then ext4 on top of that. This is a very low end system so many results are bottlenecked by the CPU. NCQ enabled on all drives. Here are some bonnie++ results:

   		------Sequential Output------			--Sequential Input-	    --Random-		------Sequential Create------			--------Random Create--------
   		-Per Chr-	--Block--	-Rewrite-	-Per Chr-	--Block--   --Seeks--		-Create--	--Read---	-Delete--	-Create--	--Read---	-Delete--
   Machine Size	K/sec	%CP	K/sec	%CP	K/sec	%CP	K/sec	%CP	K/sec	%CP   /sec	%CP	/sec	%CP	/sec	%CP	/sec	%CP	/sec	%CP	/sec	%CP	/sec	%CP
   ion 7G	16086	90	161172	78	69515	34	17887	97	258229   43   424	2	25336	98	+++++	+++	31240	99	26047	99	+++++	+++	31072	99
   ion 7G	17434	98	184348	89	147077	73	18282	99	382536   51   465.3	2	25401	99	+++++	+++	28185	90	26125	99	+++++	+++	31352	100
   ion 7G	17303	98	186702	90	142494	70	18310	99	356491   49   467.7	2	25447	99	+++++	+++	28446	90	25677	99	+++++	+++	31305	99
   ion 7G	17322	98	171309	89	146962	74	18348	99	365177   51   456.9	2	20419	80	+++++	+++	31455	100	25966	99	+++++	+++	31096	99
   ion 7G	17359	98	184704	91	123822	57	18282	99	375295   49   463.2	2	24908	98	+++++	+++	31465	99	25969	99	+++++	+++	31285	99
   ion 7G	17310	98	182821	90	124661	58	18347	99	385963   52   459.5	2	24710	98	+++++	+++	31840	99	26162	98	+++++	+++	31502	99

They vary a bit because I played with readahead and cache settings, ultimately I ended up with the settings posted above the results.

Durval Menezes repeated the above benchmark in Oct 2008 with a different configuration (3 500GB SATA disks) and a newer kernel (2.6.24), and also included the RAID10,o2 mode, reaching very similar conclusions (nothing beats RAID10,f2 overall).

Nat Makarevitch made an extensive benchmark for database with 6 and 10 spindles.

Justin Piszcz made a comparison in March 2008 with bonnie++ test of raid10,f2 n2 o2 raid5 of 10 Raptor drives, in a vanilla and an optimized version.

Conway S. Smith made a bonnie++ comparison in March 2008 raid5 and raid6 with 4 drives and varying Chunk Sizes.

Justin Piszcz made a comparison in May 2008 with bonnie++ test of raid levels 0 1 4 5 6 10,f2 10,n2 10,o2 of 6 SATA drives.

In Dec 2007 Jon Nelson made a test of raid levels 0, 5 and 10 f2 n2 o2 for sequential read and write, also in degraded mode, on 3 SATA drives.

Bill Davidsen reported Feb 2008: This is what I measure running an E6600 CPU and 3xSeagate 320 with Recent FC7 kernel. All reads and writes to the raw array using dd, 1MB buffer, 1GB i/o to/from /dev/{zero,null} for raw speed. Units are MB/s, 64k chunks, speed as reported by dd.

RAID lvl        read        write
0               110         143
1                52.1        49.5
10               79.6        76.3
10f2            145          64.5
raw one disk     53.5        54.7

Keld's remarks: the raid0 read figure of 110 MB/s is not consistent with other benchmarks that report about cumulative performance for sequential reads for raid0. I would have expected a figure around 150 MB/s here.

In July 2008, Jon Nelson made test of levels 5, 6, and 10, with differing chunk sizes from 64 kiB to 2 MiB on 4 SATA drives. This was for sequential read and writes on the raw raid types. With a file system, differences in write performance would probably smoothen out write differences due to the effect of the IO scheduler (elevator). Results include high performance of raid10,f2 - around 3.80, and high performance of raid5 and raid6. Especially with bigger chunk sizes 512 kiB - 2 MiB, raid5 obtained maximum 3.44 times the speed of a single drive, and raid6 got a factor 2.96. This is probably due to the even distribution of parity chunks, that means that reads are distributed evenly on all involved (4) drives.

In July 2008, Ben Martin made a test comparing HW and SW raid with 6 disks. The HW raid was a quite expensive USD 800 Adaptec SAS-31205 PCI Express 12-SATA-port PCI-E x8 hardware RAID card. Compared raid types included 5, 6 and 10,n2. Some conclusions: The difference is not big between the expensive HW raid controller and Linux SW RAID. For raid5 Linux was 30 % faster (440 MB/s vs 340 MB/s) for reads. For writes Adaptec was about 25 % faster (220 MB/s vs 175 MB/s). Keld's remarks: The Adaptec controller actually slowed down disk reading, the single disk read on the motherboard was 90 MB/s while via the Adaptec controller it was only 70 MB/s. Writing was faster, tho. raid10,f2 and raid10,o2 was not included in the test. Ben reported the Adaptec controller to give around 310 MB/s read performance for raid1, while a raid10,f2 would probably have given around 400 MB/s via the Adaptec controller, and around 600 MB/s with the 6 disks on the motherboard SATA controller and a reasonable extra controller, given other benchmarks as noted in this section, giving raid10,f2 a 95 % utilization of the cumulated IO bandwidth. For raid5 the read/write difference could possibly be explained by which chunk size to use, in Linux raid5 reading improves with bigger chunk sizes, while writing degrades.

In 2009 A Comparison of Chunk Size for Software RAID-5 was done by Rik Faith with chunk sizes of 4 KiB to 64 MiB. It was found that chunk sizes of 128 KiB gave the best overall performance. The test was done on a Supermicro AOC-SAT2-MV8 controller with 8 SATA II ports, and connected to a 32-bit PCI slot, which could explain the 130 MB/s max found.

A benchmark comparing chunk sizes from 4 to 1024 KiB on various RAID types (0, 5, 6, 10) was made in May 2010. For RAID types 5 and 6 it seems like a chunk size of 64 KiB is optimal, while for the other RAID types a chunk size of 512 KiB seemed to give the best results. The test were done on a controller which had an upper limit on about 350 MB/s. It is unclear which layout was used with RAID-10.

Sometimes there are apparent pauses in the stream of IO requests to the array component devices. The usual workaround is trying 'blockdev --setra 65536 /dev/mdN' and see if sequential reads improve. Also the stripe-cache_size is important for raid5 and raid6, and NCQ of the controller can interfere with the Linux kernel optimizations.

Here are some commands to alter default settings:

# Set read-ahead.
echo "Setting read-ahead to 32 MiB for /dev/md3"
blockdev --setra 65536 /dev/md3
# Set stripe-cache_size for RAID5.
echo "Setting stripe_cache_size to 16 MiB for /dev/md3"
echo 16384 > /sys/block/md3/md/stripe_cache_size
# Disable NCQ on all disks. (for raptors it increases the speed 30-40MiB/s)
echo "Disabling NCQ on all disks..."
for i in $DISKS
do
  echo "Disabling NCQ on $i"
  echo 1 > /sys/block/"$i"/device/queue_depth
done

One good way to see what is actually happening is to use either 'watch iostat -k 1 2' and look at the load on the individual MD array component devices, or use 'sysctl vm/block_dump=1' and look at the addresses being read or written.

There can be a number of bottlenecks other than the disk subsystem that hinders you in getting full performance out of your disks.

One is the PCI bus. Older PCI bus has a 33 MHz cycle and a 32 bit width, giving a maximum bandwidth of about 1 Gbit/s, or 133 MB/s. This will easily cause trouble with newer SATA or PATA disks which easily gives 70-90 MB/s each. So do not put your SATA controllers on a 33 MHz PCI bus.

The 66 MHz 64-bit PCI bus is capable of handling about 4 Gbit/s, or about 500 MB/s. This can also be a bottleneck with bigger arrays, eg a 6 drive array will be able to deliver about 500 MB/s, and maybe you want also to feed a gigabit ethernet card - 125 MB/s, totalling potentially 625 MB/s on the PCI bus.

The PCI (and PCI-X) bus is shared bandwidth, and may operate at lowest common denominator. Put a 33Mhz card in the PCI bus, and not only does everything operate at 33 Mhz, but all of the cards compete. Grossly simplified, if you have a 133 Mhz card and a 33 Mhz card in the same PCI bus, then that card may operate at 16 Mhz. Your motherboards' embedded Ethernet chip and disk controllers may be "on" the PCI bus, so even if you have a single PCI controller card, and a multiple-bus motherboard, then it may make a difference what slot you put the controller in.

If this isn't bad enough, then consider the consequences of arbitration. All of the PCI devices have to constantly negotiate between themselves to get a chance to compete against all of the other devices attached to other PCI busses to get a chance to talk to the CPU and RAM. As such, every packet your Ethernet card picks up could temporarily suspend disk I/O if you don't configure things wisely.

The PCI-Express bus v1.1 has a limit of 250 MB/s per lane per direction, and that limit can easily be hit eg by a 4-drive array, or even just 2 velociraptor disks.

Many newer SATA controllers are on-board and do not use the PCI bus, but are rather connected directly to the southbridge, even for the cheapest motherboards. Anyway bandwidth is limited, but it is probably different from motherboard to motherboard. On-board disk controllers most likely have a bigger bandwidth than IO controllers on a 32-bit PCI 33 MHz, 64-bit PCI 66 MHz, or PCI-E x1 bus. Some motherboards are reported to have a bidirectional 20 gigabit bus between the southbridge and the northbridge. Anyway most PCI-busses are connected via the southbridge.

Having a RAID connected over the LAN can be a bottleneck, if the LAN speed is only 1 Gbit/s - this limits the speed of the IO system to 125 MB/s by itself.

There may be some bottlenecks in the software. The software may be written to have an unbalanced access to the media, for example mostly using just one of the drives involved, or having a bias on which drives to use. It is a good idea to monitor the use of each of the drives' performance, for example via iostat. Threading, and asyncroneous IO may also enhance the performance. Related is the use of multicore CPU - are the CPUs used in a balanced way?

Compiler optimization may not have been done properly.

Classical bottlenecks are PATA drives placed on the same DMA channel, or the same PATA cable. This will of cause limit performance, but it should work, given you have no other means of connecting your disks by. Also placing more than one element of an array on the same disk hurts performace seriously, and also gives redundancy problems.

A classical problem is also not to have enabled DMA transfer, or having lost this setting due to some problem, including not well connected cables, or setting the transfer speed to less than optimal.

CPU usage may be a bottleneck, also combined with slow RAM.

BIOS settings may also impede your performance.

The information is quite dated, as can be seen from both the hardware and software specifications.

This section contains a number of benchmarks from a real-world system using software RAID. There is some general information about benchmarking software too.

Benchmark samples were done with the bonnie program, and at all times on files twice- or more the size of the physical RAM in the machine.

The benchmarks here only measures input and output bandwidth on one large single file. This is a nice thing to know, if it's maximum I/O throughput for large reads/writes one is interested in. However, such numbers tell us little about what the performance would be if the array was used for a news spool, a web-server, etc. etc. Always keep in mind, that benchmarks numbers are the result of running a "synthetic" program. Few real-world programs do what bonnie does, and although these I/O numbers are nice to look at, they are not ultimate real-world-appliance performance indicators. Not even close.

For now, I only have results from my own machine. The setup is:

  • Three IBM UltraStar 9ES 4.5 GB, SCSI U2W
  • One IBM UltraStar 9ES 4.5 GB, SCSI UW
  • Kernel 2.2.7 with RAID patches

The three U2W disks hang off the U2W controller, and the UW disk off the UW controller.

It seems to be impossible to push much more than 30 MB/s thru the SCSI busses on this system, using RAID or not. My guess is, that because the system is fairly old, the memory bandwidth sucks, and thus limits what can be sent thru the SCSI controllers.

RAID-0

Read is Sequential block input, and Write is Sequential block output. File size was 1GB in all tests. The tests where done in single-user mode. The SCSI driver was configured not to use tagged command queuing.


From this it seems that the RAID chunk-size doesn't make that much of a difference. However, the ext2fs block-size should be as large as possible, which is 4kB (eg. the page size) on IA-32.

       |            |              |             |              |
       |Chunk size  |  Block size  |  Read kB/s  |  Write kB/s  |
       |            |              |             |              |
       |4k          |  1k          |  19712      |  18035       |
       |4k          |  4k          |  34048      |  27061       |
       |8k          |  1k          |  19301      |  18091       |
       |8k          |  4k          |  33920      |  27118       |
       |16k         |  1k          |  19330      |  18179       |
       |16k         |  2k          |  28161      |  23682       |
       |16k         |  4k          |  33990      |  27229       |
       |32k         |  1k          |  19251      |  18194       |
       |32k         |  4k          |  34071      |  26976       |

RAID-0 with TCQ

This time, the SCSI driver was configured to use tagged command queuing, with a queue depth of 8. Otherwise, everything's the same as before.

       |            |              |             |              |
       |Chunk size  |  Block size  |  Read kB/s  |  Write kB/s  |
       |            |              |             |              |
       |32k         |  4k          |  33617      |  27215       |


No more tests where done. TCQ seemed to slightly increase write performance, but there really wasn't much of a difference at all.

RAID-5

The array was configured to run in RAID-5 mode, and similar tests where done.

       |            |              |             |              |
       |Chunk size  |  Block size  |  Read kB/s  |  Write kB/s  |
       |            |              |             |              |
       |8k          |  1k          |  11090      |  6874        |
       |8k          |  4k          |  13474      |  12229       |
       |32k         |  1k          |  11442      |  8291        |
       |32k         |  2k          |  16089      |  10926       |
       |32k         |  4k          |  18724      |  12627       |


Now, both the chunk-size and the block-size seems to actually make a difference.

RAID-1+0

RAID-1+0 is "mirrored stripes", or, a RAID-1 array of two RAID-0 arrays. The chunk-size is the chunk sizes of both the RAID-1 array and the two RAID-0 arrays. I did not do test where those chunk-sizes differ, although that should be a perfectly valid setup.

       |            |              |             |              |
       |Chunk size  |  Block size  |  Read kB/s  |  Write kB/s  |
       |            |              |             |              |
       |32k         |  1k          |  13753      |  11580       |
       |32k         |  4k          |  23432      |  22249       |


No more tests where done. The file size was 900MB, because the four partitions involved where 500 MB each, which doesn't give room for a 1G file in this setup (RAID-1 on two 1000MB arrays).

Fresh benchmarking tools

To check out speed and performance of your RAID systems, do NOT use hdparm. It won't do real benchmarking of the arrays.

Instead of hdparm, take a look at the tools described here: IOzone and Bonnie++.

IOzone is a small, versatile and modern tool to use. It benchmarks file I/O performance for read, write, re-read, re-write, read backwards, read strided, fread, fwrite, random read, pread, mmap, aio_read and aio_write operations. Don't worry, it can run on any of the ext2, ext3, reiserfs, JFS, or XFS filesystems in OSDL STP.

You can also use IOzone to show throughput performance as a function of number of processes and number of disks used in a filesystem, something interesting when it's about RAID striping.

Although documentation for IOzone is available in Acrobat/PDF, PostScript, nroff, and MS Word formats, we are going to cover here a nice example of IOzone in action:

 iozone -s 4096

This would run a test using a 4096KB file size.

And this is an example of the output quality IOzone gives

         File size set to 4096 KB
         Output is in Kbytes/sec
         Time Resolution = 0.000001 seconds.
         Processor cache size set to 1024 Kbytes.
         Processor cache line size set to 32 bytes.
         File stride size set to 17 * record size.
                                                             random  random    bkwd  record  stride
               KB  reclen   write rewrite    read    reread    read   write    read rewrite    read   fwrite frewrite   fread  freread
             4096       4   99028  194722   285873   298063  265560  170737  398600  436346  380952    91651   127212  288309   292633


Now you just need to know about the feature that makes IOzone useful for RAID benchmarking: the file operations involving RAID are the read strided. The example above shows a 380.952kB/sec. for the read strided, so you can go figure.

Bonnie++ seems to be more targeted at benchmarking single drives that at RAID, but it can test more than 2TB of storage on 32-bit machines, and tests for file creat, stat, unlink operations.