Second update: I have finished comparing RAID5 to RAID10. RAID5 is about 20% faster on sequential reads but 20% slower on sequential writes. Orion reports that total throughput over two RAID5 LUNs is about 20% lower than over two RAID10 LUNs with a 5:95 write:read ratio. Maximum sustained IOPS on the "small I/O" test also dropped from over 6000 to 4000, a decrease of about a third. My conclusion is therefore that unless you need the maximum possible sequential read performance or the maximum possible space utilisation, you should avoid RAID5 and pick RAID10.
Update: Just a day after I posted this, Dell released the MD3000i, which appears to be an MD3000 using gig/e ports instead of SAS ports. Again, the performance of the Dell MD3000i is entirely obscure but my guess is that it is the same as the MD3000. The controller software certainly sounds very familiar! At least this time the "fully equipped" MD3000i with four gig/e ports can blame any benchmark results on the maximum wire speed of 4 gig/e ports (less than 100 MB/sec x 4).
I've been spending some time benchmarking the Dell MD3000 PowerVault storage array under SuSE 10.2 x86_64 linux. There isn't a lot of information out there on this unit; one of the more useful pages I found is on this blog:
Performance of the MD3000 with ORION. In summary: it is ok for the price we paid (half retail), but this storage array, with the guts of an old IBM DS4100 which had an anemic 485 MB/sec internal bus speed, is not able to max out the total sequential read or write performance of the 15 disks it is able to contain. I imagine if you expand it with MD-1000 enclosures this deficit is even more obvious. More on that later.
The setup
Two Dell PowerEdge 1950 hosts, each with two SAS5/e HBAs (host bus adapter cards). The idea is to set up a highly available configuration. The SAS5/e cards each have two ports, but since the MD3000 has a maximum of 4 ports (two per physical controller module) I am only using ONE port on each card, and four SAS cables. Both hosts also have two internal drives connected to a PERC5/i, configured as a single mirror for the OS. The MD3000 does not support booting a host OS. It is fully populated with 15 Seagate 15k SAS drives (136gb usable).
Theoretical throughput
Each SAS5/e card is PCI-X, and each slot has a dedicated bus on the 1950. Each SAS cable runs at 3 gigabit/second full duplex. Each of the 15 drives can sustain a sequential read of about 90 MB/sec and a write of nearly that much. If they were 300gb drives then read performance would be over 100 MB/sec. There is little if any information from Dell on how the 15 drives are connected to the controller modules.
The Dell advertising blurb describes the MD3000 as having a possible peak bandwidth of 1400MB/sec:
Active-active RAID controllers can produce throughputs up to 1400MB/sec and approximately 90000 IOPS from cache
I wonder if it can reach that speed?
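As a rough sanity check on these figures, the arithmetic can be sketched in a few lines of Python. The drive count, the 90 MB/sec per-drive rate, the 3 gigabit/second per cable and the advertised 1400 MB/sec all come from this post; the function names are made up for illustration, and the conversion assumes the 8b/10b line encoding used by 3 gigabit SAS (10 bits on the wire per byte of data).

```python
# Back-of-envelope bandwidth figures quoted in this post (not measurements).
DRIVES = 15
DRIVE_SEQ_MB_S = 90       # sustained sequential read per drive, from the text
SAS_LINK_GBIT_S = 3       # per SAS cable, as assumed in the text
ADVERTISED_MB_S = 1400    # Dell's claimed peak throughput


def link_mb_per_sec(gbit_per_sec):
    """Usable MB/sec of a link with 8b/10b encoding (10 wire bits per byte)."""
    return gbit_per_sec * 1000 / 10


def aggregate_drive_mb_per_sec(drives, per_drive_mb_s):
    """Combined streaming rate if every drive ran flat out at once."""
    return drives * per_drive_mb_s


if __name__ == "__main__":
    disks = aggregate_drive_mb_per_sec(DRIVES, DRIVE_SEQ_MB_S)
    cable = link_mb_per_sec(SAS_LINK_GBIT_S)
    print(f"all drives streaming: {disks:.0f} MB/sec")
    print(f"one SAS cable:        {cable:.0f} MB/sec")
    print(f"two SAS cables:       {2 * cable:.0f} MB/sec")
    print(f"advertised peak:      {ADVERTISED_MB_S} MB/sec")
```

On these assumptions the drives could in aggregate just about feed the advertised figure, but two cables to one host could not carry it.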
Multi-path support the Dell way
Since each host is connected to the MD3000 via two HBAs, two cables and two MD3000 controller modules, transparent fail-over support would be an obvious item on the wish-list. The Dell resource CD provides an RDAC module that you are required to compile up yourself, and a newer mptsas kernel driver. There were several problems implementing things as Dell expect you to:
- Dell-provided mptsas does not compile on vanilla kernel releases after 2.6.20 (such as Fedora Core 7) due to an API change to the work queues.
- Dell-provided mptsas does not compile on distro kernels patched for "wide port API" because the code checks for kernel version 2.6.18 or more before enabling it. SuSE 10.2 is, for example, kernel 2.6.16 (with patches). This is easily fixed by adding two #defines.
- I also had trouble compiling up RDAC due to an incorrect symlink.
- The Dell-provided mptsas module is newer than the LSI official drivers, but there are no release notes or history for it, so it isn't clear what it fixes or adds vs the standard module from LSI.
After trying the array successfully with Fedora Core 5 and CentOS5 (which is RHEL 5 64bit), and exploring all the above issues, I settled on SuSE SLES-10-SP1 x86_64 (Suse 10 service pack 1 for 64bit) and used it as-is. There was no need to install anything other than the Java "SMdevices/SMmonitor/SMagent" stuff on the resource CD.
Multi-path support via multipath tools
As an alternative to the IBM/Dell RDAC solution I went with multipath-tools.
Linux multipath tools provide some amount of device independent support for multipath IO. In brief, once configured correctly they export /dev/dm-N devices that one should use instead of /dev/sd? devices. The /dev/dm-N devices are transparently (hopefully!) failed over and back depending on what the multipath daemon finds is going wrong with the underlying devices.
The problem with using multipath-tools on this MD3000 is that you must verify your kernel can speak RDAC in the device-mapper. This support comes in the form of a bunch of device mapper kernel patches, and Fedora Core 7 and perhaps most other distros do not have these by default. (I'm out of my depth here!) You'll know that you've not got them because you can't make multipath-tools work. SuSE 10.2 does have the patches.
The multipath configuration file that I used is:
defaults {
    getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
}
devices {
    device {
        vendor DELL*
        product MD3000*
        path_grouping_policy failover
        getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
        features "1 queue_if_no_path"
        # path_checker readsector0
        path_checker rdac
        prio_callout "/sbin/mpath_prio_tpc /dev/%n"
        hardware_handler "1 rdac"
        failback immediate
    }
}
blacklist {
    device {
        vendor Dell.*
        product Universal.*
    }
    device {
        vendor Dell.*
        product Virtual.*
    }
}
As you can see, I don't want multipath tools to try to probe either the DRAC5 management card "virtual devices", or the MD3000 management "Access disk" which appears as a 20gb drive that can't be used as a filesystem. Notice that the configuration file refers to the aforementioned kernel rdac support! (this is not the same as the Dell RDAC driver). Without the right kernel, the "path_checker" line will fail to work as will the "hardware_handler" line.
If all multipath-tools are installed without error then after fresh boot you can do something like this (the -d flag is "dry run" and is more likely to get you output than just -ll if you have any other issues with missing kernel features):
# multipath -ll -d
3600188b00040945c000024b346e14eb1dm-1 DELL ,MD3000
[size=50G][features=1 queue_if_no_path][hwhandler=1 rdac]
\_ round-robin 0 [prio=3][active]
 \_ 2:0:0:1 sdf 8:80 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:1 sdc 8:32 [active][ghost]
3600188b00046e0290000638e46e14f64dm-2 DELL ,MD3000
[size=50G][features=1 queue_if_no_path][hwhandler=1 rdac]
\_ round-robin 0 [prio=0][enabled]
 \_ 2:0:0:0 sde 8:64 [active][ghost]
\_ round-robin 0 [prio=3][active]
 \_ 1:0:0:0 sdb 8:16 [active][ready]
You can see that I set up two LUNs, one mapped through HBA #1 with a backup through HBA #2, and the other the reverse. The hot-standby paths are called "Ghosts" by multipath. If I joined these two LUNs up via LVM2 or mdadm (linux software raid) then in theory I am load balancing between the two HBAs. If a single "path" to the storage array fails (HBA, cable or MD3000 controller failure) then one dm-N device moves to its buddy on the other HBA and we should still be in business.
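To keep an eye on which sd? device is the live path and which is the Ghost, the listing can be picked apart with a short script. This is only a sketch: it assumes the exact output layout printed by the multipath-tools version shown above (newer releases format things differently), and the function names are my own invention.

```python
import re

# Map header, e.g. "3600188b...eb1dm-1 DELL ,MD3000" (WWID runs straight into dm-N).
MAP_LINE = re.compile(r"^\s*([0-9a-f]+?)\s*(dm-\d+)\s")
# Path line, e.g. " \_ 2:0:0:1 sdf 8:80 [active][ready]".
PATH_LINE = re.compile(
    r"\\_\s+(\d+:\d+:\d+:\d+)\s+(\w+)\s+\d+:\d+\s+\[(\w+)\]\[(\w+)\]"
)

SAMPLE = r"""
3600188b00040945c000024b346e14eb1dm-1 DELL ,MD3000
[size=50G][features=1 queue_if_no_path][hwhandler=1 rdac]
\_ round-robin 0 [prio=3][active]
 \_ 2:0:0:1 sdf 8:80 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:1 sdc 8:32 [active][ghost]
3600188b00046e0290000638e46e14f64dm-2 DELL ,MD3000
[size=50G][features=1 queue_if_no_path][hwhandler=1 rdac]
\_ round-robin 0 [prio=0][enabled]
 \_ 2:0:0:0 sde 8:64 [active][ghost]
\_ round-robin 0 [prio=3][active]
 \_ 1:0:0:0 sdb 8:16 [active][ready]
"""


def parse_paths(output):
    """Return {dm_name: [(sd_device, host:channel:id:lun, checker_state), ...]}."""
    maps = {}
    current = None
    for line in output.splitlines():
        header = MAP_LINE.match(line)
        if header:
            current = header.group(2)
            maps[current] = []
            continue
        path = PATH_LINE.search(line)
        if path and current is not None:
            hctl, dev, _dm_state, checker = path.groups()
            maps[current].append((dev, hctl, checker))
    return maps


def ghost_devices(maps):
    """List the sd devices that are standby (ghost) paths."""
    return sorted(
        dev for paths in maps.values() for dev, _hctl, state in paths
        if state == "ghost"
    )


if __name__ == "__main__":
    parsed = parse_paths(SAMPLE)
    for dm, paths in parsed.items():
        print(dm, paths)
    print("do not read directly from:", ghost_devices(parsed))
```

In real use the text would come from running multipath -ll rather than from the pasted SAMPLE string.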
Note that if the Ghost devices are accessed, either because you use them directly or because a real path failure occurs, the MD3000 will report via the management console that it is in a "non-optimal" state because a LUN moved from its "preferred controller" to the backup controller. You can trigger this simply by using dd to read from a Ghost device.
Note: if you do not install a multipath solution and put two cables from the host to the MD3000 you will see two LUNs (sd? devices). If you try to use both at the same time, the controllers will thrash, moving the disk array back and forth from slot 0 to slot 1 trying to keep up with your access pattern and performance will be awful. Don't ask how I figured this out.
Benchmarking Introduction
So it is all set up; how fast is it?
I played around with benchmarking this thing in a number of different ways. I've used software raid to stripe md0 across dm-1 and dm-2 (hoping to see better throughput when both HBAs are teamed), and I've tried LVM2 instead of software raid. I've used the underlying devices directly, with and without partitions, and also tried with Ext3 and XFS. In general the more layers the slower things become. For instance, Ext3 on top of LVM2 on top of dm-N on top of sd? might be 20% slower than just raw access to /dev/sdX.
The benchmarking tools I tried ranged from simple dd for sequential write and read, through hdparm -t for sequential non-buffered read and "seeker" (see below) for random single block IO, to iozone for a grid of data and Oracle "orion" for a simulation of database workloads.
When running any benchmark you have to be aware of the chain of cache in use for the test. There are two possible caches of concern: the physical memory of the machine running the test & the cache on the raid controllers inside the MD3000 (512mb, supposedly, although it isn't clear if this is 256mb per controller or what). There is also probably a small individual per-drive cache but that would be overwhelmed by the other caches.
In order to avoid testing cache speed instead of I/O array speed I made sure that the hosts were rebooted with mem=768m which means they have minimal memory free for blockio cache. The Orion benchmark tool can take into account an amount of cache before it runs - it fills the cache with random data before performance measurements start - so I took advantage of that flag to kill the MD3000 cache. Other than that I basically tried to make sure the tests involved many times the physical available memory of the host.
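The sizing rule can be put into numbers with a small sketch. The 768mb host memory and 512mb controller cache are the figures from this post; the multiple of 8 is purely my own assumption standing in for "many times", and the function names are illustrative.

```python
def min_test_size_mb(host_ram_mb, controller_cache_mb, multiple=8):
    """Smallest data set (MB) that is `multiple` times all the cache in the chain."""
    return multiple * (host_ram_mb + controller_cache_mb)


def cacheable_fraction(test_size_mb, host_ram_mb, controller_cache_mb):
    """Upper bound on the share of the data set that could sit in cache at once."""
    return min(1.0, (host_ram_mb + controller_cache_mb) / test_size_mb)


if __name__ == "__main__":
    size = min_test_size_mb(768, 512)
    print(f"test with at least {size} MB of data")
    print(f"at most {cacheable_fraction(size, 768, 512):.1%} of it fits in cache")
```

The bound is pessimistic, since the host would need to devote all of its memory to block cache to reach it.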
It is interesting to note that hdparm -t is nearly useless for this MD3000 because although it avoids any host cache it can't avoid the 256mb controller cache. hdparm -t can show the speed of an MD3000 LUN as 300mb/sec (pretty much the speed of one SAS cable).
The MD3000 "read" cache
It is possible to disable the 'read' cache on a per LUN basis. I've seen it written that read caches on external storage units should almost always be disabled because you want to reserve as much cache space as possible for non-blocking write operations. The read cache can be set with a script which is fed to the SMcli utility:
set virtualDisk["1"] readCacheEnabled=true readAheadMultiplier=1;
Setting the cache via the gui management interface is not possible.
My conclusions so far are that for throughput tests, disabling the readCache created too big an impact. Performance on long sequential reads dropped remarkably without a readCache. This was unexpected (how can a tiny 256mb cache help with reading 8gig of data sequentially?) unless the readCache is helping the controller modules read ahead using multiple drives in the disk group, and disabling the cache crimps that ability. So I left the readCache enabled.
Total throughput tests
The unit did not reach its theoretical performance in any total throughput test involving all drives. With 14 drives (7 unique and 7 more ready with duplicate data) total read performance could, theoretically, approach 80x14 = over one gigabyte a second! With a single SAS card and a JBOD array there is certainly evidence of this performance under linux 2.6. Dual HBA cards and/or two SAS cables should support 600mb/sec to the host.
Unfortunately the fastest I could get dd or iozone to go was about 280mb/sec in one direction. By combining two LUNs using software raid0, to combine controllers, speed rose to 370mb/sec for sequential read.
Orion reported over 400mb/sec total throughput with a mix of read and writes. IOzone would typically report around 300mb/sec seq write and a little more seq read.
By mixing dd out and in, three LUNs, and two disk groups total, throughput grew close to 600mb/sec. Perhaps with further experimentation it would be possible to determine what size disk group is optimal, and what mix of work generates the maximum total throughput.
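For anyone wanting to repeat the simple sequential read measurement without dd, a rough equivalent can be sketched in Python. This is not the tool I used, just an illustration of the same idea; the function names are made up. It does nothing to bypass the host page cache, so the target has to be much larger than host memory (or the cache dropped first) for the number to mean anything.

```python
import os
import tempfile
import time


def sequential_read_mb_per_sec(path, block_size=1024 * 1024):
    """Read `path` from start to end in block_size chunks; return (MB/sec, bytes read)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / (1024 * 1024) / max(elapsed, 1e-9), total


def make_scratch_file(size_bytes):
    """Create a throwaway file of random bytes, only for trying the function out."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size_bytes))
    return path


if __name__ == "__main__":
    scratch = make_scratch_file(8 * 1024 * 1024)
    try:
        rate, nbytes = sequential_read_mb_per_sec(scratch)
        print(f"read {nbytes} bytes at {rate:.0f} MB/sec (cached, so meaningless)")
    finally:
        os.remove(scratch)
```

Pointing it at a /dev/dm-N device instead of the scratch file (as root) gives the dd-style test; running one copy per LUN at the same time approximates the mixed runs described above.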
Single block random reads
Using the seeker.c random seek/read utility, modified to use 48 bit random numbers seeded correctly (not from time in seconds), and run in parallel 60 times or more, I could push the enclosure to about 6000 IOPS, at which point adding more work just increased latency with no increase in total IOs per second (reading the IOs from iostat). I think this result more correctly reflects the speed at which a 14 disk LUN can work than did the throughput tests. A single SAS drive can do only a few hundred IO operations per second (depending mainly on the drive's average seek time).
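The shape of that test can be sketched in Python. This is not seeker.c and not my modified version of it, only a minimal illustration of the same technique: several workers issuing single-block reads at random offsets, each seeded independently rather than from the clock. The function names, the 512 byte block and the worker counts are assumptions for the example, and os.pread makes it Unix-only.

```python
import os
import random
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

BLOCK = 512  # bytes per read; single-block I/O


def random_reads(path, count, seed):
    """Do `count` single-block reads at random block-aligned offsets; return reads completed."""
    rng = random.Random(seed)
    fd = os.open(path, os.O_RDONLY)
    try:
        blocks = os.lseek(fd, 0, os.SEEK_END) // BLOCK
        done = 0
        for _ in range(count):
            offset = rng.randrange(blocks) * BLOCK
            if len(os.pread(fd, BLOCK, offset)) == BLOCK:
                done += 1
        return done
    finally:
        os.close(fd)


def parallel_iops(path, workers=4, reads_per_worker=1000):
    """Run several readers at once; return (total reads, reads per second)."""
    # 48-bit seeds from os.urandom, so workers started in the same second
    # do not all walk the same sequence of offsets.
    seeds = [int.from_bytes(os.urandom(6), "big") for _ in range(workers)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        totals = list(pool.map(lambda s: random_reads(path, reads_per_worker, s), seeds))
    elapsed = time.perf_counter() - start
    return sum(totals), sum(totals) / max(elapsed, 1e-9)


def make_scratch_file(size_bytes):
    """Create a throwaway file of random bytes, only for trying the functions out."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size_bytes))
    return path


if __name__ == "__main__":
    scratch = make_scratch_file(1024 * BLOCK)
    try:
        total, iops = parallel_iops(scratch)
        print(f"{total} reads, {iops:.0f} reads/sec (from page cache, not disk)")
    finally:
        os.remove(scratch)
```

Against a small scratch file everything comes from the host page cache, so the figure is meaningless; the real test points at the LUN's block device, and iostat remains the better place to read the IO rate from.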
Oracle 'Orion' benchmark
The Orion manual is here.
Orion, using a 5% write 95% read mix, generated a matrix of results which I include below in a spreadsheet. Scroll the iframe right to see the results graph. Worryingly, however, on both hosts the full benchmark hard-locked the machine (no kernel panic) about 3/4 of the way through the 3+ hour run.
The spreadsheet has two tabs, one for MB/sec the other for IOPs. The matrix of results in the first tab represents a mix of small tasks and large tasks. With many small tasks and no large tasks, total throughput rises more slowly as workload increases (the bottom curve). With all large tasks and no small tasks (top curve), total throughput looks like it will plateau around 400mb/sec as workload increases.
The orion command line is using two LUNs as though they were striped together in a single volume (-simulate raid0). The two LUNs used are not shown; they are listed in a config file that is created before orion runs.
Other Resources:
Note: it appears to me that the MD3000 uses the same controllers as the IBM model DS4100. The same SMcli utility and script, same SMagent/SMmonitor utilities, same controller memory!
Page 4 of the IBM manual gives the performance characteristics of the DS4100 that Dell do not:
IOPS from cache: 70k
IOPS from disk: 10k
Disk through-put: 485MB/sec (although DS4100 only supported slow SATA drives)
You can pick up fully populated (SATA) DS4100s on eBay for $5k; retail was $21k. Pictures of the rear reveal a very similar controller arrangement of ports.
Conclusion
Well, I am rather miffed that the MD3000 is pimped on the Dell site as a state-of-the-art (albeit lower-end) modular storage array but is actually an IBM DS4100 in Dell drag - very "End Of Life" gear, no?
Documentation is appalling (the IBM manual is very good, however. Shame Dell didn't copy that as well). There is much more information on tuning the enclosure from IBM, which is also providing the RDAC kernel module, although IBM information is for their DS4100 only.
Performance is adequate for the dollars (we bought ours on ebay as reconditioned equipment) but the controller modules are clearly not capable of driving 15k SAS drives to their limit, and the controller cache memory dates from an era where 1gb of host memory was a big deal!