22 March 2016

ZFS vs Hardware Raid

Since we need to upgrade our storage space, and since our machines contain two raid controllers (one for the internal disks and one for the external disks), we tested whether a software raid could replace a traditional hardware based raid.
ZFS is the most advanced system in that respect, so ZFS on Linux was tested for this purpose and proved to be a good choice here as well.
This post describes the general read/write and failure tests; a later post will cover additional tests such as rebuilding the raid after a disk failure, different failure scenarios, and setup and format times.
Please use the comment section if you would like to see other tests as well.


Hardware test configuration:

  1. DELL PowerEdge R510 
  2. 12x2TB SAS (6Gbps) internal storage on a PERC H700 controller
  3. 2 external MD1200 devices with 12x2TB SAS (6Gbps) on a PERC H800 controller
  4. 24GB RAM
  5. 2 x Intel Xeon E5620 (2.4GHz)
  6. all raid controller settings were left at their defaults for all tests, except for the cache policy, which was set to "write through"
ZFS test system configuration:
  1. SL6 OS
  2. ZFS based on the latest version available in the repository 
  3. no ZFS compression used
  4. 1xraidz2 + hotspare for all the disks on H700  (zpool tank)
  5. 1xraidz2 + hotspare for all the disks on H800  (zpool tank800)
  6. since neither controller supports JBOD, each disk unfortunately had to be defined as a single-disk raid0 in both controllers (a sketch of the pool creation follows this list)
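
A minimal sketch of how the two test pools can be created; sdb..sdm are placeholder names for the single-disk raid0 volumes exported by the controller (in practice /dev/disk/by-id paths are preferable):

    # pool "tank": one raidz2 vdev plus a hot spare over the volumes exported by the H700
    zpool create tank raidz2 sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl spare sdm
    # "tank800" is created the same way over the volumes exported by the H800
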
Hardware raid test system configuration:
  1. same machine with same disks, controllers, and OS used as for the ZFS test configuration
  2. 1xraid6 + hotspare for all the disks on H700
  3. 1xraid6 + hotspare for all the disks on H800 
  4. the space was divided into 8TB partitions and formatted with ext4 (a sketch of the formatting follows this list)
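
A rough sketch of the formatting step; the device name is taken from the df output further below, everything else is an assumption:

    # each 8TB partition of the raid6 volume is formatted with ext4 and mounted
    mkfs.ext4 /dev/sdb2
    mkdir -p /mnt/gridstorage02
    mount /dev/sdb2 /mnt/gridstorage02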

Read/Write speed test



  • time (dd if=/dev/zero of=/tank800/test10G bs=1M count=10240 && sync)
  • time (dd if=/tank800/test10G of=/dev/null bs=1M && sync)
  • the first number in each result is the transfer rate reported by "dd"
  • the elapsed time comes from "time"; the second number, in parentheses, is the effective rate over that elapsed time (the 10240 MB written or read divided by the wall-clock time, which includes the final sync)
  • the write tests were done first for both controllers, then the read tests (see the sketch below)
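
A minimal sketch of one timed run, assuming bash (SECONDS is bash's built-in timer); the final echo reproduces the parenthesised effective rate:

    # timed write, flushed to disk before the clock stops
    SECONDS=0
    dd if=/dev/zero of=/tank800/test10G bs=1M count=10240 && sync
    echo "effective write rate: $((10240 / SECONDS)) MB/s"   # e.g. 10240 MB / 62 s = 165 MB/s

The read test is timed the same way, with the roles of the input and output files swapped.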


    H700 results

    ZFS based:
    write: 236MB/s, 1min:02 (165MB/s)
    read:  399MB/s, 0min:27 (379MB/s)

    Hardware raid based:
    write: 233MB/s, 1min:10 (146MB/s)
    read:    1.2GB/s, 0min:18 (1138MB/s)

    H800 results

    ZFS based:
    write: 619MB/s, 0min:23 (445MB/s)
    read:  2.0GB/s, 0min:05 (2048MB/s)

    Hardware raid based:
    write: 223MB/s, 1min:13 (140MB/s)
    read:  150MB/s, 1min:12 (142MB/s)

    H700 and H800 mixed

    • 6 disks from each controller were used together in a combined raid configuration
    • this kind of configuration is not possible with a hardware based raid (a sketch of the pool layout follows the results below)
    ZFS result:
    write: 723MB/s, 0min:37 (277MB/s)
    read:  577MB/s, 0min:18 (568MB/s)
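
    A rough sketch of this mixed layout; the pool name and the device names for the six single-disk raid0 volumes taken from each controller are placeholders:

    # one raidz2 vdev spanning disks from both controllers
    zpool create tankmix raidz2 sdb sdc sdd sde sdf sdg sdn sdo sdp sdq sdr sds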

    Conclusion

    • for the H800 based raid, ZFS performs much better than the hardware raid based system
    • the large difference between the ZFS based and hardware raid based reads needs more investigation
      • repeating the same tests two more times gave results of the same order, however
    • the H800 performs much better than the H700 when using ZFS, but not in the hardware raid configuration

    Failure Test

    Here we tested what happens when a 100GB file (test.tar) is copied (with cp and rsync) from the H800 based raid to the H700 based raid and the system fails during the copy, simulated by a cold reboot through the remote console.
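
    A sketch of the copy commands (the exact rsync flags are an assumption); rsync writes the transfer into a temporary dot-file and only renames it on completion, which is where the ".test.tar.EM379W" entry in the listing below comes from:

    cp /tank800/test.tar /tank/
    rsync -av /tank800/test.tar /tank/    # partial transfer lives in /tank/.test.tar.XXXXXX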

    ZFS result:

    [root@pool6 ~]# ls -lah /tank
    total 46G
    drwxr-xr-x.  2 root root    5 Mar 19 20:11 .
    dr-xr-xr-x. 26 root root 4.0K Mar 19 20:17 ..
    -rw-r--r--.  1 root root  16G Mar 19 19:07 test10G
    -rw-r--r--.  1 root root  13G Mar 19 20:12 test.tar
    -rw-------.  1 root root  18G Mar 19 20:06 .test.tar.EM379W

    [root@pool6 ~]# df -h /tank
    Filesystem      Size  Used Avail Use% Mounted on
    tank             16T   46G   16T   1% /tank
    [root@pool6 ~]# du -sch /tank
    46G     /tank
    46G     total

    [root@pool6 ~]# rm /tank/*test.tar*
    rm: remove regular file `/tank/test.tar'? y
    rm: remove regular file `/tank/.test.tar.EM379W'? y
    [root@pool6 ~]# du -sch /tank
    17G     /tank
    17G     total

    [root@pool6 ~]# ls -la /tank
    total 16778239
    drwxr-xr-x.  2 root root           3 Mar 19 20:21 .
    dr-xr-xr-x. 26 root root        4096 Mar 19 20:17 ..
    -rw-r--r--.  1 root root 17179869184 Mar 19 19:07 test10G
    • everything is consistent (a quick check is sketched below)
    • no file system check was needed at reboot
    • no problems occurred at all
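
    A way to confirm this after the cold reboot (a sketch, not part of the original log) is to check the pool state and run a scrub:

    zpool status tank     # should report the pool state as ONLINE with no errors
    zpool scrub tank      # verifies all checksums in the background
    zpool status tank     # shows the scrub progress and result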

    Hardware raid based result:

    [root@pool7 gridstorage02]# ls -lhrt
    total 1.9G
    drwx------    2 root   root    16K Jun 26  2012 lost+found
    drwxrwx---   91 dpmmgr dpmmgr 4.0K Feb  4  2013 ildg
    -rw-r--r--    1 root   root      0 Mar  6  2013 thisisgridstor2
    drwxrwx---   98 dpmmgr dpmmgr 4.0K Aug  8  2013 lhcb
    drwxrwx---  609 dpmmgr dpmmgr  20K Aug 27  2014 cms
    drwxrwx---    6 dpmmgr dpmmgr 4.0K Nov 23  2014 ops
    drwxrwx---    6 dpmmgr dpmmgr 4.0K Mar 13 12:18 ilc
    drwxrwx---    9 dpmmgr dpmmgr 4.0K Mar 13 23:04 lsst
    drwxrwx---  138 dpmmgr dpmmgr 4.0K Mar 14 10:23 dteam
    drwxrwx--- 1288 dpmmgr dpmmgr  36K Mar 15 00:00 atlas
    -rw-r--r--    1 root   root   1.9G Mar 18 17:11 test.tar

    [root@pool7 gridstorage02]# df -h .
    Filesystem            Size  Used Avail Use% Mounted on
    /dev/sdb2             8.1T  214M  8.1T   1% /mnt/gridstorage02

    [root@pool7 gridstorage02]# du . -sch
    1.9G    .
    1.9G    total

    [root@pool7 gridstorage02]# rm test.tar 
    rm: remove regular file `test.tar'? y

    [root@pool7 gridstorage02]# du . -sch
    41M     .
    41M     total

    [root@pool7 gridstorage02]# df -h .
    Filesystem            Size  Used Avail Use% Mounted on
    /dev/sdb2             8.1T -1.7G  8.1T   0% /mnt/gridstorage02

    • the hardware raid based tests were done first, on a machine previously used as a dpm client; its directory structure was therefore still present, but empty
    • during the reboot a file system check was done
    • "df" reports a different number for the used space than "du" and "ls"
    • after removing the file, the used space reported by "df" is negative
    • the file system is not consistent anymore (a possible repair step is sketched below)
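
    A sketch of how such an inconsistency would typically be repaired on the ext4 side (device name taken from the df output above; this was not part of the test):

    umount /mnt/gridstorage02
    fsck.ext4 -f /dev/sdb2    # forced full check; add -y to answer repair prompts automatically
    mount /dev/sdb2 /mnt/gridstorage02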

    Conclusion here:

    • for the planned extension (17x2TB exchanged for 8TB disks), the new disks should be placed in the MD devices and managed by the H800 using ZFS
    • a second zpool can be used for all remaining 2TB disks (on the H700 and H800 together)
    • ZFS seems to handle system failures better 
    To be continued...

    1 comment:

    wazoox said...

    You really should run tests with files much bigger than RAM size, else caching will get in the way and make the results irrelevant. You have 24G of RAM, so you should run your tests with 48G files.

    You may also run "echo 3 > /proc/sys/vm/drop_caches" to empty the file cache between runs for more consistent results.

    Another point to take into account is the IO scheduler. Most distributions use cfq (Completely Fair Queuing) as the default; unfortunately it is often a poor choice for a server, particularly when using hardware RAID. Use the "noop" scheduler for perfectly fair tests: run "echo noop > /sys/block/<device>/queue/scheduler" for all drives.

    Last, you may need to adjust the IO queue length and read-ahead. The default values are quite correct for old ATA drives with small caches, but very suboptimal for RAID arrays. Most RAID controllers need much longer queues than the default of 128 (512 or 1024): echo 1024 > /sys/block/<device>/queue/nr_requests
    And most hardware RAID controllers have large caches that give better results for sequential IO with big read-ahead values: echo 8192 > /sys/block/<device>/queue/read_ahead_kb
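
    For reference, these suggestions could be applied along the following lines (a sketch; the device glob and the values should be adapted to the actual setup):

    # drop the page cache between benchmark runs
    sync; echo 3 > /proc/sys/vm/drop_caches

    # per-device IO tuning for all sd* devices
    for dev in /sys/block/sd?; do
        echo noop > "$dev/queue/scheduler"
        echo 1024 > "$dev/queue/nr_requests"
        echo 8192 > "$dev/queue/read_ahead_kb"
    done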