Seagate’s new third generation hybrid drive combines 8GB of MLC NAND flash with a 1TB mechanical drive spinning at 5400 RPM. They’re using the term solid state hybrid drive, or SSHD, for the product line. The big advance in this third generation is that the flash can be used as a write cache, and that write cache is protected by capacitors: the drive is supposed to shut down cleanly, preserving cached writes, when the power dies.
In theory that makes this drive a replacement for the common hard drive plus battery-backed write cache (BBWC) combination often used for good database performance. It could even beat the usual pairing of 1GB of BBWC RAM on a RAID controller with a 1TB hard drive, because it has a good bit more flash capacity. Buy two of these Seagate drives, use software RAID, and you’ve put together a very inexpensive combination that can safely break the rotation “speed of light” that limits maximum commit rate: a synchronous commit to a bare drive has to wait for the platter to spin back around, so commits per second can’t exceed rotations per second. A good BBWC RAID card will typically allow around 10,000 small commits/second into a regular disk drive, even though the drive itself might only manage <=250 rotations/second.
I picked up the laptop model of this drive, the first one released, for $120 from Newegg. Tests here show the drive does shut down cleanly as hoped, and the commit rate is a good bit higher than the 5400 RPM physical limit (5400 / 60 = 90 rotations, and therefore at most about 90 commits, per second). It’s not in the same class as a BBWC or a full SSD though, since it’s only giving me 816 commits/second here. There are some performance quirks, and seek performance is as terrible as on any 5400 RPM drive. But if you just want a good chunk of storage, durable writes, and a moderate ability to handle write bursts, this drive will do all of that quite cheaply.
My drive is a ST1000LM014-1EJ164:
# smartctl -i /dev/sdd
smartctl 5.39.1 2010-01-28 r3054 [x86_64-koji-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST1000LM014-1EJ164
Serial Number:    W3801P6G
Firmware Version: SM11
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   9
ATA Standard is:  Not recognized. Minor revision code: 0x1f
Local Time is:    Sun Jun  2 18:27:27 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
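One thing worth checking before trusting any drive with fsync traffic is whether its volatile write cache is actually turned on. A quick way to do that, assuming the same /dev/sdd device name used here, is hdparm:

# Report the state of the drive's volatile write cache; expect
# "write-caching = 1 (on)". /dev/sdd is just my device name.
hdparm -W /dev/sdd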
Partition and mount
Setting this drive up under Linux produced complaints about partition alignment at several steps, since it has 4096 byte physical sectors:
# fdisk /dev/sdd

The device presents a logical sector size that is
smaller than the physical sector size. Aligning to a physical sector (or
optimal I/O) size boundary is recommended, or performance may be impacted.
...
   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1               1      121601   976760001   83  Linux
Partition 1 does not start on physical sector boundary.

# mkfs.xfs /dev/sdd1
warning: device is not properly aligned /dev/sdd1
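You can confirm the mismatch that triggers those warnings before partitioning. A minimal check, again assuming /dev/sdd, reads the logical and physical sector sizes directly:

# Both commands should report 512 byte logical and 4096 byte physical sectors.
blockdev --getss --getpbsz /dev/sdd
cat /sys/block/sdd/queue/logical_block_size /sys/block/sdd/queue/physical_block_size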
It’s easy enough to get Linux’s fdisk to align partitions on physical sector boundaries though, so I did that:
[root@toy ~]# fdisk /dev/sdd

The device presents a logical sector size that is
smaller than the physical sector size. Aligning to a physical sector (or
optimal I/O) size boundary is recommended, or performance may be impacted.

WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
         switch off the mode (command 'c') and change display units to
         sectors (command 'u').

Command (m for help): c
DOS Compatibility flag is not set

Command (m for help): u
Changing display/entry units to sectors

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First sector (2048-1953525167, default 2048):
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-1953525167, default 1953525167):
Using default value 1953525167

Command (m for help): p

Disk /dev/sdd: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0xcc1530bf

   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1            2048  1953525167   976761560   83  Linux

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.

[root@toy ~]# mkfs.xfs /dev/sdd1
meta-data=/dev/sdd1              isize=256    agcount=32, agsize=7630950 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=244190390, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=119233, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
There’s a useful article showing how to handle 4K sector alignment with the parted tool at http://www.ibm.com/developerworks/linux/library/l-4kb-sector-disks/
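If you’d rather not walk through the interactive fdisk session shown above, parted can do the aligned partitioning non-interactively. This is just a sketch, assuming /dev/sdd again and a GPT label rather than the DOS one I used:

# Create a single partition starting at 1MiB, which lands on a
# 4096 byte physical sector boundary. This wipes the drive's partition table.
parted -s -a optimal /dev/sdd mklabel gpt
parted -s -a optimal /dev/sdd mkpart primary 1MiB 100%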
I used XFS for the testing. To take advantage of the write cache, write barriers have to be disabled at mount time. These are my standard XFS mount options for good performance when the write cache can be trusted:
[root@toy ~]# mount -t xfs /dev/sdd1 /ssd -o noatime,nobarrier,logbufs=8
[root@toy ~]# mount | grep ssd
/dev/sdd1 on /ssd type xfs (rw,noatime,nobarrier,logbufs=8)
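To make those options survive a reboot, the matching /etc/fstab entry would look like this, using the /ssd mount point from above:

# /etc/fstab entry for the hybrid drive, write barriers disabled
/dev/sdd1  /ssd  xfs  noatime,nobarrier,logbufs=8  0 0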
Simple performance results
Testing with dd and iostat, sequential write performance peaks at just under 110MB/s. Here are results from iostat -mx 5:
Device:  rrqm/s  wrqm/s    r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdd        0.00    0.00   0.00  234.40    0.00  104.30   911.31   142.32  595.26   4.27 100.02
sdd        0.00    0.00   0.00  245.80    0.00  109.11   909.08   142.67  592.99   4.07 100.00
sdd        0.00    0.20   0.00  241.80    0.00  107.41   909.71   142.96  587.87   4.14 100.00
sdd        0.00    0.00   0.00  244.60    0.00  108.81   911.02   142.87  584.29   4.09 100.00
sdd        0.00    0.00   0.00  213.80    0.00   94.89   909.00   142.84  665.25   4.68 100.02
sdd        0.00    0.00   0.00  216.00    0.00   96.09   911.11   143.07  665.43   4.63 100.00
sdd        0.00    0.00   0.00  214.60    0.00   95.49   911.32   143.10  663.59   4.66 100.00
sdd        0.00    0.20   0.00  230.00    0.00  102.30   910.92   143.31  628.01   4.35 100.00
sdd        0.00    0.00   0.00  242.00    0.00  107.70   911.48   142.89  586.28   4.13 100.00
sdd        0.00    0.00   0.00  242.20    0.00  107.71   910.74   142.79  592.41   4.13 100.02
sdd        0.00    0.00   0.00  245.60    0.00  109.11   909.82   142.98  579.83   4.07 100.00
sdd        0.00    0.00   0.00  242.60    0.00  107.91   910.92   142.88  590.87   4.12 100.02
Note the variation in write rate: it drops suddenly to 95MB/s and then recovers to full speed. That pattern shows up regularly with this drive. Sequential reads of that file run at almost the same speed, but with less variability:
Device:  rrqm/s  wrqm/s     r/s    w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdd        0.00    0.00  213.80   0.00  106.90    0.00  1024.00    12.26   57.33   4.68 100.00
sdd        0.00    0.00  216.20   0.00  108.20    0.00  1024.95    12.24   56.56   4.63 100.00
sdd        0.00    0.00  215.20   0.00  107.50    0.00  1023.05    12.23   56.96   4.65 100.00
sdd        0.00    0.00  215.40   0.00  107.70    0.00  1024.00    12.25   56.79   4.64 100.02
sdd        0.00    0.00  215.40   0.00  107.70    0.00  1024.00    12.24   56.86   4.64 100.00
sdd        0.00    0.00  215.00   0.00  107.50    0.00  1024.00    12.45   56.95   4.65 100.00
sdd        0.00    0.00  209.80   0.20  104.90    0.00  1023.04    12.84   62.13   4.76 100.00
sdd        0.00    0.00  215.00   0.00  107.50    0.00  1024.00    12.24   56.96   4.65 100.00
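The dd commands themselves aren’t shown above; something along these lines is all that’s involved, with the file name and 16GB size being arbitrary choices here:

# Sequential write test; conv=fdatasync makes dd flush before reporting.
# Watch the drive from another terminal with: iostat -mx 5
dd if=/dev/zero of=/ssd/ddtest bs=1M count=16384 conv=fdatasync

# Sequential read of the same file, dropping the page cache first so the
# reads really come from the drive.
echo 3 > /proc/sys/vm/drop_caches
dd if=/ssd/ddtest of=/dev/null bs=1M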
Commit performance and reliability
Initially I tested the commit rate with write barriers on, to see the raw capabilities of the spinning drive. The 16KB block writes produced I/O statistics like this:
Device:  rrqm/s  wrqm/s    r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdd        0.00    0.60   0.00  227.40    0.00    1.89    16.98     0.99    4.34   4.31  98.12
sdd        0.00    0.00   0.00  229.40    0.00    1.90    16.99     0.98    4.28   4.28  98.14
sdd        0.00    0.00   0.00  257.60    0.00    2.14    17.00     0.98    3.81   3.81  98.10
sdd        0.00    0.00   0.00  247.80    0.00    2.06    16.99     0.98    3.96   3.95  97.98
sdd        0.00    0.00   0.00  229.20    0.00    1.90    17.00     0.98    4.27   4.27  97.94
And the average out of a sysbench fsync test was 115.11 requests/s. The logical start of the drive doesn’t do too badly on this test; I was expecting something closer to the 5400 RPM limit of 90/s.
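The sysbench invocation isn’t shown above; the fsync test is the fileio one with an fsync after every write, roughly of this form (the 16KB block size and 10,000 request count match the numbers reported below):

# Build a small test file, then issue 16KB writes with an fsync after
# every one; "Requests/sec executed" is the commit rate.
sysbench --test=fileio --file-num=1 --file-total-size=1G prepare
sysbench --test=fileio --file-num=1 --file-total-size=1G \
  --file-test-mode=rndwr --file-block-size=16384 \
  --file-fsync-freq=1 --max-requests=10000 run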
Remounting with write barriers off, so the drive’s write cache comes into play, gives much better sysbench fsync results:
Operations performed:  0 reads, 10000 writes, 10000 Other = 20000 Total
Read 0b  Written 156.25Mb  Total transferred 156.25Mb  (12.754Mb/sec)
  816.24 Requests/sec executed

Device:  rrqm/s  wrqm/s    r/s      w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdd        0.00    0.00   0.00  1279.20    0.00   10.62    17.00     0.95    0.74   0.74  94.64
sdd        0.00    0.00   0.00  1271.60    0.00   10.56    17.00     0.95    0.75   0.75  95.00
Next I ran diskchecker.pl to see if the write cache was reliable. That managed to get about 200 of its commits per second in:
diskchecker: running 1 sec, 0.07% coverage of 500 MB (22 writes; 22/s)
diskchecker: running 2 sec, 0.27% coverage of 500 MB (85 writes; 42/s)
diskchecker: running 3 sec, 1.23% coverage of 500 MB (397 writes; 132/s)
diskchecker: running 4 sec, 2.02% coverage of 500 MB (656 writes; 164/s)
diskchecker: running 5 sec, 2.84% coverage of 500 MB (924 writes; 184/s)
diskchecker: running 6 sec, 3.73% coverage of 500 MB (1215 writes; 202/s)
diskchecker: running 7 sec, 4.18% coverage of 500 MB (1363 writes; 194/s)
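For anyone who hasn’t run diskchecker.pl before, it needs a second machine to record what the drive claims it has durably written. The procedure is roughly this, with 192.168.1.110 being the listener host that also appears in the verify step further down:

# On the listener (192.168.1.110 here):
./diskchecker.pl -l

# On the machine with the hybrid drive, working inside the /ssd mount;
# pull the power on this machine partway through the run:
./diskchecker.pl -s 192.168.1.110 create test_file 500

# After rebooting, check that nothing the drive acknowledged was lost:
./diskchecker.pl -s 192.168.1.110 verify test_file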
Interestingly, after the plug pull corrupted the filesystem, mounting it again on reboot took a long time:
Jun  2 18:43:19 toy kernel: XFS (sdd1): Mounting Filesystem
Jun  2 18:43:19 toy kernel: XFS (sdd1): Starting recovery (logdev: internal)
Jun  2 18:44:29 toy kernel: XFS (sdd1): Ending recovery (logdev: internal)
That recovery delay seems like an eternity when you’re waiting to see if your disks are intact. And I had to look at /var/log/messages to see this information; it wasn’t shown on the console doing the mount. But this is actually a good sign, because recovering from that sort of crash should take a bit of time when writes were being buffered in temporary flash. Once the system is back up, the drive has to spool that data out to the regular disk, and it may not respond normally until that’s finished. It’s when the data in the write cache goes away altogether that you’re in trouble.
The verification test worked perfectly after this wait:
# ~gsmith/diskchecker.pl -s 192.168.1.110 verify test_file
 verifying: 100.00%
Total errors: 0
There is a lot of variability in the drive’s write rate. Here’s another sample from iostat data collected every 5 seconds, this time when I was writing out 80GB of sysbench test files:
Device:  rrqm/s  wrqm/s    r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdd        0.00    0.00   0.00  218.80    0.00  109.40  1024.00   143.45  658.74   4.57 100.00
sdd        0.00    0.00   0.00  137.60    0.00   68.30  1016.60    84.99  613.93   4.74  65.18
sdd        0.00    0.00   0.00  212.60    0.00  106.20  1023.05   143.72  674.43   4.70 100.02
sdd        0.00    0.00   0.00  150.00    0.00   74.90  1022.64   100.17  735.13   5.23  78.42
sdd        0.00    0.20   0.00  178.04    0.00   88.63  1019.45   124.01  645.06   4.85  86.43
sdd        0.00    0.00   0.00  214.03    0.00  106.91  1023.05   142.43  667.50   4.68 100.06
sdd        0.00    0.00   0.00  134.60    0.00   67.20  1022.49    85.78  628.79   4.85  65.26
sdd        0.00    0.20   0.00  219.20    0.00  109.00  1018.43   143.39  658.30   4.56 100.00
Given that writes are cached in the limited amount of flash on the drive, it’s no surprise that some writes are faster than others as the cache fills. The write speed of the mechanical part of the drive looks close to 70MB/s, which is about right for a 5400 RPM laptop drive. Hitting a very even cut-off, exactly the same speed ceiling again and again, is also a common SSD characteristic. From all of these results it looks like the flash in this drive is tuned to deliver about 110MB/s on both reads and writes.
Results on an 80GB seek-scaling test are terrible, as expected for what’s really a mechanical drive under the hood. I get 0.73MB/s with a single client, and that only grows to 1.15MB/s with 96 of them. You are not going to confuse this drive with a real SSD like Intel’s 320 series when the workload is genuinely random I/O the flash hasn’t cached.
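The seek-scaling numbers come from sysbench random read runs at increasing client counts. The exact invocations aren’t shown here, but they’re approximately this, with the thread count swapped out at each step:

# Lay down 80GB of test files, then hammer them with random 16KB reads;
# rerun with --num-threads set to each client count being measured.
sysbench --test=fileio --file-total-size=80G prepare
sysbench --test=fileio --file-total-size=80G --file-test-mode=rndrd \
  --file-block-size=16384 --max-time=60 --max-requests=0 \
  --num-threads=96 run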
Summary
I was hoping for slightly better commit performance here, but it is a solid 7X faster than a typical 7200 RPM drive. The mechanical performance of this drive is as mediocre as expected from its slowly rotating internals. A quick summary:
| Drive | Write | Read | Commit Rate | Seeks @ 1 Thread | Seeks @ 96 Threads |
|---|---|---|---|---|---|
| Hybrid side | 107.70MB/s | 107.70MB/s | 816.24/s | 0.73MB/s | 1.15MB/s |
| Mechanical | <=70MB/s | 107.70MB/s | 115/s | | |
It’s not a bad set of trade-offs for some inexpensive database servers though. You get 1TB of storage and burst rates up to 110MB/s on both the read and write side, which is better than some cheap SSDs manage. And the handling of writes finally looks reliable, delivering on the promise Seagate’s hybrid drives have held out since their introduction. I couldn’t recommend any of the earlier hybrid drives for database use, but this one is worth considering for your PostgreSQL servers.