I’m very thankful for RAID 5 at the moment…gotta love that parity thing. I checked my server status emails this morning, only to find these lines in /proc/mdstat:
md1 : active raid5 sda4[0] sdc4[2] sdb4[3](F)
576283520 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]
Hmm, that little F doesn’t look too promising, and one of the U’s is missing. So I looked into this a bit further and found:
root@rio:/rio# mdadm --detail /dev/md1
/dev/md1:
Raid Level : raid5
Array Size : 576283520 (549.59 GiB 590.11 GB)
Device Size : 288141760 (274.79 GiB 295.06 GB)
Raid Devices : 3
Total Devices : 3
Active Devices : 2
Working Devices : 2
Failed Devices : 1
…
Number Major Minor RaidDevice State
0 8 4 0 active sync /dev/sda4
1 0 0 - removed
2 8 36 2 active sync /dev/sdc4
3 8 20 - faulty /dev/sdb4
Sure enough, after checking /var/log/messages, last night at around 8pm a disk failed…
kernel: ata2: status=0x25 { DeviceFault CorrectedError Error }
kernel: SCSI error : return code = 0x8000002
kernel: sdb: Current: sense key: Hardware Error
kernel: Additional sense: No additional sense information
kernel: end_request: I/O error, dev sdb, sector 18912489
kernel: RAID5 conf printout:
kernel: --- rd:3 wd:2 fd:1
kernel: disk 0, o:1, dev:sda4
kernel: disk 1, o:0, dev:sdb4
kernel: disk 2, o:1, dev:sdc4
kernel: RAID5 conf printout:
kernel: --- rd:3 wd:2 fd:1
kernel: disk 0, o:1, dev:sda4
kernel: disk 2, o:1, dev:sdc4
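One note for anyone hitting the same thing: before the replacement disk goes in, the dead drive’s partitions have to come out of their arrays (mdadm won’t remove an active member, but a faulty one comes right out, and an active one can be marked failed first). The syntax is just a --fail/--remove per array; the device names below are from my setup, so adjust for yours:
root@rio:/rio# mdadm /dev/md0 --fail /dev/sdb1      # mark md0's copy failed first if md hasn't already
root@rio:/rio# mdadm /dev/md0 --remove /dev/sdb1
root@rio:/rio# mdadm /dev/md1 --remove /dev/sdb4    # md1 already flagged sdb4 as faulty, so it comes straight out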
I’m a bit surprised, because the drives I used for this RAID are made by Seagate, which I’ve had good luck with in the past. Fortunately, Seagate offers a five-year warranty on all of its drives, so this one is going back to the manufacturer to be replaced. In the meantime, I ordered another disk with overnight shipping, since I need to take care of this before leaving for WWDC on Saturday. 🙂
Update (8/4): The replacement disk arrived yesterday afternoon and I was able to re-add its partitions to the RAID volumes using mdadm <raid volume device> --add <disk device>; I’ve summarized the rough command sequence after the mdstat output below. Rebuilding went pretty quickly: /usr finished in less than a minute and the larger volume took just over an hour and a half:
Personalities : [raid5]
md1 : active raid5 sdb4[3] sda4[0] sdc4[2]
576283520 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]
[>....................] recovery = 2.7% (8006784/288141760) finish=98.3min speed=47479K/sec
md0 : active raid5 sdb1[1] sda1[0] sdc1[2]
5863552 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
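For completeness, getting back to a clean array once the new drive is installed is basically: partition it to match the other disks, then --add each partition back and let md rebuild. Something like the following does the job; the sfdisk line is just one convenient way to clone the partition layout, and it assumes the new disk is the same size and comes up as /dev/sdb like the old one did:
root@rio:/rio# sfdisk -d /dev/sda | sfdisk /dev/sdb   # clone the partition table from a healthy disk
root@rio:/rio# mdadm /dev/md0 --add /dev/sdb1         # add the partitions back to their arrays
root@rio:/rio# mdadm /dev/md1 --add /dev/sdb4
root@rio:/rio# watch cat /proc/mdstat                 # keep an eye on the rebuild progress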
RAID 5 saves the day. Now just to add ZFS to your levels of protection.
dl
Drives are drives… regardless of brand they all end up failing eventually.
ZFS would be very nice, but since this server is already in production, I’d need something easy to migrate to while running Linux. Now if ZFS were ported to Linux, that would be cool. 🙂 I think there’s actually an open source project working on this.
As for the drive failure, it’s very true that all drives end up failing eventually. I’m just surprised this one failed so quickly; it’s only about nine months old.