Double disk failures – A storage nightmare

Anyone who has worked with storage systems, or even large personal installs, has heard of them. Double disk failures. Words you never say. Ever! You can be banished from the server room for even suggesting it is possible!

But the reality is that it can, and does, happen. It is why we have hot spare disks, or even hot swappable drives. I’m even looking at an array by NetApp which has something called DP, or Dual Parity, which, they say, can handle two separate disk failures without taking down the array. Something that sounds very interesting really. The Dell / EqualLogic array we have on test currently runs in a type of RAID 50, so you can lose two disks, but only from separate arrays. The other two disks are running as hot standby disks to allow for online rebuilding.
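To make the difference between those layouts concrete, here is a rough sketch of the "which failures can it survive" rule. It is only a model: the 8-disk numbering and the split into two 4-disk sets are assumptions for illustration, not how any particular vendor lays out its arrays.

```python
def survives(layout, failed):
    """Return True if the array is still readable after the given disks fail.

    layout: list of (disk_ids, parity_count) groups that are striped together.
            Every group has to survive for the whole array to survive.
    failed: set of failed disk ids.
    """
    failed = set(failed)
    return all(len(failed & set(disks)) <= parity for disks, parity in layout)


# Hypothetical 8-disk layouts (disk numbering is made up for this example).
raid5 = [(range(1, 9), 1)]                      # single parity across 8 disks
dual_parity = [(range(1, 9), 2)]                # two disks' worth of parity
raid50 = [(range(1, 5), 1), (range(5, 9), 1)]   # two 4-disk RAID 5 sets, striped

print(survives(raid5, {7, 8}))        # False: two failures kill single parity
print(survives(dual_parity, {7, 8}))  # True: dual parity tolerates any two
print(survives(raid50, {1, 8}))       # True: one failure in each set
print(survives(raid50, {7, 8}))       # False: both failures in the same set
```

Hot spares are not modelled here. They shorten the time an array spends degraded, but they do not change how many simultaneous failures a layout can take.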

The setup

My current dilemma, we’ll call it, is with a much simpler setup: an Intel based server with 8 SATA disks connected to a 3ware card doing RAID 5. It is a high end 3ware card too, a 9650. (I do NOT recommend these cards. We have numerous other performance issues with them in both Windows and Linux, the Windows ones being much worse and currently stopping me copying backups.) Anyway, to make things a little more challenging, and every admin loves a challenge in their day, the server is remote. In another country remote.

Anyway, this machine has been running fine for nearly a year. RAID array sitting there taking files happily enough. When I started testing some further backups recently, I ran into some troubles. Most of it looked to be Windows related, so the usual: apply the updates, reboot the machine and see what happens. Only on the first reboot, wham, disk 8 offline. Ok, so I’ll finish the updates and then worry about getting another disk sent over to be put into the machine. Next reboot, the disk comes magically back online, but with the array in a degraded state. Strange. We’ll let this rebuild and return tomorrow, see if life has returned to normal.

Normal is just a cover

After sleeping on things and letting the array rebuild, everything looks to be great, just a temporary problem that we can forget about and move on from. Never a good idea, but when you are overworked, what can you do?

Another day passes trying to move backups across and we hit another Windows error, this time requiring a registry fix to increase the IRPStackSize. So I bang in the first change and reboot. Log in and, strange, the system appears to be locking up. Open the 3ware webpage and get greeted with something I’d not seen until now.

Raid Status: Inoperable

Luckily these are backups, no live data lost. We can fix this. Hell, let’s try a reboot and see. Can’t do any more damage, can it?

The Recovery?

Rebooting fixes disks, magically. Both disks back online. Array in a consistent state. Why not leave well enough alone?

More Windows problems and another reboot. Back to two disks offline. Reboot again and one disk gone. Useless, useless, useless.

Solutions…

If this was a live server, with live data? I’d probably cry. There’d not be much else to do. You could probably have it rebuild by replacing the disk that was going offline the most, but I’d move as much off as quickly as possible. In this case, since it is a backup server, I’ll be getting the guys local to the machine to remove and reseat all the drives. And check the cables inside the case. And then destroy and recreate the array, and the filesystem, with full formats all around.

And then to top it off, 10 reboots, minimum, when the server isn’t looking! If they all work, then maybe, just maybe I’ll look at trusting it again. Any problems and I guess I’m on a plane 🙁

Lessons learned

Well, I think I’ll be putting the really critical data onto more than one backup server in future. At least more of the fileserver data anyway. The massive Exchange backups will need to be looked at.

Enterprise level SANs are cheaper than you think when you factor in the cost of fixing setups like this. Okay, so you aren’t going to be able to get a SAN for twice the price of a server with 16x1TB drives in it, or even three times. You may get a low spec’d one however, and if it gives more peace of mind, maybe that is worth the cost? I know that if faced with the decision in future, I’ll probably recommend a SAN and attached server for a file server, assuming it is above the 1TB mark. Lower than that, you can probably get away with multiple servers, replication software AND backups. Replication software is NOT backup software. Delete from live, deletes from backup.
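That last point is easy to show with a toy model. This is only a sketch: the file names and the dictionaries standing in for a file share are made up, and real replication or backup products are obviously far more involved.

```python
def replicate(live):
    """Mirror the live share: the replica always matches the current state."""
    return dict(live)


def snapshot(live, history):
    """Keep a point-in-time copy alongside the earlier ones."""
    history.append(dict(live))


live = {"budget.xls": "v1", "report.doc": "v1"}  # hypothetical file share
history = []

snapshot(live, history)     # nightly backup taken while the file still exists
replica = replicate(live)

del live["report.doc"]      # someone deletes a file from live
replica = replicate(live)   # the next replication pass copies the deletion

print("report.doc" in replica)      # False: replication deleted it too
print("report.doc" in history[0])   # True: the earlier backup still has it
```

The replica is only ever as good as the current state of live, which is exactly why it cannot stand in for point in time backups.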

And what now?

I don’t know. All I can hope is that reseating disks and cables fixes the array, gets it online and lets me start transferring backups offsite. Another box is going to be added to give more backups, hopefully point in time backups too.

Backups really are the largest cost for something you will never use. I do honestly hope I never have to pull any data from backups, ever. It is possible, what with Volume Shadow Copies on file servers and RAID disks for servers. And maybe real permissions for applications, but that is another day!
