Double disk failures – A storage nightmare

Anyone who has worked with storage systems, or even large personal installs, has heard of them. Double disk failures. Words you never say. Ever! You can be banished from the server room for even suggesting it is possible!

But the reality is that it can, and does, happen. It is why we have hot spare disks, or even hot swappable drives. I’m even looking at an array by NetApp which has something called RAID-DP, or double parity, which, they say, can handle two separate disk failures without taking down the array. Something that sounds very interesting really. The Dell / EqualLogic array we have on test currently runs in a type of RAID 50, so you can lose two disks but only if they are in separate RAID 5 sets. The other two disks are running as hot standby disks to allow for online rebuilding.
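To make the difference between those layouts a bit more concrete, here is a rough Python sketch that counts how many two-disk failure combinations each one survives. The disk counts and group sizes are made up for illustration (they are not the real NetApp or EqualLogic configurations), and the model is deliberately simple: an array survives as long as no parity group loses more disks than it has parity for. Hot spares and rebuild times are ignored.

```python
from itertools import combinations


def survives(groups, failed, parity_per_group=1):
    """True if no group has lost more disks than it has parity for."""
    return all(len(failed & set(group)) <= parity_per_group for group in groups)


def surviving_fraction(groups, n_failures, parity_per_group=1):
    """Fraction of all n_failures-disk combinations the layout survives."""
    disks = [disk for group in groups for disk in group]
    combos = list(combinations(disks, n_failures))
    ok = sum(survives(groups, set(combo), parity_per_group) for combo in combos)
    return ok / len(combos)


# Hypothetical layouts, chosen only for illustration.
raid5 = [list(range(8))]                            # one 8-disk RAID 5 set
raid50 = [list(range(0, 7)), list(range(7, 14))]    # two 7-disk RAID 5 sets, striped
double_parity = [list(range(14))]                   # one 14-disk double parity set

print(surviving_fraction(raid5, 2))                               # 0.0
print(round(surviving_fraction(raid50, 2), 3))                    # 0.538
print(surviving_fraction(double_parity, 2, parity_per_group=2))   # 1.0
```

In this toy model a plain RAID 5 never survives a second failure, RAID 50 survives it only when the two failed disks land in different sets (49 of the 91 possible pairs here), and double parity always does.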

The setup

My current dilemma, we’ll call it, is with a much simpler setup: an Intel based server with 8 SATA disks connected to a 3ware card doing RAID 5. It is a high end 3ware card too, a 9650. (I do NOT recommend these cards. We have numerous other performance issues with them in both Windows and Linux, the Windows ones being much worse and currently stopping me copying backups.) Anyway, to make things a little more challenging, and every admin loves a challenge in their day, the server is remote. In another country remote.

Anyway, this machine has been running fine for nearly a year. RAID array sitting there taking files happily enough. When I started testing some further backups recently, I ran into some trouble. Most of it looked to be Windows related, so the usual: apply the updates, reboot the machine and see what happens. Only on the first reboot, wham, disk 8 offline. OK, so I’ll finish the updates and then worry about getting another disk over to be put into the machine. Next reboot, the disk comes magically back online but the array is in a degraded state. Strange. We’ll let this rebuild and return tomorrow, see if life has returned to normal.

Normal is just a cover

After sleeping on things and letting the array rebuild, everything looks to be great: just a temporary problem that we can forget about and move on from. Never a good idea, but when you are overworked, what can you do?

Another day passes trying to move backups across and we hit another Windows error, this time requiring a registry fix to increase the IRPStackSize. So I bang in the first change and reboot. Log in and, strange, the system appears to be locking up. Open the 3ware webpage and get presented with something I’d not seen until now.

Raid Status: Inoperable

Luckily these are backups, no live data lost. We can fix this. Hell, let’s try a reboot and see. Can’t do any more damage, can it?

The Recovery?

Rebooting fixes disks, magically. Both disks back online. Array in a consistent state. Why not leave well enough alone?

More Windows problems and another reboot. Back to two disks offline. Reboot again and one disk gone. Useless, useless, useless.


If this was a live server, with live data? I’d probably cry. There’d not be much else to do. You could probably have it rebuild by replacing the disk that was going offline the most, but I’d move as much off as quickly as possible. In this case, since it is a backup server, I’ll be getting the guys local to the machine to remove and reseat all the drives, and check the cables inside the case. Then I’ll destroy and recreate the array and the filesystem, with full formats all around.

And then to top it off, 10 reboots, minimum, when the server isn’t looking! If they all work, then maybe, just maybe I’ll look at trusting it again. Any problems and I guess I’m on a plane :(

Lessons learned

Well, I think I’ll be putting the really critical data onto more than one backup server in future. At least more of the fileserver data anyway. The massive Exchange backups will need to be looked at.

Enterprise level SANs are cheaper than you think when you factor in the cost of fixing setups like this. Okay, so you aren’t going to be able to get a SAN for twice the price of a server with 16x1TB drives in it, or even three times. You may get a low spec’d one however, and if it gives more peace of mind, maybe that is worth the cost? I know that if faced with the decision in future, I’ll probably recommend a SAN and attached server for a file server, assuming it is above the 1TB mark. Lower than that, you can probably get away with multiple servers, replication software AND backups. Replication software is NOT backup software. Delete from live and it deletes from the replica too.
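The replication point is easy to show with a toy example. The Python sketch below is not modelled on any particular replication or backup product; the file names and the replicate and take_snapshot helpers are invented for illustration. It mirrors a "live" set of files to a replica, takes a point-in-time copy, then deletes a file by accident and syncs again.

```python
import copy


def replicate(live, replica):
    """Make the replica an exact mirror of live, deletions included."""
    replica.clear()
    replica.update(live)


def take_snapshot(live, snapshots):
    """Keep an independent point-in-time copy of live."""
    snapshots.append(copy.deepcopy(live))


# Hypothetical file server contents.
live = {"report.doc": "v1", "budget.xls": "v1"}
replica = {}
snapshots = []

replicate(live, replica)
take_snapshot(live, snapshots)

del live["budget.xls"]      # the accidental delete
replicate(live, replica)    # the next sync faithfully copies the mistake

print("budget.xls" in replica)        # False: gone from the replica too
print("budget.xls" in snapshots[0])   # True: still in the point-in-time copy
```

The replica is only ever as good as the current state of live, which is why it protects against a dead server but not against a deleted file.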

And what now?

I don’t know. All I can hope is that reseating disks and cables fixes the array, gets it online and lets me start transferring backups offsite. Another box is going to be added to give more backups, hopefully point in time backups too.

Backups really are the largest cost for something you will never use. I do honestly hope I never have to pull any data from backups, ever. It is possible, what with Volume Shadow Copies on file servers and RAID disks in the servers. And maybe real permissions for applications, but that is another day!

Dell overheating problems, Windows Search and Acronis restore

So it seems that Dell, or more so NVIDIA, have some overheating problems with some of the GPUs. My D630c had been running really hot for quite a while and the fan was going a bit nuts during Windows startup until last weekend… when the system decided to put random characters onscreen and die. As with all things, the laptop booted up fine on the Tuesday when I called Dell, however running the system diags did reproduce the problem. While testing further, a hard drive problem popped up so they agreed to change the disk. When I mentioned the heat problem, I was quickly put on hold for a few minutes, then they came back and said they were replacing the motherboard and fans. A quick Google did show up a few things about the failing GPUs.

Anyway, Dell did replace everything and things have been working fine since. The replacement hard disk is a bit louder than the last one but it works, so I’m finally getting back to normal. It has taken nearly a week to get everything restored, mostly due to Acronis being unable to restore large files individually. It kept getting stuck about 1.8 GB into the large files on my laptop. Doing a full disk restore worked fine.

The other annoying issue is that Windows Search stops working in Outlook after installing Exchange Administrator. Easy fix however. Close Outlook, run System32\fixmapi.exe, then open Outlook again and let the search reindex everything. You may have to open the Windows Search options, select Outlook, then hit rebuild on the index.

Dell Keyboard layouts – why do they change them?

It used to just be a case where certain models didn’t have a keyboard I liked but others would, but now looking at the Dell site, none of the laptops have a keyboard I like. And it isn’t like I’m after some crazy combination. All I want is a machine with the normal Irish keyboard. Even Wikipedia agrees with the format. Same as my D630c.

We currently order Vostro 1510 machines as standard. This problem might have started when they messed up that keyboard by putting the whole bottom line in the wrong place, but the current keyboard model is still closer than anything you see on the website. The only differences are that the left shift key is bigger, the right one is smaller, oh and the backslash (\) key is on the far right instead of the left. A completely useless layout for anyone who uses the keyboard all the time for coding or working on Linux.

Worse than all this is the trend to make the Enter / Return key smaller like the American keyboards. For US people, fine, keep it small since they are used to it, but don’t go trying to force random keyboard changes on us. Hell, even the XPS that is on my desk has another layout.

Edit: So the new Vostro 1520 has a normal keyboard, or so it looks until you start typing. The bottom line suffers from a smaller than normal ctrl key, meaning the left hand side keys (ctrl, fn, windows, alt) are slightly to the left. Not a huge problem and I’d take it over the older problem, but still a problem. Also the keyboards on this model are bouncy. Yes, bouncy. The whole keyboard moves when you press the keys in the center.