The engineer thought he knew what he was doing. The task was simple enough: swap out some hard drives on the server. He wasn’t actually an engineer. That was part of the problem. He was a student studying to be an engineer. It was his lack of experience that got him assigned this task.
Hot-swappable drives are designed to be replaced “on-the-fly,” while the server is still running. Unfortunately, our not-an-engineer didn’t fully understand what that meant. He thought he did. He was wrong. And his mistake had drastic consequences.
I recently had some computer trouble. (Why My Kids Have No Shoes.)
Computers are kind of like cars. They all include the same basic pieces, but you can customize them a million different ways. The problem with my computer was that the hard disk that had the operating system installed, what we would typically call the “C” drive on a PC, was failing.
My lovely wife is still angry that 3 years ago I failed to back up one of my hard drives and lost irreplaceable pictures. The hard drive is currently in a plastic bag in my freezer, like some low-tech cryogenics lab, waiting for one of my truly technical friends to make a try at recovering the pictures.
So, backups, which IT people are terrible at doing for themselves, have become kind of a big deal for me. That’s why my hard drive failing didn’t bother me in the least: my server has two hard drives. (Actually it has five, but I’ll explain that in a minute.) One hard drive, the C drive, holds the operating system. The other hard drive holds all the data, including more pictures. My C drive is a single point of failure, meaning that if it breaks, I have no backup. I have to delete everything and start from scratch. But that’s okay. I don’t store any data on the C drive. I install from the Windows DVD and that’s about it. It’s a 1 terabyte drive. (A terabyte is 1,000 gigabytes, or 1,000,000 megabytes.) Most of it is empty, but I don’t care, because the really big drive is the data drive. We’ll call it drive E.
Drive E is what’s called a RAID array. Specifically, it’s a RAID 5 array. It has four disks that all share the same space. Each disk is 2 TB. Drive E shows up as just over 5 TB in size. You probably noticed that four drives times 2 TB per drive should equal 8 TB, not 5. That’s where RAID, and my ability to stay out of my wife’s doghouse, come into play.
A RAID 5 array sets aside a little bit of each drive to protect the data on the other drives. It doesn’t keep full copies. Instead, it stores recovery information, called parity, spread across all four disks, so that whatever was on disk 1 can be worked out from what is on disks 2, 3 and 4. The reason for this is that disks fail, just as my C drive is failing. With the C drive, I might end up replacing the entire drive. And if I had to do that, all data would be lost.
But on the RAID drive E, if I lost a disk, I wouldn’t lose any data. If disk 1 fails, then Windows will go to the shared space on 2, 3 and 4 and rebuild what was on disk 1 from the recovery information (called parity) stored there. That recovery information takes space that you can’t use for other things. That’s why an 8 TB array doesn’t give you 8 TB of available space. One disk’s worth, 2 TB, goes to parity. The remaining 6 TB shows up in Windows as a bit over 5 TB, because Windows counts a terabyte in powers of 1,024 rather than 1,000. So, although I had to reinstall the operating system on drive C, when I did, I linked to drive E and all my data, including those pictures, was there waiting for me.
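For anyone who wants to check the math, here is a small Python sketch of the usual RAID 5 rule of thumb: the array gives up one disk’s worth of space to recovery data. It assumes the drive label counts a terabyte as 10**12 bytes while the operating system counts in units of 2**40 bytes. The function and variable names are made up for illustration.

```python
# Rough RAID 5 capacity arithmetic for a four-disk, 2 TB-per-disk array.
DISKS = 4
DISK_SIZE_TB = 2  # decimal terabytes, as printed on the drive label


def raid5_usable_bytes(disks, disk_size_tb):
    """Usable bytes: one disk's worth of space is given up to parity."""
    if disks < 3:
        raise ValueError("RAID 5 needs at least three disks")
    return (disks - 1) * disk_size_tb * 10**12


raw_tb = DISKS * DISK_SIZE_TB
usable = raid5_usable_bytes(DISKS, DISK_SIZE_TB)
usable_tb = usable / 10**12        # counted the drive maker's way
usable_binary = usable / 2**40     # counted in units of 2**40 bytes

print(f"raw: {raw_tb} TB")
print(f"usable: {usable_tb:.0f} TB (decimal)")
print(f"usable, counted in units of 2**40 bytes: {usable_binary:.2f}")
```

The last figure comes out between 5 and 6, which lines up with a drive E that shows as just over 5 TB.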
Hot-swappable drives take the RAID concept one step further. In order to replace my drives, I have to turn off the server, pop open the case, unhook the wires and undo the screws holding the disk in place. With a hot-swappable drive, I can replace it by pulling it out of the server without all the rest of that work. When I pull out a drive, the operating system notices it’s gone and keeps running by reconstructing the missing data from the shared spaces. Likewise, if I add a new drive, the OS notices and rebuilds the data onto the new drive as well.
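What the array does when a drive disappears can be sketched in a few lines of Python. RAID 5 is commonly built on XOR parity, so this toy example assumes that scheme. The disk contents and the helper name are invented, and a real array works on large blocks and rotates the parity across all the disks rather than keeping it on one.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, one per disk (toy contents, all the same length).
disk1 = b"pics"
disk2 = b"docs"
disk3 = b"mp3s"

# The fourth block holds the parity of the other three.
parity = xor_blocks([disk1, disk2, disk3])

# Disk 1 dies. XOR the survivors with the parity to get its contents back.
rebuilt = xor_blocks([disk2, disk3, parity])
print(rebuilt)  # b'pics'
```

The same trick works for any single missing block, which is why the array can lose one disk, but only one, and carry on.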
All of this was understood by our not-an-engineer. And, since he needed to replace all the drives in the array, he started pulling out old drives and replacing them with the new ones. So far, so good. Except... the OS needs a little time to redistribute the files, especially when a drive is added. Time that the not-an-engineer didn’t give it. Depending on the size of your datastore, the copying process could take 15 minutes, or it could take hours. Until it finishes, an array like mine can’t afford to lose a second disk. It will finish eventually. Just give it time.
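Here is a toy Python model of why the waiting matters. It assumes an array that can survive one missing or half-rebuilt disk but not two, which is the RAID 5 behavior. The class and function names are hypothetical, and the rebuild is treated as instant once you wait for it.

```python
class ToyRaid5:
    """Toy model: the array survives one missing or rebuilding disk, not two."""

    def __init__(self, disks):
        self.healthy = disks   # disks holding a full, current share of the data
        self.rebuilding = 0    # new disks still being filled in
        self.failed = False

    def swap_one_disk(self):
        """Pull a healthy disk and push a blank one into its slot."""
        if self.failed:
            return
        if self.rebuilding > 0:
            # A second disk left before the first rebuild finished.
            self.failed = True
            return
        self.healthy -= 1
        self.rebuilding = 1

    def wait_for_rebuild(self):
        """Give the array time to finish filling in the new disk."""
        if not self.failed and self.rebuilding:
            self.healthy += self.rebuilding
            self.rebuilding = 0


def replace_all(disks, patient):
    """Swap every disk in turn; return True if the array survives."""
    array = ToyRaid5(disks)
    for _ in range(disks):
        array.swap_one_disk()
        if patient:
            array.wait_for_rebuild()
    return not array.failed


print(replace_all(4, patient=True))   # True: the array survives
print(replace_all(4, patient=False))  # False: the array is lost
```

The only difference between the two runs is the wait between swaps.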
Unfortunately, our not-an-engineer missed that small detail in his education on hot-swappable drives. He started replacing drives as quickly as he could pull them out and push the new ones in.
The server tried to keep up. It really did. But, finally, it succumbed with a crash that set off alarm bells all over our Network Operations Center. The drives were trashed. Physically they were fine. But, the file allocation tables were so much digital confetti. We ended up restoring from tape.
I don’t know what became of our not-an-engineer. He may have gone on to a successful career in IT. And, if this experience encouraged him to read the instructions better, then it wasn’t a total loss. For my part, I was just glad the affected array didn’t hold any of my family pictures.
Rodney M Bliss is an author, columnist and IT Consultant. His blog updates every weekday at 7:00 AM Mountain Time. He lives in Pleasant Grove, UT with his lovely wife, thirteen children and grandchildren.
Follow him on
Twitter (@rodneymbliss)
Facebook (www.facebook.com/rbliss)
LinkedIn (www.LinkedIn.com/in/rbliss)
or email him at rbliss at msn dot com
(c) 2015 Rodney M Bliss, all rights reserved