When Failing Gracefully Just Won’t Do
Our computer system was behaving exactly as it was supposed to, and it was starting to get annoying.
If you think about it, computers are really good at not dying. I don’t mean your computer or smartphone. Or mine. Those die at the most inopportune times. But enterprise-level systems are pretty good at staying alive.
We have been working on fixing a technical issue at my job for ten days. That’s an eternity in software terms. You figure you can rebuild an entire server in four hours. You can ship new replacement hardware across the country in a single day. You can typically rebuild an entire system in a week.
Ten days is forever. It’s a complex problem, of course. I have four locations and my problem only shows up in one of them. You’d think it was location-based. But if I reclassify one of my Salt Lake City agents and tell the system they are a New Orleans agent, then they have the problem too.
I’ve done my share of troubleshooting over the years. I don’t do much anymore. IT is a young man’s game. Not as young as when I was starting, fortunately, but still, I’m supposed to provide leadership. We have engineers much smarter than me who do the heavy lifting.
And, of course, we will figure it out. It’s only hardware and software. Replace enough pieces and eventually you’ll find the bit that’s broken. Our current problem is that we think the bit that’s broken might be in the memory of one of our systems. Normally, that wouldn’t be hard to fix. You probably regularly reboot your phone or your computer. Reboot, and the stuff in memory is cleared out.
That’s where systems are different than computers. Most of my systems are redundant. I have two of everything. I even have two of my groups of two. The entire system is designed to NOT DIE. If one server crashes, there’s another one to gracefully take over. If a circuit goes down, we automatically reroute to a different carrier. We never want to lose our current data.
And that’s my problem. If I reboot one of my servers, the other server will automatically take over, transferring everything in memory to itself. When the first server reboots, the second one will hand back everything that was in memory. Kind of like saying, “Here, hold my coat while I go lie down for a while.” When you come back, your coat is fully intact, including the money in the billfold and the stinky tuna sandwich you forgot from lunch... yesterday.
But what if you want to get rid of that stinky tuna sandwich? That’s when failing gracefully just won’t do. Next week we will have to reboot both servers at the same time. (We’ll be killing you and the guy who held your coat.) We’ve decided it’s the only way to clear out what’s in active memory.
It is surprisingly hard to do. I didn’t think I would be complaining that the systems don’t crash enough. They fail too gracefully.
Rodney M Bliss is an author, columnist and IT Consultant. His blog updates every weekday. He lives in Pleasant Grove, UT with his lovely wife, thirteen children and grandchildren.
Follow him on
Twitter (@rodneymbliss)
Facebook (www.facebook.com/rbliss)
LinkedIn (www.LinkedIn.com/in/rbliss)
or email him at rbliss at msn dot com
(c) 2017 Rodney M Bliss, all rights reserved