
Why Does It Take So Long To Fix IT Problems?

October 11, 2017

Twenty-six days.

It took my team 26 days to resolve our latest issue. We resolved it yesterday. The problem first appeared on September 15. Initially we thought it was going to be easy.

On the night of September 14, we updated to the latest version of software for one of our core systems. While Windows pretty much forces you to install updates, and Apple products really encourage you to, with large enterprise systems it's not quite so easy.

I used to work for Microsoft when my mother owned a CPA firm. I was not only her tech support, but because she was family, I was her software update service. Every time Microsoft updated their software, we would have the same conversation.

Rodney, I see the new version of Windows is out. Should we update all our computers?

Mom, what does the new version do that your current version doesn't?

I don’t know.

Is your current version causing you any problems?

No. It’s been running great.

Until the new version gives you something the current version doesn’t, you shouldn’t upgrade.

Enterprise systems are sometimes like that. But, at the beginning of September we had a different issue. Our agents were having random problems with their tools during busy times. We spent days working on it and narrowed it down to one of our enterprise systems. The manufacturer suggested that we upgrade to the latest version of the software.

We did. It fixed our first problem and introduced the second problem.

Now, this seems like it should be easy to diagnose, right? We updated this piece of software and a problem showed up. It must mean the problem is with this software, right?


Sure, that’s the first thing we checked, but when you are dealing with big complex systems, there are lots of parts that have to work with each other. The parts are often supplied by different vendors. You might have a phone system from Avaya. Network hardware from Cisco. Telecommunication circuits from Verizon, AT&T, Sprint and your local telco. Desktop computers from Lenovo. Desktop software from Microsoft. And the list goes on and on.

For our problem that started last month, the first thing we did was check the software we just updated. Except it came back completely clean. Not a ripple.

Next we started trying to isolate the problem geographically. It was only affecting my users in Savannah. The other three sites were fine. It must be a problem with that location, right?

Wrong. Each site is configured within the system with a series of group designations. If we took a user in Salt Lake City and reclassified them within the system as being in Savannah, they suddenly also had the problem.

We have a partnership with our client. They own part of the infrastructure and we own part. We started holding daily calls. My engineers would get on the phone with their engineers and we would invite engineers from the company that makes our software.

Didn’t I just say it wasn’t the software? Yes, but all we did was verify that the new update was working correctly. We knew that something was broken, but we didn’t know if it was on our side, the client side, or the vendor side.

We met every day at noon.

I used to be an engineer. I used to be smart. I’m now a manager and the engineers quickly lose me when they start talking technical. But, I was on the calls because ultimately I am the person responsible for our IT infrastructure. The VPs come to me and say,

Rodney, what’s the status of the issue in Savannah?

Well, we met again today. We sent the vendor more logs of both good calls and bad calls. We have another meeting tomorrow.

Okay, keep us posted.

As I said, yesterday, after weeks of study and hundreds of hours of engineering time invested, we fixed it. The solution was a single setting in one of our group configurations. The problem had existed for months. In fact, it was a holdover from a project that we tested and then cancelled last year. This location was brought online during the period we were testing that project.

Because the project was ultimately cancelled, none of our other group configurations used that same setting. The older version of the software was not as strict about letting a badly formed packet get through. The updated software started checking that setting, and that's why it failed.

In hindsight, the problem is pretty easy to identify. And it was simple to fix. In fact, the engineers fixed it in the middle of the day and everything started working. But it took weeks of careful study to spot the differences between a good group config and the problem one.
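The hunt boiled down to comparing a known-good group configuration against the broken one, setting by setting. Here is a minimal sketch of that idea in Python. The setting names and values below are invented for illustration; the real system's configuration keys were not disclosed in the story.

```python
def diff_configs(good, bad):
    """Return the settings whose values differ between two config dicts."""
    keys = set(good) | set(bad)
    return {k: (good.get(k), bad.get(k)) for k in keys
            if good.get(k) != bad.get(k)}

# Hypothetical group configs: identical except for one leftover setting
# from a cancelled pilot project.
good_group = {"codec": "g711", "pilot_routing": "off", "route": "primary"}
problem_group = {"codec": "g711", "pilot_routing": "legacy", "route": "primary"}

print(diff_configs(good_group, problem_group))
# {'pilot_routing': ('off', 'legacy')}
```

With hundreds of settings spread across several groups and vendors, even a simple diff like this is only useful once you know which configs to compare; that narrowing-down is where the weeks went.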

A hotel in Albany, New York lost its furnace in the middle of the winter. The manager called an emergency repair service, and the technician arrived soon thereafter. After the manager explained the issue, the repairman descended into the basement and opened up his toolbox. The only tool inside was a large rubber mallet. The repairman carefully measured a certain distance down the ductwork and, when he'd found the right spot, gave it a tremendous blow with the hammer. The furnace immediately started up and the manager was thrilled.

Two weeks later he got a bill for $10,000. He called the repair company, “That’s outrageous. I was there the whole time. All you did was smack it with a hammer! I want an itemized bill.” Two weeks later the new bill arrived:

Hitting furnace with hammer: $5
Knowing where to hit: $9,995

If you know where to hit with the hammer, it’s easy. The tricky part is figuring out where. That part takes a bit longer.

Rodney M Bliss is an author, columnist and IT Consultant. His blog updates every weekday. He lives in Pleasant Grove, UT with his lovely wife, thirteen children and grandchildren. 

Follow him on
Twitter (@rodneymbliss),
Facebook,
LinkedIn,
or email him at rbliss at msn dot com

(c) 2017 Rodney M Bliss, all rights reserved 
