Rodney, Val, the VP of our Salt Lake group, reported he couldn’t log in on Saturday. I told him I’d look into it.
Jeff, our reports said we were available all weekend. I’ll figure it out.
I was managing the email team at a large non-profit. We had just installed a new Microsoft Exchange email system. We tracked the availability stats very closely. As I pulled up the reports for the previous weekend, everything was green across the board. Every one of our reports showed that our systems had been up all weekend.
I pulled my engineers into a meeting.
We need to figure out why our reports show us up all weekend while Val, who I’ll remind you doesn’t trust our team anyway, couldn’t log into Outlook Web Access.
It was true: Val and Jeff were rivals for project dollars. Jeff owned the messaging system, and Val wasn’t happy that his group had to rely on our team for its messaging needs.
Like many large organizations, we had technology silos. I owned messaging. One of my peers owned networking. Another owned the directory. While we of course had to work across multiple teams, each team maintained its own availability goals. We each wrote our own tests and tracked our own outages.
By Tuesday afternoon my team had an answer and an explanation.
It’s not our fault.
What do you mean?
Well, Val was trying to come through the web, the OWA gateway, right?
Yeah, but the OWA gateways said they were up all weekend, so what gives?
The gateways were right. Well, our gateways were right. It was actually the web gateways, not the OWA gateways, that had an issue over the weekend. So Val’s request never even reached the OWA gateway. It was blocked at the edge.
So. . .
So it’s not our fault. Our reporting was accurate.
So you are saying we were actually up all weekend?
Sort of. . .I guess.
You want me to go tell Jeff that the email system was available all weekend even though no one could access the email system?
Well, when you put it that way. . .
I don’t blame my team. They had checked the systems under their control and found that everything within their power to fix was working. It wasn’t really fair that they should be dinged for someone else’s failed systems. Their stuff was working and they had the reports to prove it.
But look at it from Jeff’s point of view. Jeff was the CIO. The people he was talking to and reporting to weren’t engineers. They didn’t care that the edge gateways had failed while the email gateways were up and running. All they cared about was that they couldn’t use email over the weekend.
So, what do you think? Were my systems up or not?
To answer that, you have to consider what my systems were designed for. Were they designed to run availability tests. . .or were they designed to deliver email services?
No. We were NOT up over the weekend. I want you to find out from the network guys when their systems were down. Then, I’ll update our reports to reflect our outage.
Wait, but we weren’t down. It was the network.
Tell me, could our customers get to email over the weekend?
. . .
Nope. We were down. The problem was we didn’t even know it. So, I also want you to work with the network team to modify our OWA gateway availability tests so we know the next time we are down.
Okay. What are you going to do?
I’m going to go talk to the network team lead and find out why he kept our customers from accessing email over the weekend.
If your customers can’t get to your services, it’s your problem. . .even if it’s not your fault.
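For what it’s worth, the change I asked my team to make boils down to testing the path our customers actually use, not just the boxes we own. A minimal sketch of that kind of end-to-end check, in Python, might look something like this; the URL, timeout, and success criteria here are illustrative assumptions, not our actual monitoring configuration:

```python
import urllib.request
import urllib.error

# Hypothetical customer-facing address and thresholds, for illustration only.
OWA_URL = "https://mail.example.org/owa/"
TIMEOUT_SECONDS = 10

def owa_reachable_from_outside() -> bool:
    """Probe the same path a customer takes: DNS, the edge web gateway, then OWA.

    A component check that queries the OWA server directly can stay green
    while the edge gateway in front of it is down; probing the public URL
    from outside the data center catches either failure.
    """
    try:
        with urllib.request.urlopen(OWA_URL, timeout=TIMEOUT_SECONDS) as response:
            # Any successful response means the edge and OWA gateways both answered.
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # DNS failure, connection refused, or timeout: the customer is down,
        # no matter which team's equipment is at fault.
        return False

if __name__ == "__main__":
    print("OWA available" if owa_reachable_from_outside() else "OWA DOWN")
```

Run from outside the data center, a probe like this fails when any link in the chain (DNS, the edge web gateway, or the OWA gateway) fails, which is exactly what our component-level checks missed.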
Rodney M Bliss is an author, columnist, and IT consultant. He lives in Pleasant Grove, UT with his lovely wife, thirteen children, and one grandchild.
Follow him on
Twitter (@rodneymbliss)
Facebook (www.facebook.com/rbliss)
LinkedIn (www.LinkedIn.com/in/rbliss)
or email him at rbliss at msn dot com