I have one job at my job. Well, okay, maybe I have a lot of responsibilities at my job. But, I have one metric that I’m responsible for. It’s called System Up Time (SUT). SUT is pretty easy to calculate. We figure out how many hours our reps are on the phone in a month. Then, we figure out how many hours we weren’t able to take calls. The share of hours we were able to take calls is our SUT.
Sometimes an outage is the fault of my client. After all, they own large portions of the infrastructure. They own the tools. They own the customer validation process. There are lots of moving parts on their side. If we have an outage and it’s their fault, my team keeps track of our lost time, but I don’t have to account for it in my reports.
But, we also occasionally have outages that are our fault. We have local computers. We have things like power, internet connections, our own tools. When an outage is my fault, I have to keep track of every hour, or really every minute, since that’s how we track it. At the end of the month I have to make an accounting. If we have too many lost minutes, we end up paying a penalty.
Let me talk about scale for just a minute. We have call centers all across the country. A call center can have anywhere from 250 to 700 agents. If I have a service interruption at a large center and it lasts even 1 minute, that’s one minute times seven hundred people. Seven hundred minutes is about 12 hours of down time. The math gets a little more complex. For example, we typically don’t have all 700 on the phones at the same time. And it’s rare that an outage interrupts 100% of our agents’ ability to do their jobs.
But, our outages are also typically not just one minute long. Outages are 5 minutes or less on the low end, and multiple hours on the high end. And those minutes add up. A company with several thousand agents might have ten million agent minutes per month. Take that earlier 1 minute outage for 700 agents. Out of ten million minutes, that puts us at about 99.99% available. That’s a pretty good number. But, suppose my outage was 10 minutes long instead? Well, now you’re at 99.93%.
Suppose you’re out for an hour? 99.58%. The time adds up very quickly.
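If you want to check that arithmetic yourself, here is a rough sketch in Python. The function name is made up for illustration, and it assumes every one of the 700 agents is affected for the entire outage, which, as noted above, is rarely the case in real life. The ten million figure is just the round number from the example.

```python
def availability_pct(lost_agent_minutes, scheduled_agent_minutes):
    """Percentage of scheduled agent-minutes that were actually available."""
    return 100.0 * (1 - lost_agent_minutes / scheduled_agent_minutes)


# Round numbers from the example above; both are illustrative, not real data.
SCHEDULED_AGENT_MINUTES = 10_000_000  # agent-minutes in the month
AGENTS_AFFECTED = 700                 # assumes every agent at the center is down

for outage_minutes in (1, 10, 60):
    # Lost time scales with both outage length and the number of agents idled.
    lost = outage_minutes * AGENTS_AFFECTED
    pct = availability_pct(lost, SCHEDULED_AGENT_MINUTES)
    print(f"{outage_minutes:>2} minute outage: {lost:>6} lost agent-minutes, "
          f"{pct:.2f}% available")
```

Swap in a smaller number for the agents affected and you can see how a partial outage changes the picture.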
I have a tiered penalty structure. If I keep availability above 99.97%, I don’t have to pay a penalty. Get lower than that and I pay anywhere from 0.3% up through 6% for terrible availability numbers.
It can be a little brutal, but it’s a good structure. I have a financial incentive to keep my systems up and running at peak efficiency. I have redundant systems backing up my redundant systems. I have primary and secondary datacenters that each include primary and secondary routers hooked up to primary and secondary circuits. I got to help design my system and it’s very robust.
But, stuff happens. Most months I have a perfect track record. Granted, I have maintenance windows that I can use if I need to gracefully take the system down to replace components. But, even then, there are times I lose production hours.
The challenge is that while I’m responsible for the penalty, I don’t often have control over the teams or even the components that my client uses. I own the problem of lost agent hours, but I don’t often own the solution.
During an outage is the wrong time for me to try to fix this issue. I spend much of my time building relationships with the other departments. I manage one of many clients that my company has as customers. I share resources with the other clients. Time spent visiting with other departments might seem like goofing off, but when something goes wrong and minutes count, it’s great to be able to reach out and get the guys who can actually provide the solution to my problems.
Rodney M Bliss is an author, columnist and IT Consultant. His blog updates every weekday. He lives in Pleasant Grove, UT with his lovely wife, thirteen children and grandchildren.
Follow him on
Twitter (@rodneymbliss)
Facebook (www.facebook.com/rbliss)
LinkedIn (www.LinkedIn.com/in/rbliss)
or email him at rbliss at msn dot com
(c) 2017 Rodney M Bliss, all rights reserved