“Did John call you?”
The Head of IT Operations sprinted towards the nearby staircase. He pointed his finger at me right before he started climbing the stairs.
“If not, we should meet,” I heard him say when he had almost reached the top floor.
“Sure, when?” I thought he wouldn’t hear me.
“My calendar is up to date. I will put something on for 30 minutes this afternoon.”
“Sure, no problem,” I said before heading towards my cubicle.
I knew something had been cooking ever since the business rushed the last project into production.
We met that afternoon in the boardroom. I wondered how he had managed to get this room on such short notice; it had the best view, facing the lakeshore.
He started talking well before he grabbed a chair. “John could not make it. We need to solve this ****it.” He threw the papers he was carrying onto the table.
I pretended that I did not know what he was talking about. He first leaned back and then got up in a hurry. “This project has come back to bite us pretty bad. QA was forced to sign off before conducting proper performance testing.”
He went on. “We have been experiencing outage after outage. Most of our reps on the floor and at the clients’ sites are complaining they can’t get their work done.”
“Did we round up the usual suspects?” I had Renault in mind, from that famous scene in Casablanca. “I meant the network, databases, and so on.”
“Sure, we did. But they have no stake in it. We ruled them out.”
He then moved his chair closer to me before letting me know the reason for the invite.
“I asked you to be here to see if you could help me pinpoint the underlying problem. John told me that you helped him recently with a tool that could visualize data. Do you think we can use this tool to mine the server logs? It should be pretty quick, right? What do you think? John thought we should pull you into the loop for help.”
He leaned forward and looked at me.
I liked his newfound interest in the tools that I supported, but I wanted him to know there would be some upfront work.
“Yes, we have piloted the desktop version of this tool on some business users’ laptops. But the server-side tools are not yet operational in production. If your team could help get the logs from the different service areas, I am sure we can figure something out quickly. The turnaround is not going to be as fast as you suggested earlier, though. If you set a reasonable timeframe, I can put my best person on it to help your team.”
“You have time till this Friday.”
I wanted to ask him a couple of questions: who would know the service level objectives (SLOs) of the business services, and who would provide my team access to the server logs for this exercise?
His direct report met me the next day, along with his technical lead. They came with the Service Level Objectives I had asked for. I had a couple of follow-up questions, which took only the first five minutes to sort out. He had also invited five leads from different technical verticals.
SLO set by the business
- 99% of transactions completed within 2 seconds during peak time.
- The average end-to-end response time is less than a second during peak time.
Current Situation
- 99% of transactions completed within 29 seconds during peak time.
- The average end-to-end response time was a little over 11 seconds.
“It was much worse on the rollout day,” one of the leads was quick to add.
The integration team had done some work to get the 99th percentile under 30 seconds, which prevented the sporadic timeouts that had been happening throughout that day. Their “heroic efforts” also brought the average response time down to under 6 seconds.
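To keep the story concrete: the figures above are just percentiles and averages computed over the end-to-end response times recorded in the service logs. The sketch below shows how such a check could be done in a few lines of pandas; the file name, column names, and peak-time window are placeholders I am using for illustration, not our actual schema or the tool we used.

```python
import pandas as pd

# Hypothetical export of the end-to-end service logs (names are placeholders)
logs = pd.read_csv("service_logs.csv", parse_dates=["timestamp"])

# Restrict to peak time, assumed here to be 9:00-16:59
peak = logs[logs["timestamp"].dt.hour.between(9, 16)]

p99 = peak["response_time_sec"].quantile(0.99)  # 99th percentile
avg = peak["response_time_sec"].mean()          # average end-to-end time

print(f"99th percentile during peak: {p99:.1f}s (SLO: 2s)")
print(f"Average during peak:         {avg:.1f}s (SLO: under 1s)")
```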
I learned that the rollout had targeted some of the legacy batch processes: the dev teams converted those batch jobs into real-time services.
The business had been asking IT to do this for many years, and everyone in IT knew it was a fair ask. In batch mode, these jobs ran off-hours, when IT systems were least utilized, and took about six hours to complete end-to-end.
Most jobs used parallel threads to speed up processing. But the nightly batch jobs introduced latency in downstream services, which was tolerable when the company was small.
It was a wake-up call for everyone when our rivals started eating into our market share. We realized the impact of the technical debt associated with this legacy processing.
End-to-End Process Flow
I had asked for an end-to-end diagram to understand the new process. Unfortunately, there wasn’t a single view representing the overarching process. As mentioned earlier, we worked with the team leads to build that process map on the whiteboard.
The above is an oversimplified view; the actual process was far more complicated. We revisited this diagram each time we found a gap in our understanding. It also helped the teams simplify the process in the following releases.
Averages
The “Trends view” is simple to interpret. It represents the average response time, which was a little over 6 seconds on the day my team got involved; we brought it down to less than a second. The visuals below show the path to achieving that target.
The 95th percentile view presents the before-and-after scenario. Before we got involved, the 95th percentile stood at twenty-seven seconds; we brought it under two seconds.
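For readers who want to reproduce a similar trends view without the tool, a daily rollup of the average and the 95th percentile is enough. The sketch below assumes the same placeholder log export as before.

```python
import pandas as pd

logs = pd.read_csv("service_logs.csv", parse_dates=["timestamp"])
times = logs.set_index("timestamp")["response_time_sec"]

# One row per day: the average and the 95th percentile of response time
daily = pd.DataFrame({
    "avg": times.resample("D").mean(),
    "p95": times.resample("D").quantile(0.95),
})

print(daily)
daily.plot()  # rough stand-in for the trends chart (requires matplotlib)
```

Plotting both series on one chart makes it obvious when a fix moves the tail (p95) but not the average, or the other way around.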
Long-running service calls charts
We used the ‘Top long-running service calls’ chart to focus on the outliers first. The chart showed the end-to-end response time.
We identified the outliers and classified the problem areas as either hardware-related or code-related. The team concerned took ownership of the issues identified and fixed the underlying causes.
This view presented the individual transactions within each end-to-end call.
We also looked at summary statistics such as the mean and standard deviation.
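A rough equivalent of this analysis, assuming a per-segment log with a call identifier, a service name, and a duration column (all placeholder names, not our real schema), could look like this:

```python
import pandas as pd

# One row per individual service call segment; column names are assumptions
segments = pd.read_csv("service_segments.csv")

# End-to-end duration per call, then the slowest calls (the outliers)
e2e = segments.groupby("call_id")["duration_sec"].sum().sort_values(ascending=False)
outliers = e2e.head(20)

# Break each outlier down into its individual service calls
detail = segments[segments["call_id"].isin(outliers.index)]
by_service = (
    detail.groupby(["call_id", "service"])["duration_sec"].sum().unstack(fill_value=0)
)
print(by_service)

# Summary stats (count, mean, std, quartiles) keep the outliers in perspective
print(e2e.describe())
```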
Analyzing the temporal dimension helped us see the patterns we needed to close the gap.
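One simple way to look at that temporal dimension is an hour-of-day rollup, sketched below with the same placeholder columns as earlier; spikes at specific hours usually point to load peaks or scheduled jobs.

```python
import pandas as pd

logs = pd.read_csv("service_logs.csv", parse_dates=["timestamp"])

# Average response time and call volume by hour of day
by_hour = (
    logs.groupby(logs["timestamp"].dt.hour)["response_time_sec"]
        .agg(["mean", "count"])
)
print(by_hour)
```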