Capgemini Australia

When things go wrong! Examining your IT Crisis Point

Nowhere is pressure more apparent in IT than when a Priority 1 (P1) Incident occurs. The impacts of a P1 on an organisation can be significant. Often, during a P1, money is walking out the door (or not getting in the door) until it is resolved. Customers leave your business. Contracts get cancelled. Company reputations and future earnings are impacted. News can be made. It is not pleasant.

Having worked across multiple functional areas of organisations, there’s nothing quite like the pressure created in IT when a P1 occurs, and you’re cast as the individual in the spotlight to fix it. Yes, every job has pressure. Every job has opportunities for failures to occur – project deadlines are missed, financial forecasts are wrong, significant mistakes are made. And, sometimes, it can be a very public failure too. But there are not many jobs or roles that not only put you in the spotlight because of a failure, but do so with your problem-solving ability and internal thought process in full view of what can often be the most senior stakeholders in and out of the organisation. The analogy of turning up to school in your underwear can feel not too far off.

In the middle of a P1, everyone wants to know what is going on – when did it start, how many customers are impacted, why hasn’t it been fixed yet, and when will it be fixed? Tempers flare. Fingers are pointed. Stakeholders want constant updates on the status. The impact is that there is often very little time to effectively focus on actual problem solving during these Incidents – with small windows to analyse the problem followed by conference calls to report on the analysis and impacts. You are pulled in one hundred directions at once. “Check this.” “Now, look at that.” On each call, someone more senior joins, and you repeat what you know again and again. Stakeholders will want to throw more people at the problem. More and more cooks enter the kitchen with a plan on what to do next.

It is intense. Not everyone can handle it. Even senior, high-performing team members with significant IT experience – team members who shine in every other aspect of their work – can struggle to perform in this high-pressure situation. Those who do it well have the right combination of abilities – a mix of skill, experience and personality. And, often, effective P1 resolution comes down to two people in an organisation – a Technician and a Leader.

During the P1 Incident, the Technician will do what good technicians do – collect and analyse technical information in a methodical way to determine what is wrong and where. An application failure’s root cause can reside in one of two locations – on one (or more) of the appliances that host the application, or somewhere else. Technicians use this knowledge to capture diagnostic data and apply this either/or logic repeatedly to localise the issue. “Server A can’t contact Server B, but Server C can contact B? The problem must be on Server A or in the path between Server A and Server B. My next step – identify whether it is Server A or another appliance that forms the path.” Once they’re down to an appliance, they can analyse the parts of the application deployed there to localise further. “The experience layer accepted and processed the request. That looks ok. I’ll check the process layer next and work down the stack.”
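The either/or narrowing above can be sketched as a simple elimination over probe results. The servers and reachability data here are hypothetical – a minimal illustration of the logic, not a real diagnostic tool:

```python
# Minimal sketch of the Technician's either/or localisation logic.
# The reachability results are hypothetical stand-ins for real probes
# (ping, telnet, curl) run during the incident.

def localise(reachable):
    """Narrow the fault down from pairwise reachability probe results."""
    a_to_b = reachable[("A", "B")]
    c_to_b = reachable[("C", "B")]
    if not a_to_b and c_to_b:
        # B responds to C, so B itself looks healthy:
        # suspect Server A or the A -> B network path.
        return "Server A or the A->B path"
    if not a_to_b and not c_to_b:
        # Nothing reaches B: suspect Server B itself.
        return "Server B"
    return "no fault localised here"

# Example: A cannot reach B, but C can.
probes = {("A", "B"): False, ("C", "B"): True}
print(localise(probes))  # Server A or the A->B path
```

Each probe result halves the search space, which is why the Technician keeps asking for one targeted check at a time rather than broad sweeps.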

These people work fast through this process. And when they can’t gather or validate something themselves, they know exactly what they’re after, so they can direct others on what needs to be done. In highly distributed application environments, the Technician may be leading other technicians through activities – i.e. be less hands-on. However, in most circumstances, it is best to have a single Technician directing others, rather than a team of technicians collaborating. That is, you still have “a Technician”.

The Technician must be able to perform under pressure. Stealing from a movie – “some people, you squeeze them, they focus. Others fold”. In this situation, you need to be someone who can focus. Someone who can block out the noise (and there will be an awful lot of noise), analyse and solve problems. These people can manage their natural anxiety responses – heart rate increasing, mind racing, etc. – and channel that adrenaline into the tasks of identifying and resolving the problem.

Additionally, the Technician needs broad IT experience. They can think across the multiple layers of the abstracted technology stack in play. They understand the applications and application architecture at many levels – distributed vs monolith, high availability and disaster recovery (side note – I am amazed by how many in IT do not understand the fundamental difference between these two concepts; possibly because IT did itself no favours by abbreviating them as a single term – HA/DR, for High Availability/Disaster Recovery). But they also understand operating systems (memory allocation, OS policies, local security), virtualisation (compute resource allocation, physical host architecture), networking (DNS, protocols, proxies, QoS, devices) and how to rapidly troubleshoot each.

It’s why developers can often struggle in these scenarios – they generally think of how to solve problems only within the domain where they have expertise – “I’ll just need to add some debug logging to my code”. And, if the problem is visible in their domain yet has a root cause outside it, that can be challenging. The P1 Technician is as comfortable analysing a Wireshark trace as they are a stack trace.

The best technicians can quickly visualise the end-to-end picture, rapidly assess information and detect what “doesn’t seem right” about a data set. They use this to find the right entry point to investigate and progress from there. They apply a methodical approach to problem-solving. Those who struggle jump from A to K to W, back to A, to C, back to A. Those who excel will find empirical evidence to rule out A, then move on. They also know much earlier than others when to change the investigative approach. Many inexperienced technicians struggle to extract themselves from rabbit holes in these situations, which is a challenge as there are often red herrings aplenty.

The best technicians know that the sledgehammer – a restart – is rarely, if ever, more than a temporary fix and is not a recommended resolution approach. It does not fix the underlying issue. Stakeholders will apply significant pressure for a restart to remediate the issue quickly (mainly because IT help desks have shown it is often a good fix for PC problems). However, the same circumstances usually reoccur rapidly, and you will have exacerbated the issue by losing valuable investigation time while creating service instability. Intermittent instability is often worse for customers than no service at all.

Once they narrow down the underlying cause, technicians lead the remediation effort – possibly a temporary workaround rather than a permanent fix, as it can be much quicker to implement. These activities may involve scripted fixes, network routing updates, configuration changes or, if developer code changes are needed, ensuring the fix is propagated across the host platform.

What about the Leader? What do they do, and who are they? The Leader is simply the person who sees they are the one accountable for resolving the situation. They take ownership. Note the term “who sees”, not “who is”. What’s the difference? Well, hopefully, they are one and the same person. But the one “who is” accountable may not be the right person. Leadership can be circumstantial, and, much like Technicians, some Leaders don’t lead well in pressurised situations. And this can create a void to be filled. The person “who sees” they are accountable will accept ownership of the situation. And this acceptance creates the leadership needed. Now “is accountable” does not mean they were responsible for the issue or have to fix it because it is their job. “Is accountable” means they know they are the one person most able to resolve the situation. Is it a Service Delivery Manager, a Team Manager? Maybe. If you’ve got the right people in the right roles, then hopefully it is.

What does the Leader do? The Leader is the person who knows who the Technician needs to be and then creates the space for the Technician to work. They understand the Technician’s activities and can relay these in simple terms to the stakeholders, allowing the Technician more time to investigate instead of explaining. They have the Technician’s respect and can course-correct the Technician when needed – even the best Technicians can get stuck down rabbit holes.

The Leader has a technical background – this is what can sometimes force CXOs to be bystanders during these events. In these situations, even the best Technicians tend to introduce lots of superfluous information and IT jargon into the investigative process. Leaders understand this information, but distil it – they use clarifying questions and statements to narrow the Technician’s responses down to something closer to yes/no. They are then able to articulate status accurately to stakeholders.

They also manage stakeholders. They not only communicate the status, but they’ll also hear out suggestions made (and there will be a lot) and help progress any that don’t impede progress. But they’ll temper those that will. They set the plan and expectations with stakeholders on what will be done next and when to reconvene.

Then it is a matter of the Leader and the Technician working together. If the Technician needs something, e.g. another Technician to gather specialised information, the Leader gets the right person engaged. If the Technician needs time to gather data or analyse data, the Leader creates it with the stakeholders. If stakeholders are requesting multiple conflicting activities to occur in parallel, the Leader sets the priority and expectations of what will occur and in what order. It’s an exhausting event for the individuals involved.

Generally, after the P1 Incident, the team responsible for the component that failed is immediately tasked with creating a Post Incident Report, a PIR. PIRs attempt to capture the root cause of the issue. Here, the Leader will be the one who pushes the Technician to get to the root cause. Technicians often want to take the shortest path on the analysis (understandable, given what they just went through). But the team should strive to identify the actual root cause via Root Cause Analysis (RCA). “The application failed due to insufficient memory” is not a root cause. There are several strategies to identify the root cause correctly, e.g. the 5 whys. But this is one area where the skill and experience of the Leader is the most critical factor. They will know when you’ve dug below the superficial reason and the true root cause has been captured correctly.
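For the “insufficient memory” example, a 5-whys drill-down might look like the following sketch. The chain of answers here is invented for illustration – a real RCA must rest on evidence gathered during the incident, not on plausible guesses:

```python
# Hypothetical "5 whys" chain for the "insufficient memory" failure above.
# Each answer becomes the subject of the next "why".

five_whys = [
    ("Why did the application fail?",
     "It exhausted its memory."),
    ("Why did it exhaust its memory?",
     "An in-memory cache grew without bound."),
    ("Why did the cache grow without bound?",
     "Cache eviction never ran."),
    ("Why did eviction never run?",
     "The eviction policy was disabled in configuration."),
    ("Why was it disabled?",
     "A load-test configuration change was promoted to production without review."),
]

def root_cause(chain):
    # The candidate root cause is the answer to the final "why".
    return chain[-1][1]

print(root_cause(five_whys))
```

Note how the chain ends at a process failure, not a symptom – “insufficient memory” appears only at the first why. That is the depth the Leader pushes for.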

With the actual root cause effectively identified, remediation plans can be created and implemented to provide a permanent fix, if not already done. This can include technology changes, process changes and/or people changes to reduce or eliminate the likelihood of reoccurrence.

Effective monitoring of solutions hasn’t been discussed in this blog post. It is critical to deploy effective monitoring to maximise application availability, so why not mention it? For the very reason that a P1 is an exception case. Your monitoring (hopefully) indicated that you had an issue before the customer did, but if it could pinpoint the issue to a known problem, you would have solved it already. You would not be in the scenario above. Instead, you are in the scenario where the monitoring data does not help or, worse, was incorrect and hiding an issue.

However, what is commonly missed in a PIR – and is potentially its most crucial part – relates to monitoring. Application uptime and performance need to be maximised for an organisation to achieve the maximum return on investment – be it monetary or otherwise. A P1 represents a significant failure on this front. Therefore, unless you ask yourself the following questions, you’re destined to go through it again, just with different parameters:

1. Why didn’t our monitoring detect the situation before it impacted customers?
2. What monitoring would have pinpointed the issue and enabled a resolution with minimal impact?

I’ve covered the approach to holistic application monitoring in another post. That post covers what should be monitored as a baseline. The questions above should identify your monitoring gaps against this baseline that would have effectively detected the situation occurring, helped pinpoint the issue and, hopefully, allowed a resolution with minimal impact instead.
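One way to make that gap check concrete is a simple set difference between a monitoring baseline and what is actually deployed. Both sets below are illustrative assumptions, not the baseline from the other post:

```python
# Hypothetical monitoring gap check: baseline coverage vs what is deployed.
# Replace both sets with your organisation's actual baseline and inventory.

baseline = {
    "synthetic end-to-end transactions",
    "application error logs",
    "process memory and GC",
    "host CPU and disk",
    "DNS resolution checks",
    "inter-service network latency",
}
deployed = {"application error logs", "host CPU and disk"}

# Anything in the baseline but not deployed is a gap to remediate.
gaps = sorted(baseline - deployed)
for gap in gaps:
    print("missing:", gap)
```

Trivial as it looks, writing the baseline down and diffing it after every P1 is what turns the PIR from a one-off report into a feedback loop.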

Also, keep in mind that the fix you implemented to resolve the P1 was very likely local to the specific issue encountered. It may stop that issue recurring, but it is unlikely to resolve similar issues you haven’t encountered before. Therefore, identify the answers to these two questions and swiftly implement solutions to improve your monitoring capabilities. This is how you continue to improve application uptime.

Learn more about our MuleSoft DevOps Managed Service or contact us for an in-depth discussion.
Jeff Nowland is Director of MuleSoft Managed Services at Capgemini and has 10 years’ experience in Managed Services