Detecting problems quickly with a smart IT monitoring system

Jeff Nowland
16 June 2021

“We already use MuleSoft Anypoint Monitoring”

This is the common response I get when I ask customers how they are monitoring their MuleSoft solutions. Now, don’t get me wrong, Anypoint Monitoring is a great foundation for MuleSoft solution monitoring but, unless you have a Titanium subscription, it is just that – a foundation. It provides useful functionality, but if your goal is a holistic MuleSoft monitoring solution – whereby you can quickly detect problems at any layer of the solution – you’re going to need a broader approach.

IT is all about layered abstraction. Each layer improves upon the functional capability of the layer below. It is only the top layer, the Application Context Layer, where business value is realised. All layers below are needed yet provide no business value on their own without the Application Context Layer.

Working from the top down, for modern IT solutions, the stack looks something like this:

  • Application Context Layer – the applications and APIs where business value is realised
  • Application runtime environment
  • Operating System
  • Virtual machine / Infrastructure (e.g. IaaS such as AWS)
  • Network
  • Physical

In reality, it is not that neat. Layers overlap in some contexts and don’t exist in others. Even IaaS, PaaS and SaaS solutions have these layers, but the service may abstract them from view or access. But, importantly for all, we do end up at the bottom, where electrical signals are traversing a wire – good ol’ ones and zeros, the foundation of IT. The important point to note is that layer-to-layer awareness only works top-down. The layers above can be aware of the layers below, but not the other way around. This is important to understand for how you monitor IT solutions.

Monitoring has a core goal – to maximise application uptime whilst maintaining peak performance. This is achieved by detecting when the baseline of normal has been breached, allowing remedial action to be taken. If all the layers combined are required to deliver the application, and layer-to-layer awareness only works one way, then it’s worth considering what exactly you are monitoring today and how you are monitoring it. For example, if a problem within the Network layer impacts the Application Context Layer, could you quickly identify the root cause when you see an ERROR in your application logs? And if so, how quickly? Would you even know there is a problem? With the historical approach to application monitoring, the answer was usually no.

Historically, IT application monitoring was performed by IT Pros whose primary skills and experience sat at the Operating System layer and below. It was common to monitor application performance by measuring metrics at the Operating System layer with Perfmon or a similar diagnostic tool. You would measure memory, CPU and network utilisation and look for spikes in these metrics that might point to a problem with the application. This was better than nothing, but often only barely. The approach attempted to diagnose application-layer problems via diagnostic monitoring at the Infrastructure layer. When an alert triggered due to an infrastructure metric breach, say 90% CPU, it was commonly due to a regular fluctuation in application workload. The application itself was in good health and time was wasted investigating. It often became a case of “the boy who cried wolf”, and alerts would be ignored by the team.

This was compounded when an actual issue occurred at the application layer, say an ERROR experienced by the user: it went undetected because the infrastructure metrics indicated standard application performance. When you put this approach in the context of the layered abstraction described above, the problem is clear. Teams were trying to detect top-tier Application Context Layer issues by looking at metrics gathered from the Operating System layer, which is not application-context aware. That made the diagnostic information almost worthless.

Application monitoring as a practice, and its associated tooling, has become far more advanced over the last five or so years with the proliferation of log aggregation and application telemetry tools. But tools alone only get you so far. To effectively monitor MuleSoft, or any IT solution for that matter, you need to consider the multiple layers of abstraction involved in delivering the application to users (users in this context can be human or other systems). And within each layer, you need to consider the types of problems that can occur, how they can be detected and how they should be managed when they occur. This is how you achieve Holistic Application Monitoring – true breadth and depth of visibility of your application.

Leveraging MuleSoft monitoring to its fullest

Using MuleSoft solutions as an example, the first step in achieving holistic MuleSoft monitoring is to start from the top layer – the Application Context Layer. In the MuleSoft world, this is the API layer – the specific API solutions that have been built for your organisation. It is where the business value resides. MuleSoft Anypoint Monitoring operates at the runtime level, i.e. it is not API-context aware, because MuleSoft did not write your API code. So how do you achieve monitoring at the API layer? You need to implement a solution that operates where the context information is richest, i.e. in the executing code of the API itself. Sometimes this is best done by creating runtime error handler functions that are called when an unexpected event occurs. These functions capture as much context information as possible, package it up and send it for transport. The information should include stack traces, error codes, performance metrics and other diagnostic data. Importantly, because the function runs in real time during code execution, it provides the richest source of information on the issue.
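To make that concrete, here is a rough, language-neutral sketch of the idea (in a Mule flow you would express this with error handler scopes rather than Python; all of the names and fields below are illustrative, not part of any MuleSoft API):

```python
import sys
import time
import traceback
import uuid


def capture_error_context(api_name, correlation_id=None):
    """Package up as much context as possible at the point of failure.

    Intended to be called from inside an except block, where sys.exc_info()
    still refers to the active exception.
    """
    exc_type, exc_value, _ = sys.exc_info()
    return {
        "event_type": "ERROR",
        "api": api_name,
        "correlation_id": correlation_id or str(uuid.uuid4()),
        "timestamp_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "error_class": exc_type.__name__ if exc_type else None,
        "error_message": str(exc_value) if exc_value else None,
        "stack_trace": traceback.format_exc(),
    }


# Usage: wrap the business logic and hand the payload to your transport of choice.
try:
    raise ValueError("backend returned an unexpected payload")
except ValueError:
    payload = capture_error_context("customer-api", correlation_id="abc-123")
    print(payload["error_class"], payload["error_message"])
```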

There are a multitude of options for the “transport” of the information captured at this layer, some better than others. “Low tech” solutions include email notifications, and writing information to logs followed by log shipping. These are less desirable because of the delay they introduce. Email is a relay service, not a real-time delivery mechanism; by the time you see the email in your inbox, significant time could have been lost. Log forwarding requires that logs are processed and indexed before you can use the information, and large logs mean a slower process. However, these methods do have the benefit of being robust and more fault tolerant. Alternative options include direct integration (e.g. webhooks) with tools such as Azure App Insights, Datadog or Elastic – tools that provide purpose-built functionality for this. Integration with these tools from within your APIs is simple when done correctly. And, best of all, the webhook call happens in real time.
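As a minimal sketch of the direct-integration option, the snippet below posts a captured event to a monitoring tool’s HTTP ingestion endpoint in real time and falls back to the local log (i.e. the slower log-shipping path) if the webhook is unreachable. The endpoint URL and token are placeholders, not the real Datadog, App Insights or Elastic APIs.

```python
import json
import logging
import urllib.request

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("api.monitoring")

WEBHOOK_URL = "https://monitoring.example.com/ingest"  # placeholder endpoint
API_TOKEN = "changeme"                                 # placeholder credential


def ship_event(payload, timeout=2.0):
    """Send a monitoring event via webhook; fall back to the local log on failure."""
    request = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError as exc:
        # The robust but slower path: write to the log and rely on log shipping.
        logger.error("webhook delivery failed (%s); event=%s", exc, json.dumps(payload))
        return False
```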

Whichever option you choose, getting the monitoring implementation of this API layer right is the most critical step. And, regardless of the application type, governance is critical – it is important to establish a standard approach to be followed by all development teams. It should dictate how different events are to be handled – errors versus identified performance fluctuations, and so on. Care needs to be taken to implement effective logic in the solution itself for the different events. For example, in some APIs it is not useful to raise an Error event when a 401 response occurs if that is the expected flow for the process; it is not an error, after all. Sloppy logic will lead to a very noisy monitoring solution, and the noisier your monitoring solution is, the less effective it is – it will require hours of tuning investment before it becomes useful.
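The 401 example above can be encoded as a small classification step shared by every API’s handler, so that “expected” responses never generate alerts. The status codes, thresholds and event names below are assumptions for illustration; your governance standard would define the real ones.

```python
EXPECTED_STATUSES = {401, 404}   # expected in some flows; not faults
PERFORMANCE_THRESHOLD_MS = 1500  # example per-API response-time ceiling


def classify_event(status_code, elapsed_ms):
    """Map an API response to a monitoring event type, suppressing expected noise."""
    if status_code >= 500:
        return "ERROR"
    if status_code in EXPECTED_STATUSES:
        return "NONE"            # expected flow, e.g. a 401 challenge
    if status_code >= 400:
        return "WARNING"
    if elapsed_ms > PERFORMANCE_THRESHOLD_MS:
        return "PERFORMANCE"
    return "NONE"


assert classify_event(401, 120) == "NONE"          # expected 401: no alert raised
assert classify_event(503, 120) == "ERROR"         # genuine fault
assert classify_event(200, 4000) == "PERFORMANCE"  # slow but successful transaction
```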

Governance should also mandate best-practice approaches that significantly improve the context information, such as persistent transaction IDs wherever possible. Within your MuleSoft solution, this means ensuring a persistent GUID is carried through all internal API calls. This allows the rich contextual pieces of the transactional puzzle to be stitched together so you have the full picture. Where possible, this should be extended to consumer applications – those outside of MuleSoft that call MuleSoft APIs or are called by them. A single transaction GUID should be passed to and from all applications during the end-to-end transaction. When implemented correctly, this provides a view of the entire transaction in context, enabling rapid diagnosis of root cause. That, in turn, enables rapid recovery and helps maximise application uptime.
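One common convention, assumed here rather than mandated by MuleSoft, is to carry the GUID in an `X-Correlation-ID` header, generating one only when the caller has not supplied it:

```python
import uuid

CORRELATION_HEADER = "X-Correlation-ID"  # assumed header name; align with your standard


def with_correlation_id(incoming_headers):
    """Reuse the caller's transaction GUID if present, otherwise mint a new one.

    The resulting headers are then passed on every downstream API call so that
    all log entries for the transaction can be stitched back together.
    """
    outgoing = dict(incoming_headers)
    outgoing.setdefault(CORRELATION_HEADER, str(uuid.uuid4()))
    return outgoing


# A consumer that already supplies an ID keeps it end to end...
assert with_correlation_id({CORRELATION_HEADER: "abc-123"})[CORRELATION_HEADER] == "abc-123"
# ...while a call arriving without one gets a fresh GUID.
assert CORRELATION_HEADER in with_correlation_id({})
```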

Implementing some form of log aggregation for your APIs is also an important consideration, particularly given the need to process diagnostic data from multiple sources, not just the APIs. MuleSoft CloudHub APIs can be configured to ship logging data to a log aggregation tool or service. Even though the process can carry a time cost, due to the size of the logs, the capabilities available once logs are indexed are significant – including options like AI analysis of the log files to detect problems. There are considerations but, in general, integration with your APIs is recommended:

https://docs.mulesoft.com/runtime-manager/custom-log-appender
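In Mule itself this is done with a custom log4j appender, as the documentation above describes. As a language-neutral illustration of the same pattern, the sketch below attaches Python’s standard HTTPHandler so that every log record is also forwarded to an aggregation endpoint; the host and path are placeholders.

```python
import logging
import logging.handlers

# Placeholder aggregation endpoint -- in Mule this role is played by the
# custom log4j appender described in the documentation linked above.
forwarder = logging.handlers.HTTPHandler(
    host="logs.example.com:443",
    url="/ingest/mule-apis",
    method="POST",
    secure=True,
)
forwarder.setLevel(logging.INFO)

logger = logging.getLogger("customer-api")
logger.setLevel(logging.INFO)
logger.addHandler(forwarder)

# Each record now goes to the aggregation service as well as any local handlers,
# where it can be indexed, searched and analysed once ingested.
logger.info("order submitted; correlation_id=abc-123")
```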

What about the layers below? Effective monitoring should also be established at each of them. For each layer, consider how you can effectively monitor and detect the following types of events for that tier (a minimal sketch of this follows the list):

  • Error – A fault has occurred on this tier
  • Performance – An unacceptable performance threshold has been exceeded for the tier
  • Capacity – A capacity measure of some kind has been exceeded for the tier
  • Availability – A measure that the solution tier is functioning as expected
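One way to keep these four categories consistent across tiers is to record every alert against the same small set of event types. A minimal sketch follows; the tier names and example events are illustrative only.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    ERROR = "error"                # a fault has occurred on this tier
    PERFORMANCE = "performance"    # a performance threshold has been exceeded
    CAPACITY = "capacity"          # a capacity measure has been exceeded
    AVAILABILITY = "availability"  # the tier is not functioning as expected


@dataclass
class MonitoringEvent:
    tier: str                      # e.g. "api", "runtime", "operating-system", "network"
    event_type: EventType
    detail: str


# Example events, one per category -- the tiers and wording are illustrative.
events = [
    MonitoringEvent("api", EventType.ERROR, "unhandled exception in order flow"),
    MonitoringEvent("runtime", EventType.PERFORMANCE, "p95 response time above baseline"),
    MonitoringEvent("operating-system", EventType.CAPACITY, "local disk above 90% used"),
    MonitoringEvent("network", EventType.AVAILABILITY, "external heartbeat returned 404"),
]
for event in events:
    print(f"[{event.tier}] {event.event_type.value}: {event.detail}")
```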

For MuleSoft, the application runtime environment tier is where Anypoint Monitoring provides significant value. It allows monitoring of the underlying worker processes and can detect when workers fail, as well as CPU bursts and high memory utilisation. You should have a clear plan for which monitoring rules to implement in Anypoint Monitoring.

If API Manager is also in use, its monitoring capabilities can go a long way towards addressing the performance-monitoring requirement for this tier. API Manager can detect performance issues by measuring transaction response times. By establishing a baseline of what “normal application performance” looks like, alerts can be configured to fire once performance degrades:

https://docs.mulesoft.com/api-manager/2.x/using-api-alerts
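API Manager lets you configure these alerts directly; the underlying idea can be sketched as: derive a baseline from normal response times, then flag transactions that exceed it. The sample data, percentile choice and headroom multiplier below are assumptions for illustration.

```python
import statistics

# Illustrative sample of "normal" response times, in milliseconds, used as the baseline.
baseline_samples_ms = [120, 135, 128, 140, 119, 131, 450, 125, 133, 127]

# A simple baseline: the 95th percentile of normal traffic, plus 50% headroom.
p95_ms = statistics.quantiles(baseline_samples_ms, n=20)[-1]
alert_threshold_ms = p95_ms * 1.5


def breaches_baseline(response_time_ms):
    """Return True when a transaction is slow enough to warrant an alert."""
    return response_time_ms > alert_threshold_ms


print(f"p95 baseline: {p95_ms:.0f} ms, alert threshold: {alert_threshold_ms:.0f} ms")
print(breaches_baseline(180))   # within normal fluctuation
print(breaches_baseline(2500))  # degradation worth alerting on
```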

If you do have a Titanium subscription, which in many cases would be recommended, there are additional monitoring capabilities that can help maximise uptime for your APIs.

Moving down the stack, because MuleSoft CloudHub is a Platform-as-a-Service solution, the AWS layer and the physical layers below are abstracted away, so no data can be gathered from them. However, if you are running MuleSoft in an “on-premise” deployment model, the operating system and virtual machine metrics should also be monitored and consolidated. Even with elastic cloud capabilities, resources often have logical limits enforced and cannot scale dynamically – local disk on the VM, for example. There is a long history of applications being brought down in Production by rogue log file growth, so knowing which areas to measure at this layer is important.
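For the rogue-log-growth example specifically, a capacity check at this layer can be as simple as the sketch below. The path and the 90% threshold are assumptions; a real deployment would feed this measurement into whatever alerting tool is already in place.

```python
import shutil

LOG_PATH = "/var/log"      # illustrative path where runaway log growth tends to bite
CAPACITY_THRESHOLD = 0.90  # alert once the volume is 90% full (assumed threshold)


def disk_capacity_breached(path=LOG_PATH, threshold=CAPACITY_THRESHOLD):
    """Return True when the volume holding `path` has crossed the capacity threshold."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total >= threshold


if disk_capacity_breached():
    print("CAPACITY event: log volume nearly full -- investigate log growth")
```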

Finally, at the networking layer, effective ways to monitor and detect the four critical event types are also important. Establishing heartbeat monitors, or running synthetic transactions against the Production APIs, ideally from multiple vantage points within your consumer network (e.g. testing both internal and external consumption of the APIs), is one way to gather the information needed to detect an underlying issue at the network layer. When a monitor receives a 404 response when called externally but a 200 response internally, you have a very good indication that the issue sits at the network perimeter.
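A minimal synthetic-transaction probe might look like the sketch below, run on a schedule from both an internal and an external vantage point. The URL is a placeholder; in practice a dedicated health-check resource on the API is a better target than a business endpoint.

```python
import time
import urllib.error
import urllib.request

HEALTHCHECK_URL = "https://api.example.com/customers/v1/health"  # placeholder endpoint


def probe(url=HEALTHCHECK_URL, timeout=5.0):
    """Run one synthetic transaction and report the status code and latency."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code   # e.g. the external 404 in the scenario above
    except OSError:
        status = None       # no response at all: likely a network-layer fault
    return {"status": status, "latency_ms": (time.monotonic() - started) * 1000}


# Run this from both inside and outside the network and compare the results:
# a 200 internally but a 404 externally points at the network perimeter.
print(probe())
```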

Additionally, networking devices such as NGINX, load balancers and perimeter devices can be integrated with log aggregation solutions, with similar monitoring profiles established across the four critical event areas. Ingesting the right data is key to detecting issues and resolving them swiftly.

To enhance the approach further, consider what sits beside the stack and how it too can be monitored. This includes monitoring the performance of any dependent applications, such as the message queues and databases that may be used to deliver API solutions. Databases can also be integrated with log aggregation solutions. Further, diagnostic logs from source and target applications can be aggregated together and, if the unique GUID approach is implemented, provide comprehensive end-to-end transaction tracing in near real time.

A key point about monitoring, and about the consumption of all the diagnostic data collected, is that it should not be viewed as a replacement for a team of skilled people who can deal with issues as they occur. Indeed, monitoring, and automation more broadly, increases the importance of the role people play in resolving critical issues. Why? Because the monitoring has already identified that you are dealing with an exceptional case – the baseline of normal has been breached, hence the alert. Automated recovery solutions will always struggle to handle exception cases, for the very reason that they are exceptions. This remains true even with advances in AI and application fault-tolerance design. Monitoring is still required, and intervention from a team of skilled individuals applying creative intelligence is commonly needed to resolve exceptional issues.

Looking for something you can deploy quickly and easily? Our MuleSoft Managed Service comes with solutions for holistic MuleSoft application monitoring. With years of experience working on MuleSoft implementations, we know how to do this, and do it well, with a proven methodology and processes. If you’d like to know more about our MuleSoft Managed Service, contact us for an in-depth discussion.

Jeff Nowland is Director of MuleSoft Managed Services at Capgemini and has 10 years’ experience in Managed Services