Site reliability engineering in ADM services: A practical approach to predictive maintenance  

Aliasgar Muchhala
30th May 2024

It has been estimated that application maintenance costs account for more than 90% of the total cost of application development, as compared to 50% a couple of decades ago.

The cost of software and application maintenance has reached an all-time high as a result of increasing demand for business agility. To meet that demand, organizations are constantly deploying new features across an increasingly complex landscape of distributed computing, cloud, micro-architectures, and on-demand services, among others. Maintenance in this environment is often reactive: problems are fixed only after they surface. That works well when the stakes aren’t so high, but a complex system requires a more measured approach to maintenance.

Preventive maintenance takes a more defensive, proactive approach to averting potential outages through implementing various checkpoints across processes to assess the systems’ quality. However, sometimes this morphs into an over-regulated, ultra-defensive system riddled with tollgates before a problem gets solved, which defeats the purpose of being able to respond to issues in a timely fashion.

To avoid the bureaucratic bottleneck, many enterprises turn to an AIOps platform that uses anomaly detection to predict and detect failures and to automate recovery actions with minimal engineer intervention. However, enterprises struggle to derive value from AIOps platforms because of the massive amounts of data needed to make accurate predictions, which leaves AIOps more theoretical than practical as an approach to maintenance.

Only 12% of companies adopting AIOps use it as part of their day-to-day operations – nearly 40% don’t use it at all.

To evolve IT systems into the hyper-efficient, self-healing systems that can run without becoming a maintenance nightmare, we need a different approach.

Insight-guided predictive maintenance

Automating maintenance is the ideal: a self-healing system that requires little outside input saves person-hours and operating budgets. That automation can be amplified with a Site Reliability Engineering (SRE)-driven approach to AIOps.

SRE uses a framework to align business key performance indicators (KPIs) with value stream flow metrics, which establishes service-level indicators (SLIs) and service-level objectives (SLOs). Together, these SLIs and SLOs express what a business expects to deliver for the best customer experience, and they underpin the commitments made to customers in service-level agreements (SLAs).
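To make the relationship between an SLI and an SLO concrete, here is a minimal Python sketch. It assumes an availability-style SLI (the share of successful requests) and expresses the gap to the SLO as the remaining error budget. The names Slo, availability_sli, and error_budget_remaining are illustrative and do not come from any specific SRE tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A service-level objective: a target value for one SLI (hypothetical type)."""
    name: str
    target: float  # e.g. 0.99 means 99% of events must be good


def availability_sli(good_events: int, total_events: int) -> float:
    """SLI as the fraction of good events; no traffic counts as fully available."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(sli: float, slo: Slo) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    allowed_failure = 1.0 - slo.target
    if allowed_failure <= 0.0:
        # A 100% target leaves no budget: any failure exhausts it.
        return 1.0 if sli >= 1.0 else 0.0
    spent = 1.0 - sli
    return 1.0 - spent / allowed_failure


# Example with assumed numbers: 995 good requests out of 1,000 against a 99% objective.
checkout = Slo(name="checkout-availability", target=0.99)
sli = availability_sli(good_events=995, total_events=1000)
budget = error_budget_remaining(sli, checkout)
```

With these assumed numbers, half of the allowed failures have been used, so about half of the error budget remains. A shrinking budget is one signal a team could use to decide which scenarios deserve predictive automation first.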

When the SRE parameters are combined with AIOps, teams can analyze only the data streams deemed necessary according to the established SLIs and SLOs. A unified observability dashboard provides a holistic view of the current state of apps, infrastructure, and data, helping to identify potential problem scenarios. This SRE data analysis, augmented by Generative AI capabilities, precedes any decisions on automating a solution. If automation is found to provide value, the team will build predictive models specifically for those scenarios. Depending on the maturity of a system, these models can also automatically trigger the appropriate corrective measures, preventing the problem scenarios from actually occurring.
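The flow described above can be sketched in a few lines of Python: keep only the data streams that map to an established SLI, then forecast each one and flag those expected to cross their threshold. This is a sketch under stated assumptions, not the models used in practice. A least-squares linear trend stands in for a real predictive model, and the stream names, thresholds, and function names are all hypothetical.

```python
def select_streams(streams: dict[str, list[float]],
                   sli_metrics: set[str]) -> dict[str, list[float]]:
    """Keep only the streams that back an established SLI."""
    return {name: values for name, values in streams.items() if name in sli_metrics}


def forecast_next(values: list[float], steps_ahead: int = 1) -> float:
    """Extrapolate a least-squares linear trend steps_ahead samples past the last one."""
    n = len(values)
    if n == 0:
        raise ValueError("cannot forecast an empty stream")
    if n == 1:
        return values[0]
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    slope = cov / var
    return mean_y + slope * ((n - 1 + steps_ahead) - mean_x)


def predict_breaches(streams: dict[str, list[float]],
                     thresholds: dict[str, float],
                     steps_ahead: int = 3) -> list[str]:
    """Names of SLI-backed streams forecast to exceed their threshold."""
    relevant = select_streams(streams, set(thresholds))
    return [
        name
        for name, values in relevant.items()
        if values and forecast_next(values, steps_ahead) > thresholds[name]
    ]


# Assumed sample data: only latency and error rate are tied to SLIs here.
streams = {
    "latency_ms": [100.0, 120.0, 140.0, 160.0],
    "disk_temp_c": [40.0, 41.0, 42.0, 43.0],   # collected, but not tied to an SLI
    "error_rate": [0.01, 0.01, 0.01, 0.01],
}
thresholds = {"latency_ms": 200.0, "error_rate": 0.05}
at_risk = predict_breaches(streams, thresholds)
```

In a more mature setup, each name in at_risk could be mapped to a corrective action (for example, scaling out a service) that runs before the threshold is actually crossed.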

We, at Capgemini, have implemented a version of this solution for a client in the retail industry, using an SRE-driven approach to AIOps across their SAP landscape. Events were tracked using infrastructure monitoring tools such as the SAP Solution Manager (SolMan) to derive correlations across time, and scenarios were created using a configuration management database. The scenarios were studied by subject matter experts to uncover insights and triage situations before they escalated. AIOps auto-resolved some issues based on relevance, applicability, and solution availability.
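As an illustration of how events might be correlated across time into scenarios, the following Python sketch groups events that affect the same business service and occur within a fixed time window of each other. It is an assumption-laden example, not the client implementation: the Event type, the cmdb_service mapping (configuration item to service), and the five-minute window are hypothetical, and no SAP Solution Manager API is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A monitoring event (hypothetical shape)."""
    timestamp: float  # seconds since some reference point
    ci: str           # configuration item that raised the event
    message: str


def correlate(events: list[Event],
              cmdb_service: dict[str, str],
              window_s: float = 300.0) -> list[tuple[str, list[Event]]]:
    """Group events by service; a gap longer than window_s starts a new scenario."""
    scenarios: list[tuple[str, list[Event]]] = []
    open_scenario: dict[str, list[Event]] = {}  # service -> most recent group
    for event in sorted(events, key=lambda e: e.timestamp):
        service = cmdb_service.get(event.ci, "unmapped")
        current = open_scenario.get(service)
        if current and event.timestamp - current[-1].timestamp <= window_s:
            current.append(event)
        else:
            current = [event]
            open_scenario[service] = current
            scenarios.append((service, current))
    return scenarios


# Assumed sample data: two hosts belong to the "orders" service.
cmdb = {"app01": "orders", "db01": "orders", "web01": "storefront"}
events = [
    Event(0.0, "app01", "CPU high"),
    Event(120.0, "db01", "slow query"),
    Event(1000.0, "app01", "CPU high"),
    Event(130.0, "web01", "timeout"),
]
scenarios = correlate(events, cmdb)
```

Grouping by service rather than by host means related alerts from different components surface as a single scenario, which is the kind of unit a subject matter expert could then review and triage.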

The solution resulted in a 96% reduction in the number of day-to-day alerts that required intervention from the monitoring team. In addition, no alerts were missed because of manual error, resulting in an improved quality of service.

If you’d like to know more about an SRE-driven approach to application maintenance services, visit us here, or you can contact me directly.

Aliasgar Muchhala

Global SRE Lead and Global Architects Lead
A strategic, focused, business-oriented leader and Capgemini Level 3 Certified Chief Architect, with an impressive record in architecting and building cutting-edge systems that leverage new-age technologies to enable clients to transform their business, reduce costs, and improve efficiency.