Keep the lights on – The SRE way!

Recently we have seen a sharp rise in the number of clients asking for SRE services. In some cases it was a conscious and explicit ask for SRE services, while in others it was an implicit ask for similar outcomes without calling it SRE. This article, co-authored with Manoj Tharwani, sets out our point of view on what SRE is. It is only a teaser, so feel free to reach out to us if you are looking for more details. At Capgemini, we have invested in building this capability, which enables us to implement the concept at several of our clients.

Introduction
In simple terms, reliability is defined as the probability of success. In the application world, however, reliability is discussed in terms of availability and measured in the context of the frequency of failures. Reliability is important because it can build or erode confidence in a product and in an organization’s brand reputation.

The current IT system landscape typically comprises several moving parts, and a multi-cloud setup adds even more complexity. In such a landscape, a traditional approach based on the philosophy of “prevent the system from failing” doesn’t quite work. With that many moving parts, there is bound to be a disruption somewhere, resulting in failures. The philosophy hence needs to be more like “expect failures to happen; build systems that are resilient to these failures”. That is where the concept of SRE, or Site Reliability Engineering (a.k.a. Service Reliability Engineering), kicks in.
SRE is all about applying a software engineering mindset to system administration. As a software engineer, you look at the business requirements and develop the system. Likewise, an SRE needs to look at how each disruption can affect the business requirement and then find a solution for it accordingly.
An Agile-focused, product-driven approach and IT-OT integration have been key drivers for the growing demand for SRE today.

The concept originated at Google as early as 2003, with a team tasked with keeping Google’s website “as available as possible”. They did that by applying software engineering concepts to system administration topics, which later formed the basic tenets of SRE, as described in the online book published by Google.
As with most enterprise constructs, one does not need to “mimic” the exact methods used by Google. While you need to assess these practices in the context of your enterprise, there are certain basic tenets of SRE that must be followed:

The first step is to agree upon a set of SLIs (Service Level Indicators) and SLOs (Service Level Objectives) so as to know the targets and measures.
SRE accepts failure as normal and manages an “error budget”, which is used to strike a balance between system updates and system stability.
The SREs are part of neither the dev team nor the ops team. SRE needs a separate central team that takes end-to-end (E2E) system accountability spanning apps, infra, backend, frontend, middleware, etc.
Another objective of SRE is “reducing toil”; hence automation needs to be a key focus area.
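
To make the first two tenets concrete, here is a minimal sketch of how an SLI, an SLO and an error budget relate to each other. It is an illustration only, not a description of any particular tool: the class and function names, the request-based availability measure and the 99.9% target are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevelObjective:
    """Hypothetical SLO: the share of requests that must succeed in a window."""
    name: str
    target: float  # assumed example value, e.g. 0.999 for 99.9%


def service_level_indicator(good_requests: int, total_requests: int) -> float:
    """SLI: the measured fraction of successful requests in the window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def error_budget_remaining(slo: ServiceLevelObjective,
                           good_requests: int,
                           total_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo.target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        # A 100% target leaves no budget at all.
        return 0.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


def can_ship_changes(slo: ServiceLevelObjective,
                     good_requests: int,
                     total_requests: int) -> bool:
    """Assumed policy: keep releasing only while some budget is left."""
    return error_budget_remaining(slo, good_requests, total_requests) > 0.0


if __name__ == "__main__":
    checkout = ServiceLevelObjective(name="checkout-availability", target=0.999)
    good, total = 999_500, 1_000_000  # made-up traffic figures
    print(f"SLI: {service_level_indicator(good, total):.4%}")
    print(f"Budget left: {error_budget_remaining(checkout, good, total):.1%}")
    print(f"Ship changes: {can_ship_changes(checkout, good, total)}")
```

In this sketch, the balance between system updates and system stability becomes a simple rule: while budget remains, teams can keep releasing; once it is spent, the focus shifts to stability work. Real policies are usually more nuanced, but the trade-off is the same.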

DevOps vs SRE
One obvious question that often gets raised is about the crossover between SRE and DevOps, and rightly so. There is a significant overlap between the two concepts. Both tend to address the silos between “dev” and “ops”. There are also a lot of parallels in the practices followed. However, the approach and objectives are quite different in each case.

Agility first vs. Reliability first:
The main objective of DevOps is to increase business agility: how do we release new features faster? How do we get our defect fixes into production sooner? This is largely done by cross-pollinating the dev and ops teams so that they share common goals, implemented through tools and automated pipelines that promote code faster to higher environments.
The objective of SRE is to ensure that business agility is not pursued at the cost of the overall reliability of the system. This is typically achieved through a separate central team that governs it.
Failure Tolerance levels:
DevOps looks to ensure there are no failures. SRE, on the other hand, accepts that failures are inevitable; it focuses instead on ensuring continued availability of core business services with minimal impact, through chaos engineering and destructive testing practices.

We have seen several cases in the recent past where, in spite of having a complete DevOps implementation, companies continue to bleed millions of dollars when their core systems go down. SRE will help plug that gap!

We believe SRE is not something different from DevOps; we see it as a natural evolution of the DevOps maturity model, as depicted in the graphic below:

Another aspect worth mentioning is the scope of SRE. While the DevOps concept focuses on bridging the gap between “development” and “operations” teams, SRE extends that further by bringing in a focus on architecture as well. This ensures that resiliency is built into the system by design, so that it can quickly react to and recover from unexpected disruptions.

What would it really take to be an SRE?
SREs are expected to cover the entire spectrum of IT systems. The role combines deep awareness of technical infrastructure, operating systems and computer networking with attention to higher-level service level objectives (SLOs), so as to maintain a focus on business-relevant activities.
They focus on solving problems by building software components or features that prevent the problem from recurring (or, failing that, at least make it less painful). It is thus often recommended that SREs come from a software engineering background with some awareness of operations, rather than the other way around.
The following technical skills are required for this role:

Cloud / Infrastructure
Software Engineering
Architecture
Monitoring & Support
Testing / Chaos Engineering
Automation
DevOps

Additionally, non-technical skills such as problem-solving, teamwork, working under pressure, and strong written and verbal communication are key to their success.

Conclusion
Every enterprise, large or small, has multiple applications under development and multiple code deployments (and re-deployments) at any given point in time. A lot of these enterprises have a mix of legacy and modern applications, supported by separate Development and Operations teams. While DevOps ensures a smooth, automated, pipeline-driven approach to these deployments, there also needs to be a dedicated focus on ensuring the availability of end-to-end business functions. This is exactly what SRE brings to the forefront, and why it is garnering a lot of interest in the industry.

Enterprises have been working on reliability in some shape or form in the past. Hence the first step of the SRE journey should be to look at the applications holistically from an end customer’s perspective, take stock of what already exists and determine the gaps in service reliability. This gives an organization a view of where it stands on reliability and what needs to be addressed immediately, and helps it plan accordingly.
At Capgemini, we help enterprises with a “Reliability Assessment” that gives them a view of where they stand on reliability, which aspects need to be addressed immediately and what the steady state of service should look like.
Capgemini also supports enterprises in addressing critical issues impacting reliability through the “SRE Jumpstart” offering. We work to improve availability and reduce outages across applications by addressing performance bottlenecks. We define the necessary processes and build the tools required for service reliability, focus on automation and bring applications to a steady state of service maintainability.

For more details on our SRE service offerings, please feel free to reach out to me or Clifton Menezes and we will be happy to collaborate with you in your journey towards keeping the lights on, the SRE way!

March 27, 2020

Aliasgar Muchhala

Expert in Cloud, Strategy and Transformation