Site Reliability Engineering: DevOps thought out to the end

Andreas Lutz
August 6, 2020

When DevOps is introduced to an organization, those involved, including specialist departments, development, and operations, are likely to be in favor of the changes and will often support them. In reality, however, the differences from the previous ways of working are often insignificant. The teams might organize a meeting to hand over the code from development to operations, or they might even create a joint go-live checklist, but it rarely goes beyond this.

An additional DevOps team is actually an antipattern to DevOps. Instead of bringing development and operations together, it adds a third silo. The classic operations business remains separate – and reaches its limits, especially in times of agile, microservices, and ever smaller, high-frequency deployments. This is where we must challenge our original assumptions.

What is BizDevOps?

First off, DevOps is a culture. The name is a blend of “development” and “operations”, because the two areas should work in cooperation rather than in silos. BizDevOps extends the principle to the business: representatives from the business side, development, and operations work together in one team. Finally, the Site Reliability Engineering model ensures that this culture is fully implemented.

It is important to consistently restructure the teams according to products and to avoid the old mistakes in DevOps implementation. In a number of cases, companies have tried implementing DevOps with only this principle in mind, either by keeping the existing team structures or by creating an additional DevOps team to try and “plug the gap”. Unfortunately, neither will work.

The product determines the organization

A forward-looking vision is that of a product-centric organization with product-oriented teams. Such an organization no longer divides itself into silos according to capabilities, nor is it evaluated along those lines. Instead of a separate business side, development, and operations, it forms one cross-functional team per product, where a product could be a microservice or a combination of microservices. Each of these teams includes product owners, architects, developers, and colleagues from operations.

This team structure makes it easier for all members of a product team to pursue common goals instead of competing goals for each of the siloed capabilities. The most important team goal is the success of the product, and this shared goal encourages collaboration along the entire value chain. But for a BizDevOps product team to reach its full potential, site reliability engineering is required.

The model for the product-centric organization

Companies can realize the vision of a product-centric organization with the site reliability engineering (SRE) model, because it implements DevOps and the “one team” idea. The model consists of two components: the site reliability engineer and the self-regulating system – and it is important to implement both. The danger of a half implementation is comparable to that of a half-implemented agile approach: if you work in sprints but only deploy once a year, you are not agile.

  1. The site reliability engineer

The site reliability engineer (SRE) has two tasks. A maximum of 50% of their working time should be spent in firefighting mode, ensuring stable operations. The remaining time, at least 50%, should be devoted to engineering tasks such as automating recurring manual tasks. The SRE is a highly experienced and talented coder who is excited by the idea of removing any redundant operations work. SREs are familiar with operations and the cloud, are masters of actionable monitoring systems, understand architectural concepts, and know how to fix code. SREs with these skills help to reduce the workload in the long run. They are also important advisors for the product owner (PO), architect, and developers in terms of service stability, monitoring, performance, and other non-functional requirements.

  2. The self-regulating system

Even a product-oriented team can lose sight of reliability while it focuses on elaborately designed and developed features. A self-regulating system ensures that this crucial point stays in focus. It works with service-level objectives (SLOs), an error budget, and automatic consequences. An SLO is a concrete goal, for example, 99% availability of the service. Derived from this, the error budget is the remaining 1%, which the team can spend as it sees fit, for example on chaos engineering. However, the error budget is best spent on releases, since changes are usually accompanied by errors. If a goal is not achieved and the error budget is exceeded, automatic consequences come into effect. These can take different forms: a classic consequence is a stop on feature deployments until the service is back within budget. These goals apply to the whole product team, not just to the SREs or operations.
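The relationship between SLO, error budget, and automatic consequence can be sketched in a few lines of Python. This is only an illustration under stated assumptions: availability is measured as the share of successful requests in an evaluation window, and the names `ErrorBudget` and `may_deploy_features` are hypothetical, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget derived from an availability SLO (hypothetical helper).

    Assumption: availability is counted per request within one window.
    """

    slo: float            # target availability, e.g. 0.99 for 99%
    total_requests: int   # requests served in the evaluation window
    failed_requests: int  # requests that failed in the same window

    @property
    def allowed_failures(self) -> float:
        # A 99% SLO leaves 1% of all requests as the error budget.
        return (1.0 - self.slo) * self.total_requests

    @property
    def remaining(self) -> float:
        # Budget still available; negative once the budget is exceeded.
        return self.allowed_failures - self.failed_requests

    @property
    def exhausted(self) -> bool:
        return self.remaining < 0


def may_deploy_features(budget: ErrorBudget) -> bool:
    """Automatic consequence: feature deployments stop once the budget is exceeded."""
    return not budget.exhausted


if __name__ == "__main__":
    window = ErrorBudget(slo=0.99, total_requests=100_000, failed_requests=1_200)
    print("feature deployments allowed:", may_deploy_features(window))
```

In practice such a check would typically run as a gate in the deployment pipeline, so that the consequence really is automatic and applies to the whole product team rather than depending on a case-by-case decision.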

Tailored for success

As simple as the model appears, its implementation is complex. It is not enough to employ a few very experienced and talented SREs and let them take responsibility for operations. It is also important that they adhere to the max-50/min-50 principle (at most 50% firefighting, at least 50% engineering) and implement SLOs, error budgets, and consequences. This transformation is as complex as the agile transformation – and worth the effort. Products become more reliable, time to market is shortened, and everyone involved, from the team to the end users, enjoys rapid development and higher service availability.

But there is no blueprint that every organization can simply reuse. Success, with the support of experts, comes with a tailor-made concept that fits the needs of the organization.

Andreas Lutz is a Senior Enterprise Architect. He leads the Capgemini i*Gov Innovation Lab – an innovation lab for the public sector. You can reach out to him here.