Success in the world of site reliability engineering (SRE) can mean many things. Foremost, we must ask ourselves the following questions: Is our website up? Is it performant? Does it satisfy the needs of our clients? Does it deliver the best return on investment (ROI) for the company? Having a successful site means not only being familiar with these questions but also being able to provide solid answers.
In the past, we examined these issues and explained not only why they are important, but also how each will help us achieve success. Today we are going to look in the other direction: we are going to see what tools are required to keep a site functioning at peak efficiency, how to deal with emergency issues, and, most importantly, how to avoid the staff burnout that comes from doing so incorrectly.
We have our site up, we’ve tested it, it performs perfectly, and our customers are happy, but we can’t stop there. Successful enterprise engagements are based not only on how well things are working but also on how quickly things can be brought back online in an emergency.
In order to triage emergencies quickly, we must have “eyes on” as many of the operational aspects of the site as possible. We achieve this through a monitoring framework that we can tailor to our specific needs. On its own, though, a monitoring tool still requires someone to watch it 24/7. We can ease that burden by layering an alerting framework on top of our monitoring.
WHAT IS ALERT TECHNOLOGY?
SRE teams rarely have the luxury of being able to staff around the clock, so it’s necessary to augment them with a “virtual first level” support team in the form of an alerting infrastructure. While the monitoring environment looks for operational issues, the alert technology tools take these detectable triggers and, through some significant design decisions, determine the best way to alert the team to any issues while ensuring the quickest response. The severity and impact of problems are taken into consideration when the alert technology infrastructure is designed.
A popular quote attributed to Lewis Carroll says, “If you don’t know where you are going, any road will get you there”. The quote is probably more of a misquote, but in the case of alert design, it’s spot on. If you don’t know what you are monitoring, then any monitoring will be better than doing nothing.
That statement is true, but it is also a fool’s strategy if you want to monitor properly, and it is a situation in which you can easily find yourself. The key to designing a proper alerting framework is to first find a monitoring tool that gives you the granularity you need, then define your most important service level objectives (SLOs), and finally design alerts based on those SLOs so that they provide the triggers necessary to keep your environment working smoothly.
Our goal is to be alerted on significant events that take the greatest amount of time and effort to address. In SRE terms, we want to be able to identify and mitigate events that consume a large part of our “error budget.”
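To make the idea of an error budget concrete, here is a minimal sketch of the arithmetic. It assumes an availability SLO expressed as a fraction and a 30-day window; the function names and the 99.9% figure are illustrative choices, not something prescribed by any particular tool.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_consumed(downtime_minutes: float, slo_target: float,
                    window_days: int = 30) -> float:
    """Fraction of the error budget used by an incident (1.0 means all of it)."""
    return downtime_minutes / error_budget_minutes(slo_target, window_days)


if __name__ == "__main__":
    # Hypothetical example: a 99.9% SLO over 30 days, and a 20-minute outage.
    budget = error_budget_minutes(0.999)
    print(f"budget: {budget:.1f} minutes")
    print(f"a 20-minute outage uses {budget_consumed(20, 0.999):.0%} of it")
```

Under these assumptions the budget works out to about 43 minutes a month, so a single 20-minute outage consumes close to half of it. That is the kind of event we want an alert for.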
The modern web is a very complex place. It’s complex in the hardware, in the infrastructure, and in the software that make up each and every site. If you don’t think about it beyond the term “operational complexity,” it just sounds hard. Designing a monitoring and alert technology infrastructure for such an environment can be a daunting task. We can appreciate how difficult it is when we consider a simple mathematical reality: complexity rises much faster than the number of functional elements and processes in a webstack, because every element can interact with every other one (n elements already give n(n-1)/2 possible pairs).
In a black-and-white world, our operational complexity would be simple. Let’s say you have 100 components in your webstack and you know the mean time between failures (MTBF) for each. In this case, the site reliability calculations become simple and your concerns could be easily minimized. Sadly, this is not a realistic view of the world.
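The black-and-white calculation can be sketched in a few lines. This is only an illustration under strong assumptions: the components fail independently, each has a constant failure rate, and the loss of any single component takes the site down. The 10,000-hour MTBF is a made-up figure.

```python
def series_mtbf_hours(component_mtbf_hours):
    """System MTBF when any single component failure takes the site down.

    Assumes independent components with constant failure rates, so the
    individual failure rates (1 / MTBF) simply add up.
    """
    if not component_mtbf_hours:
        raise ValueError("need at least one component")
    if any(m <= 0 for m in component_mtbf_hours):
        raise ValueError("every MTBF must be positive")
    total_failure_rate = sum(1.0 / m for m in component_mtbf_hours)
    return 1.0 / total_failure_rate


if __name__ == "__main__":
    # Hypothetical stack: 100 components, each with a 10,000-hour MTBF.
    components = [10_000.0] * 100
    print(f"system MTBF: {series_mtbf_hours(components):.0f} hours")
```

Even in this idealized model, 100 components that each fail about once in 10,000 hours give a stack that fails roughly once every 100 hours, or about every four days.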
In the real world, we have many failure modes that are hard to identify and calculate. The simplest issues and the ones that are most black and white are outright element failures. Your load-balancer dies, you lose power to the data center, you forget to pay your network bills, etc.
These failures are impactful, small in number, and, from a monitoring standpoint, addressable through go/no-go testing. Operational complexity is amplified beyond these base failures when you consider that each server instance in your stack can encounter catastrophic failures, each line of code can expose an issue, your network infrastructure could have too many collisions, your CDN could have issues (like Cloudflare, recently), you could be dealing with bad actors attacking your site or your infrastructure, and the list goes on.
The challenge in monitoring and in a proper alert technology strategy doesn’t necessarily come from systems that fail outright, but from components whose performance degrades over time. The key is to identify these issues, determine a course correction, and then apply it before things fail completely. You first have to identify the condition, though, before you can understand it and develop a successful remediation strategy.
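One rough way to catch gradual degradation is to fit a trend line to recent measurements and estimate when it will cross a limit. The sketch below assumes evenly spaced latency samples and a fixed threshold. The function names, the sample values, and the 300 ms limit are hypothetical, and a real monitoring tool would likely offer something more robust.

```python
import math


def slope_per_sample(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def samples_until_breach(values, threshold):
    """Estimate how many more samples until the trend reaches the threshold.

    Returns 0 if the latest value already breaches it, and None if the
    trend is flat or improving. The estimate extrapolates linearly from
    the latest value, so treat it as an early warning, not a forecast.
    """
    if values[-1] >= threshold:
        return 0
    slope = slope_per_sample(values)
    if slope <= 0:
        return None
    return math.ceil((threshold - values[-1]) / slope)


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds, one per hour.
    latencies = [200, 210, 220, 230, 240]
    print(samples_until_breach(latencies, threshold=300))
```

An alert built on a check like this fires while the component is still working, which leaves time to apply a correction before the degradation becomes an outage.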