Testing strategies to build resilience in a chaotic world

Capgemini

2018-07-31

“Resilience or hardiness is the ability to adapt to new circumstances when life presents the unpredictable” – Salvatore R Maddi

In today’s world, system downtime is not an option. If a user can’t access an application once, chances are that they will never use it again. Resiliency, which in simple terms is the ability of a system to gracefully handle and recover from failures, thus becomes critical. Testing resiliency verifies the system’s ability to absorb the impact of a problem while continuing to provide an acceptable level of service to the business. In other words, to test resiliency, introduce a failure and confirm that the system recovers gracefully. This concept was originally introduced by Netflix in the Principles of Chaos Engineering.

Netflix has defined the discipline as follows: chaos engineering is the discipline of experimenting on a distributed system to build confidence in the system’s capability to withstand turbulent conditions in production. This idea, which has proven to be very successful for Netflix, is now being adopted across industries. Designing tests that inject failures and validate recovery requires the test professional to understand the architecture, design, and infrastructure of the systems under test.

To build your test strategies for resilient systems, you should:

Conduct a failure mode analysis by reviewing the design of the system. In simple terms, this means identifying all the components and the internal and external interfaces, and then identifying potential failures at every point. Once failure points are identified, validate that there is an alternative for each. For example, in a service-based architecture, an application that depends on a single critical instance of a service has a single point of failure. In this scenario, verify that an alternative is available when a request times out.
Validate data resiliency, i.e. that there is a mechanism for data to be available to applications if the system that originally hosted the data fails. Verify that the data backup process is either documented or automated. If automated, validate that the automated script backs up data correctly, maintaining integrity and schema.
From an infrastructure standpoint, configure and test health probes for load balancing and traffic management. These help route traffic away from unhealthy or slow instances and ensure that the system is not limited to a single region for deployment in case of latency issues.
From an application standpoint, conduct fault injection tests for every application in your system. Scenarios include shutting down interfacing systems, deleting certificates, consuming system resources, and deleting data sources.
Conduct critical tests in production with well-planned canary deployments. Validate that there is an automated rollback mechanism for code in production in case of failure.
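
As an illustration of the first and fourth points, a fault injection test can be very small. The sketch below is a minimal example under stated assumptions, not a prescribed implementation: the names ServiceTimeout, fetch_with_fallback, healthy_primary, failing_primary, and secondary are all hypothetical stand-ins for a real service client and its alternative. It injects a timeout into the primary dependency and checks that the alternative serves the request.

```python
from typing import Callable, Dict


class ServiceTimeout(Exception):
    """Hypothetical error raised when a request to a service times out."""


def fetch_with_fallback(primary: Callable[[], str],
                        fallback: Callable[[], str]) -> str:
    """Call the primary service; if it times out, use the alternative."""
    try:
        return primary()
    except ServiceTimeout:
        return fallback()


def healthy_primary() -> str:
    # Stand-in for the critical service instance responding normally.
    return "primary"


def failing_primary() -> str:
    # Injected fault: simulate the critical instance timing out.
    raise ServiceTimeout("primary instance did not respond")


def secondary() -> str:
    # Stand-in for the alternative instance.
    return "secondary"


def run_resilience_checks() -> Dict[str, str]:
    """Run a baseline call and a fault-injected call.

    Returns which stand-in service answered in each scenario.
    """
    return {
        "baseline": fetch_with_fallback(healthy_primary, secondary),
        "fault_injected": fetch_with_fallback(failing_primary, secondary),
    }
```

Calling run_resilience_checks() exercises both paths. In a real system the injected fault would more likely be a stopped interfacing system or a blocked network route than a raised exception, but the shape of the test is the same: establish the normal behavior, introduce the failure, and assert that the alternative takes over.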

Above all, the key to testing resiliency is to keep learning the design, architecture, and infrastructure of your systems. The more you learn, the better you understand the points of failure and the better you test.