Testing strategies to build resilience in a chaotic world

Resiliency is the ability of a system to handle and recover from failures, and resiliency testing verifies that ability. Read on to learn how you should build your software testing strategies.

“Resilience or hardiness is the ability to adapt to new circumstances when life presents the unpredictable” – Salvatore R Maddi

In today’s world, system downtime is not an option. If a user can’t access an application once, chances are that they will never use it again. Resiliency, which in simple terms is the ability of a system to gracefully handle and recover from failures, thus becomes critical. Resiliency testing verifies that the system can absorb the impact of a problem while continuing to provide an acceptable level of service to the business. In other words, to test resiliency, deliberately introduce a failure and verify that the system recovers gracefully. This concept was originally introduced by Netflix in the Principles of Chaos Engineering.

Netflix defines the discipline as follows: chaos engineering is the discipline of experimenting on a distributed system to build confidence in the system’s capability to withstand turbulent conditions in production. This idea, which has proven to be very successful for Netflix, is now being adopted across industries. Designing tests that trigger failures and validate recovery requires the test professional to understand the architecture, design, and infrastructure of the systems under test.
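
To make the idea of "introduce a failure and verify recovery" concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: FlakyService stands in for a real dependency, and fetch_with_retry is an assumed recovery mechanism (a simple bounded retry), not something prescribed by Netflix or the Principles of Chaos Engineering.

```python
class FlakyService:
    """Hypothetical dependency that fails for the first `failures` calls."""

    def __init__(self, failures):
        self.remaining_failures = failures
        self.calls = 0

    def fetch(self):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("injected fault")
        return "ok"


def fetch_with_retry(service, attempts=3):
    """Assumed recovery mechanism: retry up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return service.fetch()
        except ConnectionError as exc:
            last_error = exc
    # Every attempt failed, so surface the last error to the caller.
    raise last_error


def test_recovers_from_transient_fault():
    # Inject two failures; the third attempt should succeed.
    service = FlakyService(failures=2)
    assert fetch_with_retry(service, attempts=3) == "ok"
    assert service.calls == 3


def test_gives_up_when_fault_persists():
    # Inject more failures than the retry budget allows.
    service = FlakyService(failures=5)
    try:
        fetch_with_retry(service, attempts=3)
    except ConnectionError:
        pass
    else:
        raise AssertionError("expected ConnectionError")


if __name__ == "__main__":
    test_recovers_from_transient_fault()
    test_gives_up_when_fault_persists()
    print("resiliency checks passed")
```

The two tests cover both sides of the behaviour: the system recovers when the fault is transient, and it fails loudly rather than hanging when the fault persists.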

To build your test strategies for resilient systems, you should:

  1. Conduct a failure mode analysis by reviewing the design of the system. In simple terms, this means identifying all the components and their internal and external interfaces, then identifying the potential failures at every point. Once failure points are identified, validate that each one has a fallback. For example, in a service-based architecture, an application that depends on a single critical instance of a service has a single point of failure. In this scenario, verify that when a request times out, an alternative is available.
  2. Validate data resiliency, i.e. that there is a mechanism for data to be available to applications if the system that originally hosted the data fails. Verify that the data backup process is either documented or automated. If automated, validate that the automated script backs up data correctly, maintaining integrity and schema.
  3. From an infrastructure standpoint, configure and test health probes for load balancing and traffic management. These help ensure that traffic is routed away from unhealthy instances and that the deployment is not limited to a single region when latency issues arise.
  4. From an application standpoint, conduct fault injection tests for every application in your system. Scenarios include shutting down interfacing systems, deleting certificates, consuming system resources, and deleting data sources.
  5. Conduct critical tests in production with well-planned canary deployments. Validate that there is an automated rollback mechanism for code in production in case of failure.
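
The single-point-of-failure check in step 1 can be sketched as a small Python test. This is an illustration under stated assumptions: call_with_fallback, primary, and secondary are hypothetical names, and the timeout is simulated by raising TimeoutError rather than by a real network call.

```python
def call_with_fallback(instances, request):
    """Try each service instance in order and return the first success."""
    errors = []
    for instance in instances:
        try:
            return instance(request)
        except (TimeoutError, ConnectionError) as exc:
            # Record the failure and move on to the next instance.
            errors.append(exc)
    raise RuntimeError(f"all {len(instances)} instances failed: {errors}")


def primary(request):
    # Simulated failure mode: the critical instance times out.
    raise TimeoutError("primary timed out")


def secondary(request):
    # Hypothetical alternative instance that serves the request.
    return f"secondary handled {request}"


def test_alternative_is_used_on_timeout():
    response = call_with_fallback([primary, secondary], "GET /orders")
    assert response == "secondary handled GET /orders"


def test_single_instance_is_a_single_point_of_failure():
    # With no alternative configured, the timeout reaches the caller.
    try:
        call_with_fallback([primary], "GET /orders")
    except RuntimeError:
        pass
    else:
        raise AssertionError("expected RuntimeError")


if __name__ == "__main__":
    test_alternative_is_used_on_timeout()
    test_single_instance_is_a_single_point_of_failure()
    print("fallback checks passed")
```

The second test documents the failure mode itself: if it is the only one that passes against a real configuration, the analysis has found a single point of failure.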

Above all, the key to testing resiliency is continuous learning about the design, architecture, and infrastructure of systems. The more you learn, the better you understand the points of failure and the better you test.
