Chaos Engineering


Last updated on
28th Apr, 2022
02nd Feb, 2021

The fourth industrial revolution has swept the world. In just under a decade, our lives have become heavily dependent on technology. The internet has made the world a smaller place, and every day more industries move online. But the technology is still new, and both emerging and developed economies are still working to perfect the infrastructure and ecosystem needed to run these businesses online. This uncertainty makes failures more likely.

We regularly come across headlines such as "Customers report difficulty in accessing bank mobile and online", "Bank Website down, not working" and "Service Unavailable", and such outages occur with regular frequency.

These outages often occur in complex, distributed systems, where several things fail at the same time and compound the problem. Finding and fixing the bugs can take anywhere from a couple of minutes to hours depending on the system architecture, costing the company not only revenue but also customer trust.

Systems are built to handle individual failures, but in large, chaotic systems the failure of one system or process may cascade into a severe outage. The term Microservice Death Star refers to a poorly designed architecture of highly interdependent, complex systems that are slow and inflexible, and that can blow up and fail.

[Image: Structure of microservices at Amazon]

In the old world, our systems were simpler because of monolithic architecture. It was easy to debug errors and fix them. Code changes were shipped once a quarter or half-yearly. Today, architecture has changed a lot with the migration to the cloud, where innovation and speed of execution have become part of our systems. Systems now change not in weeks and days but in minutes and hours.

Cloud-based and microservice architectures have given us a lot of advantages, but they come with complexity and chaos that can cause failures. It is an engineer’s responsibility to make the system as reliable as it can be.

Netflix's way of dealing with such systems has taught us a better approach and given birth to a new discipline, "Chaos Engineering". Let's discuss it in more detail below.

Chaos Engineering and its Need:

As defined by a Netflix engineer:

"Chaos engineering is the discipline of experimenting on a software system in production to build confidence in the system's capability to withstand turbulent and unexpected conditions" 

Reference Link.

Chaos engineering is the process of testing a software system by introducing disruptive events, such as server outages or API throttling. In this process, we inject failure scenarios and faults to test the system’s ability to survive unstable and unexpected conditions.

It also helps teams simulate the real-world conditions needed to uncover hidden issues, monitoring blind spots, and performance bottlenecks that are difficult to find in distributed systems. This makes it quite effective at preventing downtime and production outages before they occur.

The Need for Chaos Engineering: 

How does it benefit? 

  • Implementing Chaos Engineering improves the resilience of a system.
  • By designing and executing Chaos Engineering experiments, we learn about weaknesses in the system that could lead to outages, which in turn could cost us customers. This helps improve incident response.
  • It improves our understanding of the risks to the system by exposing threats.

Principles of Chaos Engineering: 

The term Chaos Engineering was coined by engineers at Netflix. Chaos Engineering experiments are designed based on the following four principles:


  1. Define the system’s normal behaviour: First, define the steady state of the system in terms of measurable outputs that indicate normal behaviour.
  2. Create a hypothesis: As in any experiment, we need a hypothesis to compare against a stable control group. If there is a reasonable expectation that a particular action will change the steady state of the system, then the first thing to do is to fix the system so that it accommodates that action.
  3. Apply real-world events: Design and create experiments by introducing real-world events such as terminated servers, network failures, latency, dependency failures, memory malfunctions, etc.
  4. Observe results: Compare the steady-state metrics with the system's metrics after introducing the disturbance. For monitoring we can use CloudWatch, Kibana, Splunk, or any other tool that is already part of the system architecture. If the results differ, the difference can be used to anticipate future incidents and make improvements. If there is no difference, the experiment gives team members a higher degree of trust and confidence in the application.
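The four principles above can be sketched as a small experiment loop. This is only an illustrative sketch, not Netflix's tooling: `ToyService`, `run_experiment`, the error-rate metric, and the `tolerance` threshold are all hypothetical names and choices made for this example. In a real setup the metric would come from a monitoring tool and the fault would hit real infrastructure.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    baseline_error_rate: float
    chaos_error_rate: float
    hypothesis_holds: bool


def error_rate(send_request, attempts):
    """Steady-state metric: fraction of requests that fail."""
    failures = sum(0 if send_request() else 1 for _ in range(attempts))
    return failures / attempts


def run_experiment(send_request, inject_fault, remove_fault,
                   tolerance=0.01, attempts=200):
    # 1. Define normal behaviour: measure the metric before any disturbance.
    baseline = error_rate(send_request, attempts)
    # 2. Hypothesis: under the fault, the error rate stays within
    #    `tolerance` of the baseline.
    # 3. Apply a real-world event, and always undo it afterwards.
    inject_fault()
    try:
        observed = error_rate(send_request, attempts)
    finally:
        remove_fault()
    # 4. Observe results: compare the disturbed metric with the steady state.
    return ExperimentResult(baseline, observed, observed - baseline <= tolerance)


class ToyService:
    """Hypothetical stand-in: a request succeeds if any replica is up."""

    def __init__(self, replicas):
        self.up = [True] * replicas

    def handle(self):
        return any(self.up)

    def kill_one(self):
        self.up[0] = False

    def restore(self):
        self.up = [True] * len(self.up)


redundant = ToyService(replicas=2)
single = ToyService(replicas=1)
redundant_result = run_experiment(redundant.handle, redundant.kill_one, redundant.restore)
single_result = run_experiment(single.handle, single.kill_one, single.restore)
```

In this toy setup the two-replica service keeps its steady state when one replica is killed, while the single-replica service does not, which is the kind of weakness an experiment is meant to surface.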

Difference Between Chaos Engineering and Testing:

When we develop an application, we pass it through various tests that include Unit Tests, Integration Tests, and System Tests. 

With unit testing, we write a unit test case and check the expected behaviour of a component in isolation from all external components, whereas integration testing checks the interaction of individual, inter-dependent components. But even extensive testing cannot guarantee an error-free system, because these tests examine only pre-defined, single scenarios. Their results do not reveal new information about the application's behaviour, performance, or properties. This uncertainty increases with microservice architectures, where the system keeps growing over time.

Chaos engineering, by contrast, explores a wide range of unpredictable outcomes by experimenting on a distributed architecture, in order to build confidence in the system’s capability to withstand turbulent conditions in production. Chaos testing is the deliberate introduction of failures and faulty scenarios into our system to understand how it will react and what the side effects could be. This type of testing is an effective way to prevent or minimize outages before they impact the system and ultimately the business.

Chaos Engineering Examples 

There are many chaos experiments that we can run against our system; the choice mainly depends on our goals and system architecture.

Below is a list of the most common chaos tests: 

  • Simulating the failure of a micro-component and dependency. 
  • Simulating a high CPU load and sudden increase in traffic. 
  • Simulating the failure of an entire AZ (Availability Zone) or region.
  • Injecting latency and byzantine failures in services.
  • Exhausting memory on instances (cloud services) and allowing fault injection.
  • Causing Host Failure. 
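Two of the experiments listed above, dependency failure and injected latency, can be illustrated with a small wrapper around a function call. This is a minimal sketch under stated assumptions: the `chaos` decorator, `InjectedFault`, `call_with_fallback`, and `fetch_recommendations` are hypothetical names invented for this example and do not belong to any chaos engineering library. The `sleep` parameter is injectable so the example can record delays instead of actually waiting.

```python
import functools
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real dependency failure."""


def chaos(failure_rate=0.0, latency_seconds=0.0, rng=None, sleep=time.sleep):
    """Wrap a function so calls are delayed and fail with a given probability."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_seconds > 0:
                sleep(latency_seconds)  # injected latency
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


def call_with_fallback(func, fallback):
    """The behaviour under test: degrade gracefully when the dependency fails."""
    try:
        return func()
    except InjectedFault:
        return fallback


recorded_delays = []  # stands in for real waiting in this demonstration


@chaos(failure_rate=1.0, latency_seconds=0.5, sleep=recorded_delays.append)
def fetch_recommendations():
    return ["personalised"]


result = call_with_fallback(fetch_recommendations, fallback=["popular"])
```

With the failure rate set to 1.0 every call fails, so the caller should end up with the fallback value; a lower rate would exercise the same path only some of the time.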

List of Tools Developed by Netflix: 

The Netflix team has created a suite of tools that support chaos engineering principles and named it the Simian Army. The tools constantly test the reliability, security, and resiliency of Netflix's Amazon Web Services infrastructure.

  • Chaos Monkey: A tool used to test the resilience of the system. It works by disabling one production system and observing how the remaining systems respond to the outage. It tests system stability by forcing failures and then checking the system's response.

The name "Chaos Monkey" is explained in the book Chaos Monkeys by Antonio Garcia Martinez:

"Imagine a monkey entering a 'data centre', these 'farms' of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices, and returns everything that passes by the hand [i.e. flings excrement]. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy." 

Reference link.

  • Latency Monkey: Tests the fault tolerance of a service by creating communication delays that simulate network outages.
  • Doctor Monkey: Checks the health status of instances along with related health signals, e.g. CPU load, to detect unhealthy instances and eventually fix them.
  • Conformity Monkey: Finds instances that do not adhere to best practices, as defined by a set of rules, and sends an email notification to the owner of the instance.
  • Janitor Monkey: Ensures the cloud service runs free of unused resources and clutter, and disposes of any waste.
  • Security Monkey: An extension of Conformity Monkey. It finds security violations or vulnerabilities, such as improperly configured AWS security groups, and terminates the offending instances.
  • Chaos Gorilla: Similar to Chaos Monkey, but takes down a full Availability Zone while testing.
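The difference between Chaos Monkey and Chaos Gorilla can be illustrated with a toy simulation. This is not Netflix's implementation and does not touch any cloud API: `Instance`, `Cluster`, `chaos_monkey`, and `chaos_gorilla` are hypothetical names for this sketch, and the instances are plain in-memory objects.

```python
import random


class Instance:
    def __init__(self, name, zone):
        self.name = name
        self.zone = zone
        self.running = True


class Cluster:
    def __init__(self, instances):
        self.instances = list(instances)

    def healthy(self):
        return [i for i in self.instances if i.running]

    def serve(self):
        """Route a request to any running instance, or fail if none is left."""
        alive = self.healthy()
        if not alive:
            raise RuntimeError("no healthy instances")
        return alive[0].name


def chaos_monkey(cluster, rng):
    """Terminate one randomly chosen running instance (None if none are left)."""
    candidates = cluster.healthy()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.running = False
    return victim


def chaos_gorilla(cluster, zone):
    """Terminate every running instance in one availability zone."""
    victims = [i for i in cluster.healthy() if i.zone == zone]
    for instance in victims:
        instance.running = False
    return victims


cluster = Cluster([
    Instance("web-1", "zone-a"),
    Instance("web-2", "zone-a"),
    Instance("web-3", "zone-b"),
    Instance("web-4", "zone-b"),
])

victim = chaos_monkey(cluster, random.Random(7))  # lose one instance
healthy_after_monkey = len(cluster.healthy())
chaos_gorilla(cluster, "zone-a")                  # lose a whole zone
served_by = cluster.serve()                       # zone-b should still answer
```

Because the toy cluster is spread across two zones, it keeps serving after both a single-instance failure and the loss of zone-a; taking down zone-b as well would make `serve()` raise, which is the outage the experiment is designed to reveal in advance.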

Chaos Engineering and DevOps: 

When it comes to DevOps and running the SDLC, applying chaos principles helps us understand how well the system holds up against failure, which in turn helps reduce incidents in production.

There are scenarios where we need to deploy software to an environment quickly. In such distributed, continuously changing, and complex development setups, chaos engineering helps us find unexpected failures.


Advantages of Chaos Engineering:

  • Insights gained from chaos testing can reduce future production incidents.
  • Through Chaos Engineering, the team can verify how the system behaves on failure and take action accordingly.
  • Chaos Engineering helps test the team's response to an incident. It also helps verify that a raised alert reaches the correct team.
  • At a high level, Chaos Engineering improves overall system availability. Chaos experiments make the system more resilient to failures.
  • Production outages can cause huge losses to companies, depending on how heavily the system is used, so chaos engineering helps prevent large losses in revenue.
  • It improves the confidence and engagement of team members in carrying out disaster recovery methods and makes applications highly reliable.


Disadvantages of Chaos Engineering:

  • Implementing Chaos Monkey for a large-scale system and running experiments can increase costs.
  • Carelessness or incorrect steps in the design and implementation of an experiment can impact the application and, in turn, the customer.
  • Chaos Monkey does not provide an interface to track and monitor experiments. It runs through scripts and configuration files.
  • It does not support all kinds of deployment.


In today's Software Development Lifecycle, chaos engineering has become a valuable tool that helps organizations not only improve the resiliency, flexibility, and velocity of their systems, but also operate distributed systems. Along with these benefits, it lets us remediate issues before they impact the system. Implementing Chaos Engineering is important, and it should be adopted for better outcomes.

In this article, we have given a brief overview of chaos engineering and shown how it can provide new insights into a system.

Hope this article has provided you with valuable insights about chaos engineering. This is an extensive field and there is a lot more to learn about it.   


Kanav Preet


Kanav works as an SRE at a leading fintech firm, with experience in CI/CD pipelines, cloud, automation, build, release, and deployment. She is passionate about leveraging technology to build innovative and effective software solutions. Her insight, passion, and energy have earned her a strong clientele who move ahead with her ideas. She has completed various certifications, including Continuous Delivery & DevOps (University of Virginia), Tableau, Linux (Linux Foundation), and many more.