The fourth industrial revolution has swept the world. In under a decade, our lives have become heavily dependent on technology. The internet has made the world a smaller place, and every day more industries move their business online. But the technology is still young, and both emerging and developed economies are still working to perfect the infrastructure and ecosystem needed to run these businesses online. This immaturity makes failure more prevalent.
We regularly come across headlines such as "Customers report difficulty in accessing bank mobile and online", "Bank Website down, not working", and "Service Unavailable", and such outages occur with worrying regularity.
These outages often occur in complex, distributed systems, where several things frequently fail at the same time and compound the problem. Finding and fixing the bugs can take anywhere from a few minutes to several hours depending on the system architecture, costing the company not only revenue but also customer trust.
Systems are usually built to handle individual failures, but in large, chaotic systems the failure of individual systems or processes can lead to severe outages. The term "Microservice Death Star" refers to a poorly designed architecture made up of complex, highly interdependent systems that are slow and inflexible and that can blow up and fail.
In the old world, systems were simpler because of their monolithic architecture. Errors were easy to debug and fix, and code changes shipped once a quarter or once every six months. Today, architecture has changed a great deal with the migration to the cloud, where innovation and speed of execution have become part of our systems. Systems now change not over weeks and days but over minutes and hours.
Cloud-based and microservice architectures have given us many advantages, but they come with complexity and chaos that can cause failures. It is an engineer’s responsibility to make the system as reliable as it can be.
Netflix's way of dealing with such systems has taught us a better approach and has given birth to a new discipline, "chaos engineering". Let's discuss it in more detail below.
As defined by a Netflix engineer:
"Chaos engineering is the discipline of experimenting on a software system in production to build confidence in the system's capability to withstand turbulent and unexpected conditions"
Chaos engineering is the process of stressing a software system by introducing disruptive events, such as server outages or API throttling. We deliberately inject failure scenarios, or faults, to test whether the system can survive unstable and unexpected conditions.
It also helps teams simulate the real-world conditions needed to uncover hidden issues, monitoring blind spots, and performance bottlenecks that are difficult to find in distributed systems. The method is effective at preventing downtime and production outages before they occur.
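To make the idea of injecting a fault concrete, here is a minimal Python sketch, not a production tool. It wraps a stand-in downstream call so that a configurable fraction of requests fail, roughly as they might during an outage or under API throttling, and then checks that the caller degrades gracefully. The names `fetch_balance`, `with_fault_injection`, and `fetch_balance_with_fallback` are hypothetical and chosen only for illustration.

```python
import random


class ServiceUnavailable(Exception):
    """Raised when the injected fault simulates an outage or throttling."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls raise ServiceUnavailable."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ServiceUnavailable("injected fault")
        return func(*args, **kwargs)
    return wrapper


def fetch_balance(account_id):
    """Hypothetical downstream call; stands in for a real API."""
    return {"account": account_id, "balance": 100}


def fetch_balance_with_fallback(fetch, account_id, retries=3):
    """Retry a few times, then degrade to a placeholder answer."""
    for _ in range(retries):
        try:
            return fetch(account_id)
        except ServiceUnavailable:
            continue
    return {"account": account_id, "balance": None, "stale": True}


if __name__ == "__main__":
    rng = random.Random(7)  # seeded so a run can be repeated
    flaky = with_fault_injection(fetch_balance, failure_rate=0.5, rng=rng)
    print(fetch_balance_with_fallback(flaky, "acct-1"))
```

In a real experiment the fault would typically be injected at the network or infrastructure layer rather than in application code, but the question being asked is the same: does the caller still behave acceptably when its dependency misbehaves?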
The term "chaos engineering" was coined by engineers at Netflix. Chaos engineering experiments are designed based on the following four principles:
When we develop an application, we put it through various tests, including unit tests, integration tests, and system tests.
In unit testing, we write a test case that checks the expected behaviour of a component in isolation from all external components, whereas integration testing checks how individual, interdependent components interact. But even extensive testing does not guarantee an error-free system, because these tests examine only predefined, isolated scenarios. The results reveal nothing new about the application, its behaviour, its performance, or its properties. This uncertainty grows with microservice architectures, where the system keeps expanding over time.
Chaos testing, by contrast, produces a wide range of unpredictable outcomes by experimenting on a distributed architecture, which builds confidence in the system's ability to withstand turbulent conditions in production. It is the deliberate introduction of failures and faulty scenarios into a system in order to understand how the system reacts and what the side effects might be. This type of testing is an effective way to prevent or minimize outages before they affect the system and, ultimately, the business.
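The contrast with predefined test cases can be sketched as a small experiment in Python. The sketch assumes a simulated service with a few replicas behind round-robin routing; it measures a baseline success rate, takes one replica down, and checks whether the observed success rate still meets a threshold. The function names, the threshold, and the routing model are all assumptions made for illustration, not a description of any particular tool.

```python
def run_requests(replica_health, total_requests, failover):
    """Route requests round-robin and return the fraction that succeed.

    replica_health is a list of booleans (True means the replica is up).
    With failover enabled, a request tries the other replicas in turn.
    """
    n = len(replica_health)
    succeeded = 0
    for i in range(total_requests):
        attempts = n if failover else 1
        for offset in range(attempts):
            if replica_health[(i + offset) % n]:
                succeeded += 1
                break
    return succeeded / total_requests


def chaos_experiment(replica_count, victim, threshold, failover):
    """Compare the success rate before and after one replica is taken down."""
    healthy = [True] * replica_count
    baseline = run_requests(healthy, 300, failover)

    faulty = list(healthy)
    faulty[victim] = False  # the injected fault
    observed = run_requests(faulty, 300, failover)

    return {
        "baseline": baseline,
        "observed": observed,
        "hypothesis_holds": observed >= threshold,
    }


if __name__ == "__main__":
    print(chaos_experiment(3, victim=1, threshold=0.99, failover=True))
    print(chaos_experiment(3, victim=1, threshold=0.99, failover=False))
```

With failover the simulated service keeps answering every request; without it, a third of the requests are lost. A unit test of a single replica would not surface that difference, because the weakness only appears when a failure is introduced into the running whole.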
There are many chaos experiments we can run against a system; which ones to choose depends mainly on our goals and system architecture.
Below is a list of the most common chaos tests:
The Netflix team created a suite of tools that support chaos engineering principles and named it the Simian Army. The tools continuously test the reliability, security, and resiliency of Netflix's Amazon Web Services infrastructure.
Chaos Monkey: A tool for testing the resilience of a system. It works by disabling one system in production and observing how the remaining systems respond to the outage. In other words, it tests system stability by forcing failures and then checking how the system responds.
The name "Chaos Monkey" is explained in the book Chaos Monkeys by Antonio Garcia Martinez:
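The behaviour described for Chaos Monkey can be imitated in a few lines of Python. This is only a toy simulation under stated assumptions, not Netflix's implementation: `InstanceGroup` is a hypothetical in-memory stand-in for a group of production instances, `heal` imitates the kind of automatic replacement an auto-scaling setup might provide, and the instance names are made up.

```python
import random


class InstanceGroup:
    """Hypothetical in-memory stand-in for a group of production instances."""

    def __init__(self, names, min_healthy):
        self.instances = set(names)
        self.min_healthy = min_healthy
        self._next_id = len(names)

    def terminate(self, name):
        """Remove an instance, simulating a sudden failure."""
        self.instances.discard(name)

    def heal(self):
        """Launch replacements until the group is back to min_healthy."""
        while len(self.instances) < self.min_healthy:
            self.instances.add(f"i-{self._next_id}")
            self._next_id += 1

    def is_serving(self):
        """The group can serve traffic while at least one instance is up."""
        return len(self.instances) > 0


def unleash_monkey(group, rng):
    """Terminate one randomly chosen instance and return its name."""
    victim = rng.choice(sorted(group.instances))
    group.terminate(victim)
    return victim


if __name__ == "__main__":
    group = InstanceGroup(["i-0", "i-1", "i-2"], min_healthy=3)
    victim = unleash_monkey(group, random.Random())
    print("terminated", victim, "still serving:", group.is_serving())
    group.heal()
    print("after healing:", sorted(group.instances))
```

The point of the exercise is the check after the termination: if losing a single instance stops the group from serving, the weakness is found during a controlled experiment rather than during a real incident.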
"Imagine a monkey entering a 'data centre', these 'farms' of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices, and returns everything that passes by the hand [i.e. flings excrement]. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy."
In DevOps and across the software development lifecycle (SDLC), applying chaos principles helps teams understand how well the system tolerates failure, which in turn helps reduce incidents in production.
There are scenarios where software needs to be deployed to an environment quickly. In those cases, chaos engineering can be applied within distributed, continuously changing, and complex development processes to find unexpected failures.
In today's software development lifecycle, chaos engineering has become a valuable tool that helps organizations improve the resiliency, flexibility, and velocity of their systems and also helps them operate distributed systems. It also lets teams remediate issues before they affect the system. For these reasons, chaos engineering is worth adopting.
In this article, we have given a brief overview of chaos engineering and shown how it can provide new insights into a system.
We hope the article has given you valuable insights into chaos engineering. This is an extensive field, and there is a lot more to learn about it.