Explore Courses
course iconScrum AllianceCertified ScrumMaster (CSM) Certification
  • 16 Hours
Best seller
course iconScrum AllianceCertified Scrum Product Owner (CSPO) Certification
  • 16 Hours
Best seller
course iconScaled AgileLeading SAFe 6.0 Certification
  • 16 Hours
Trending
course iconScrum.orgProfessional Scrum Master (PSM) Certification
  • 16 Hours
course iconScaled AgileSAFe 6.0 Scrum Master (SSM) Certification
  • 16 Hours
course iconScaled Agile, Inc.Implementing SAFe 6.0 (SPC) Certification
  • 32 Hours
Recommended
course iconScaled Agile, Inc.SAFe 6.0 Release Train Engineer (RTE) Certification
  • 24 Hours
course iconScaled Agile, Inc.SAFe® 6.0 Product Owner/Product Manager (POPM)
  • 16 Hours
Trending
course iconIC AgileICP Agile Certified Coaching (ICP-ACC)
  • 24 Hours
course iconScrum.orgProfessional Scrum Product Owner I (PSPO I) Training
  • 16 Hours
course iconAgile Management Master's Program
  • 32 Hours
Trending
course iconAgile Excellence Master's Program
  • 32 Hours
Agile and ScrumScrum MasterProduct OwnerSAFe AgilistAgile CoachFull Stack Developer BootcampData Science BootcampCloud Masters BootcampReactNode JsKubernetesCertified Ethical HackingAWS Solutions Architect AssociateAzure Data Engineercourse iconPMIProject Management Professional (PMP) Certification
  • 36 Hours
Best seller
course iconAxelosPRINCE2 Foundation & Practitioner Certification
  • 32 Hours
course iconAxelosPRINCE2 Foundation Certification
  • 16 Hours
course iconAxelosPRINCE2 Practitioner Certification
  • 16 Hours
Change ManagementProject Management TechniquesCertified Associate in Project Management (CAPM) CertificationOracle Primavera P6 CertificationMicrosoft Projectcourse iconJob OrientedProject Management Master's Program
  • 45 Hours
Trending
course iconProject Management Master's Program
  • 45 Hours
Trending
PRINCE2 Practitioner CoursePRINCE2 Foundation CourseProject ManagerProgram Management ProfessionalPortfolio Management Professionalcourse iconAWSAWS Certified Solutions Architect - Associate
  • 32 Hours
Best seller
course iconAWSAWS Cloud Practitioner Certification
  • 32 Hours
course iconAWSAWS DevOps Certification
  • 24 Hours
course iconMicrosoftAzure Fundamentals Certification
  • 16 Hours
course iconMicrosoftAzure Administrator Certification
  • 24 Hours
Best seller
course iconMicrosoftAzure Data Engineer Certification
  • 45 Hours
Recommended
course iconMicrosoftAzure Solution Architect Certification
  • 32 Hours
course iconMicrosoftAzure DevOps Certification
  • 40 Hours
course iconAWSSystems Operations on AWS Certification Training
  • 24 Hours
course iconAWSDeveloping on AWS
  • 24 Hours
course iconJob OrientedAWS Cloud Architect Masters Program
  • 48 Hours
New
course iconCareer KickstarterCloud Engineer Bootcamp
  • 100 Hours
Trending
Cloud EngineerCloud ArchitectAWS Certified Developer Associate - Complete GuideAWS Certified DevOps EngineerAWS Certified Solutions Architect AssociateMicrosoft Certified Azure Data Engineer AssociateMicrosoft Azure Administrator (AZ-104) CourseAWS Certified SysOps Administrator AssociateMicrosoft Certified Azure Developer AssociateAWS Certified Cloud Practitionercourse iconAxelosITIL 4 Foundation Certification
  • 16 Hours
Best seller
course iconAxelosITIL Practitioner Certification
  • 16 Hours
course iconPeopleCertISO 14001 Foundation Certification
  • 16 Hours
course iconPeopleCertISO 20000 Certification
  • 16 Hours
course iconPeopleCertISO 27000 Foundation Certification
  • 24 Hours
course iconAxelosITIL 4 Specialist: Create, Deliver and Support Training
  • 24 Hours
course iconAxelosITIL 4 Specialist: Drive Stakeholder Value Training
  • 24 Hours
course iconAxelosITIL 4 Strategist Direct, Plan and Improve Training
  • 16 Hours
ITIL 4 Specialist: Create, Deliver and Support ExamITIL 4 Specialist: Drive Stakeholder Value (DSV) CourseITIL 4 Strategist: Direct, Plan, and ImproveITIL 4 Foundationcourse iconJob OrientedData Science Bootcamp
  • 6 Months
Trending
course iconJob OrientedData Engineer Bootcamp
  • 289 Hours
course iconJob OrientedData Analyst Bootcamp
  • 6 Months
course iconJob OrientedAI Engineer Bootcamp
  • 288 Hours
New
Data Science with PythonMachine Learning with PythonData Science with RMachine Learning with RPython for Data ScienceDeep Learning Certification TrainingNatural Language Processing (NLP)TensorFlowSQL For Data AnalyticsData ScientistData AnalystData EngineerAI EngineerData Analysis Using ExcelDeep Learning with Keras and TensorFlowDeployment of Machine Learning ModelsFundamentals of Reinforcement LearningIntroduction to Cutting-Edge AI with TransformersMachine Learning with PythonMaster Python: Advance Data Analysis with PythonMaths and Stats FoundationNatural Language Processing (NLP) with PythonPython for Data ScienceSQL for Data Analytics CoursesAI Advanced: Computer Vision for AI ProfessionalsMaster Applied Machine LearningMaster Time Series Forecasting Using Pythoncourse iconDevOps InstituteDevOps Foundation Certification
  • 16 Hours
Best seller
course iconCNCFCertified Kubernetes Administrator
  • 32 Hours
New
course iconDevops InstituteDevops Leader
  • 16 Hours
KubernetesDocker with KubernetesDockerJenkinsOpenstackAnsibleChefPuppetDevOps EngineerDevOps ExpertCI/CD with Jenkins XDevOps Using JenkinsCI-CD and DevOpsDocker & KubernetesDevOps Fundamentals Crash CourseMicrosoft Certified DevOps Engineer ExpertAnsible for Beginners: The Complete Crash CourseContainer Orchestration Using KubernetesContainerization Using DockerMaster Infrastructure Provisioning with Terraformcourse iconCertificationTableau Certification
  • 24 Hours
Recommended
course iconCertificationData Visualization with Tableau Certification
  • 24 Hours
course iconMicrosoftMicrosoft Power BI Certification
  • 24 Hours
Best seller
course iconTIBCOTIBCO Spotfire Training
  • 36 Hours
course iconCertificationData Visualization with QlikView Certification
  • 30 Hours
course iconCertificationSisense BI Certification
  • 16 Hours
Data Visualization Using Tableau TrainingData Analysis Using Excelcourse iconCompTIACompTIA Security+
  • 40 Hours
Best seller
course iconEC-CouncilCertified Ethical Hacker (CEH v12) Certification
  • 40 Hours
course iconISACACertified Information Systems Auditor (CISA) Certification
  • 22 Hours
course iconISACACertified Information Security Manager (CISM) Certification
  • 40 Hours
course icon(ISC)²Certified Information Systems Security Professional (CISSP)
  • 40 Hours
course icon(ISC)²Certified Cloud Security Professional (CCSP) Certification
  • 40 Hours
course iconCertified Information Privacy Professional - Europe (CIPP-E) Certification
  • 16 Hours
course iconISACACOBIT5 Foundation
  • 16 Hours
course iconPayment Card Industry Security Standards (PCI-DSS) Certification
  • 16 Hours
CISSPcourse iconCareer KickstarterFull-Stack Developer Bootcamp
  • 6 Months
Best seller
course iconJob OrientedUI/UX Design Bootcamp
  • 3 Months
Best seller
course iconEnterprise RecommendedJava Full Stack Developer Bootcamp
  • 6 Months
course iconCareer KickstarterFront-End Development Bootcamp
  • 490+ Hours
course iconCareer AcceleratorBackend Development Bootcamp (Node JS)
  • 4 Months
ReactNode JSAngularJavascriptPHP and MySQLAngular TrainingBasics of Spring Core and MVCFront-End Development BootcampReact JS TrainingSpring Boot and Spring CloudMongoDB Developer Coursecourse iconBlockchain Professional Certification
  • 40 Hours
course iconBlockchain Solutions Architect Certification
  • 32 Hours
course iconBlockchain Security Engineer Certification
  • 32 Hours
course iconBlockchain Quality Engineer Certification
  • 24 Hours
course iconBlockchain 101 Certification
  • 5+ Hours
NFT Essentials 101: A Beginner's GuideIntroduction to DeFiPython CertificationAdvanced Python CourseR Programming LanguageAdvanced R CourseJavaJava Deep DiveScalaAdvanced ScalaC# TrainingMicrosoft .Net Frameworkcourse iconCareer AcceleratorSoftware Engineer Interview Prep
  • 3 Months
Data Structures and Algorithms with JavaScriptData Structures and Algorithms with Java: The Practical GuideLinux Essentials for Developers: The Complete MasterclassMaster Git and GitHubMaster Java Programming LanguageProgramming Essentials for BeginnersSoftware Engineering Fundamentals and Lifecycle (SEFLC) CourseTest-Driven Development for Java ProgrammersTypeScript: Beginner to Advanced

What is Chaos Engineering

By Kanav Preet

Updated on Jul 10, 2025 | 8 min read | 11.23K+ views

Share:

The 4th industrial revolution has swept the world. In just under a decade, our lives have become completely dependent on technology. The world has become a smaller place due to the internet and day by day we see an increase in the number of industries that are switching to the online platform. But this is still a new technology and emerging and developed economies are still trying to perfect the infrastructure and ecosystem which is needed to run these businesses online. This uncertainty makes failure more prevalent.  

We generally came across headlines "Customers report difficulty in accessing bank mobile and online", "Bank Website down, not working" , "Service Unavailable" and such unpredictability is occurring on a regular frequency.  

These outages/failures are often in complex and distributed systems, where often, several things fail at the same time, thereby compounding the problem. Finding the bugs and fixing them takes a couple of minutes to hours depending on system architecture, causing not only loss of revenue to the company but also loss of customer trust. 

The system is built to handle individual failures, but in big chaotic systems, failure of systems or processes may lead to severe outages. The term Microservice Death Star, refers to an architecture that is poorly designed, has highly interdependent complex systems that are slow, inflexible and can blow up and lead to failure. 

Last Few Days to Save Up To 90% on Career Transformation

Ends December 1 – Don't Miss Out!

Structure of microservices at Amazon

Image Source

In the old world, our system was more simplistic due to monolithic architecture. It was easy to debug errors and consequently fix them. Code changes were shipped once a quarter, or half-yearly. But today, architecture has changed a lot with migration to the cloud where innovation and speed of execution have become part for our system. The system is changing not in order of weeks and days but in order of minutes and hours. 

Usage of cloud-based and microservice architecture has provided us with a lot of advantages but come with complexity and chaos which can cause failure. It is an engineer’s responsibility to make the system as reliable as it can be.  

Netflix's Way of Dealing with the system has taught us a better approach and has given birth to a new discipline "Chaos Engineering". Let's discuss more about it below.  

Chaos Engineering and its Need:

As Defined by a Netflix Engineer: 

"Chaos engineering is the discipline of experimenting on a software system in production to build confidence in the system's capability to withstand turbulent and unexpected conditions" 

Reference Link.

Chaos engineering is the process of exposing a software system by introducing disruptive events, such as server outages or API throttling. In this process, we introduce  failure scenarios, faults, to test  the system’s capability of surviving against unstable and unexpected conditions. 

It also helps teams to simulate real-world conditions needed to uncover the hidden issues, monitoring blind spots, and performance bottlenecks that are difficult to find in distributed systems. This method is quite effective in preventing downtime or production outages before their occurrence. 

The Need for Chaos Engineering: 

How does it benefit? 

  • Implementing Chaos Engineering improves the resilience of a system.  
  • By designing and executing Chaos Engineering experiments, we  get to know about weaknesses in the system that could lead to outages, which in turn can lose us customers. This helps improve incident response. 
  • It helps us to improve the understanding of the risk of the system by exposing threats to the system.  

Principles of Chaos Engineering: 

The term Chaos Engineering was designed  by Engineers at Netflix. Chaos Engineering Experiments are designed based on the following four principles: 

  1. Define system’s normal behaviour: First, the steady state of the system is defined, thereby defining some measurable outputs which can indicate the system’s normal behaviour. 
  2. Creating Hypothesis:  During an experiment, we need a hypothesis for comparing to a stable control group, and the same applies here too. If there is a reasonable expectation for a particular action according to which we will change the steady state of a system, then the first thing to do is to fix the system so that we accommodate for the action that will potentially have that effect on the system.  
  3. Apply real-world events: Design and create experiments by introducing real-world events like terminating servers, network failures, latency, dependency failure, memory malfunction, etc. 
  4. Observe Results: In this, we will be comparing steady-state metrics with the system after introducing disturbance. For monitoring we can use cloudwatch, Kibana, splunk etc or any other tool which is already part of the system architecture. If there will be a difference in results, it can be used to identify future incidents, and improvements can be made. Otherwise, if there is no difference, it can improve a higher degree of trust and confidence about application among team members. 

Difference Between Chaos Engineering And Testing 

When we develop an application, we pass it through various tests that include Unit Tests, Integration Tests, and System Tests. 

With Unit testing, we write a unit test case and check the expected behaviour of a component that is independent of all external components whereas Integration testing checks the interaction of individual and inter-dependant components. But even extensive testing does not provide us with a guaranteed error-free system because this testing examines only pre-defined and single scenarios. The results don't cover new information about the application, system behaviour, performance, and properties. This uncertainty increases with the use of microservice architectures, where the system grows with passing time. 

Whereas in chaos, it generates a wide range and unpredictable outcome for experimenting on a distributed architecture to build confidence in the system’s capability and withstand turbulent conditions in production. Chaos Testing is a deliberate introduction of failure and faulty scenarios into our system to understand how the system will react and what could be its side effects. This type of testing is an effective method to prevent/minimize outages before they impact the system and ultimately the business.  

Chaos Engineering Examples 

There are many chaos experiments that we can inject and test our system with, which mainly depend on our goals and system architecture.  

Below is a list of the most common chaos tests: 

  • Simulating the failure of a micro-component and dependency. 
  • Simulating a high CPU load and sudden increase in traffic. 
  • Simulating failure of entire AZ(Availability Zone) or region. 
  • Injecting latency and byzantine failures in services. 
  • Exhausting memory on instances(cloud services) and allowing fault injection. 
  • Causing Host Failure. 

List of Tools Developed by Netflix: 

The Netflix Team has created a suite of tools that support chaos engineering principles and named it the Simian Army. The tools constantly test the reliability, security, or resiliency of its Amazon Web Services infrastructure. 

  • Chaos Monkey: It is a tool that is used to test the resilience of the system. It works by disabling one system of production and testing how other remaining systems respond to the outage. It is designed to test system stability by enforcing failures and later on checking the response of the system.

The name "Chaos Monkey" is explained in the book Chaos Monkeys by Antonio Garcia Martinez 

"Imagine a monkey entering a 'data centre', these 'farms' of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices, and returns everything that passes by the hand [i.e. flings excrement]. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy." 

Reference link.

  • Latency Monkey: This is useful in testing fault tolerance of service by creating communication delays to provoke outages in the network. 
  • Doctor Monkey: It checks the health status as well as other components related to health of the system i.e. CPU load to detect unhealthy instances and eventually fixing the instance. 
  • Conformity Monkey: It finds the instance that doesn't adhere to best practices against a set of rules and sends an email notification to the owner of the instance. 
  • Janitor Monkey: Ensures cloud service is working free of unused resources and clutter. Disposes of any waste. 
  • Security Monkey: It is an extension of Conformity Monkey. It finds security violations or vulnerabilities, such as improperly configured AWS security groups, and terminates the offending instances. 
  • Chaos Gorilla: It is similar to Chaos Monkey, but drops full Availability Zone while testing. 

Chaos Engineering and DevOps: 

When it comes to DevOps and running SDLC, implementing chaos principles in the system helps in understanding system ability against failure, which later on helps in reducing incidents in production. 

There are scenarios, where we quickly need to deploy the software in an environment, for all those cases we can perform chaos engineering in distributed, continuous-changing, and complex development methodologies to find unexpected failures. 

Advantages: 

  • Insights received after running chaos testing can lead to a reduction in production incidents for the future. 
  • Through Chaos Engineering, the team can verify the system's behaviour on failure so that accordingly it takes action. 
  • Chaos Engineering helps in the testing response of the team to the incident. Also, helps in testing if the raised alert has been notified to the correct team. 
  • On a high level, Chaos Engineering provides us an advantage by overall system availability. Chaos Experiments make the system more resilient to failures. 
  • Production outages can lead to huge losses to companies depending on the usage of the system, therefore chaos engineering helps in the prevention of large losses in revenue. 
  • It helps in improving the confidence and engagement of team members for carrying out disaster recovery methods and makes applications highly reliable. 

Disadvantages: 

  • Implementing Chaos Monkey for a large-scale system and experimenting can lead to an increase in cost. 
  • Carelessness or Incorrect steps in formation and implementation can impact the application, thereby hampering the customer. 
  • While implementing the project, it doesn't provide any Interface to track and monitor. It runs through scripts and configuration files. 
  • It doesn't support all kinds of deployment.  

Conclusion:

In the present world of Software Development Lifecycle, chaos engineering has become a magnificent tool which can help organizations to not only improve resiliency, flexibility, and velocity of the system, but also helps in operating distributed system. Along with these benefits, it has also provided us with remediation of the issue before it impacts the system. Implementation of Chaos Engineering is important and should be adopted for better outcomes. 

In the above article, we have shared a brief about chaos engineering and demonstrated how it can provide new insights to the system. 

Hope this article has provided you with valuable insights about chaos engineering. This is an extensive field and there is a lot more to learn about it.   

Kanav Preet

8 articles published

Kanav is working as SRE in leading fintech firm having experience in CICD Pipeline, Cloud, Automation, Build Release and Deployment. She is passionate about leveraging technology to build innovative ...

Get Free Consultation

+91

By submitting, I accept the T&C and
Privacy Policy

Preparing to hone DevOps Interview Questions?