- Blog Categories
- Project Management
- Agile Management
- IT Service Management
- Cloud Computing
- Business Management
- BI And Visualisation
- Quality Management
- Cyber Security
- DevOps
- Most Popular Blogs
- PMP Exam Schedule for 2025: Check PMP Exam Date
- Top 60+ PMP Exam Questions and Answers for 2025
- PMP Cheat Sheet and PMP Formulas To Use in 2025
- What is PMP Process? A Complete List of 49 Processes of PMP
- Top 15+ Project Management Case Studies with Examples 2025
- Top Picks by Authors
- Top 170 Project Management Research Topics
- What is Effective Communication: Definition
- How to Create a Project Plan in Excel in 2025?
- PMP Certification Exam Eligibility in 2025 [A Complete Checklist]
- PMP Certification Fees - All Aspects of PMP Certification Fee
- Most Popular Blogs
- CSM vs PSM: Which Certification to Choose in 2025?
- How Much Does Scrum Master Certification Cost in 2025?
- CSPO vs PSPO Certification: What to Choose in 2025?
- 8 Best Scrum Master Certifications to Pursue in 2025
- Safe Agilist Exam: A Complete Study Guide 2025
- Top Picks by Authors
- SAFe vs Agile: Difference Between Scaled Agile and Agile
- Top 21 Scrum Best Practices for Efficient Agile Workflow
- 30 User Story Examples and Templates to Use in 2025
- State of Agile: Things You Need to Know
- Top 24 Career Benefits of a Certifed Scrum Master
- Most Popular Blogs
- ITIL Certification Cost in 2025 [Exam Fee & Other Expenses]
- Top 17 Required Skills for System Administrator in 2025
- How Effective Is Itil Certification for a Job Switch?
- IT Service Management (ITSM) Role and Responsibilities
- Top 25 Service Based Companies in India in 2025
- Top Picks by Authors
- What is Escalation Matrix & How Does It Work? [Types, Process]
- ITIL Service Operation: Phases, Functions, Best Practices
- 10 Best Facility Management Software in 2025
- What is Service Request Management in ITIL? Example, Steps, Tips
- An Introduction To ITIL® Exam
- Most Popular Blogs
- A Complete AWS Cheat Sheet: Important Topics Covered
- Top AWS Solution Architect Projects in 2025
- 15 Best Azure Certifications 2025: Which one to Choose?
- Top 22 Cloud Computing Project Ideas in 2025 [Source Code]
- How to Become an Azure Data Engineer? 2025 Roadmap
- Top Picks by Authors
- Top 40 IoT Project Ideas and Topics in 2025 [Source Code]
- The Future of AWS: Top Trends & Predictions in 2025
- AWS Solutions Architect vs AWS Developer [Key Differences]
- Top 20 Azure Data Engineering Projects in 2025 [Source Code]
- 25 Best Cloud Computing Tools in 2025
- Most Popular Blogs
- Company Analysis Report: Examples, Templates, Components
- 400 Trending Business Management Research Topics
- Business Analysis Body of Knowledge (BABOK): Guide
- ECBA Certification: Is it Worth it?
- Top Picks by Authors
- Top 20 Business Analytics Project in 2025 [With Source Code]
- ECBA Certification Cost Across Countries
- Top 9 Free Business Requirements Document (BRD) Templates
- Business Analyst Job Description in 2025 [Key Responsibility]
- Business Analysis Framework: Elements, Process, Techniques
- Most Popular Blogs
- Best Career options after BA [2025]
- Top Career Options after BCom to Know in 2025
- Top 10 Power Bi Books of 2025 [Beginners to Experienced]
- Power BI Skills in Demand: How to Stand Out in the Job Market
- Top 15 Power BI Project Ideas
- Top Picks by Authors
- 10 Limitations of Power BI: You Must Know in 2025
- Top 45 Career Options After BBA in 2025 [With Salary]
- Top Power BI Dashboard Templates of 2025
- What is Power BI Used For - Practical Applications Of Power BI
- SSRS Vs Power BI - What are the Key Differences?
- Most Popular Blogs
- Data Collection Plan For Six Sigma: How to Create One?
- Quality Engineer Resume for 2025 [Examples + Tips]
- 20 Best Quality Management Certifications That Pay Well in 2025
- Six Sigma in Operations Management [A Brief Introduction]
- Top Picks by Authors
- Six Sigma Green Belt vs PMP: What's the Difference
- Quality Management: Definition, Importance, Components
- Adding Green Belt Certifications to Your Resume
- Six Sigma Green Belt in Healthcare: Concepts, Benefits and Examples
- Most Popular Blogs
- Latest CISSP Exam Dumps of 2025 [Free CISSP Dumps]
- CISSP vs Security+ Certifications: Which is Best in 2025?
- Best CISSP Study Guides for 2025 + CISSP Study Plan
- How to Become an Ethical Hacker in 2025?
- Top Picks by Authors
- CISSP vs Master's Degree: Which One to Choose in 2025?
- CISSP Endorsement Process: Requirements & Example
- OSCP vs CISSP | Top Cybersecurity Certifications
- How to Pass the CISSP Exam on Your 1st Attempt in 2025?
- Most Popular Blogs
- Top 7 Kubernetes Certifications in 2025
- Kubernetes Pods: Types, Examples, Best Practices
- DevOps Methodologies: Practices & Principles
- Docker Image Commands
- Top Picks by Authors
- Best DevOps Certifications in 2025
- 20 Best Automation Tools for DevOps
- Top 20 DevOps Projects of 2025
- OS for Docker: Features, Factors and Tips
- More
- Agile & PMP Practice Tests
- Agile Testing
- Agile Scrum Practice Exam
- CAPM Practice Test
- PRINCE2 Foundation Exam
- PMP Practice Exam
- Cloud Related Practice Test
- Azure Infrastructure Solutions
- AWS Solutions Architect
- IT Related Pratice Test
- ITIL Practice Test
- Devops Practice Test
- TOGAF® Practice Test
- Other Practice Test
- Oracle Primavera P6 V8
- MS Project Practice Test
- Project Management & Agile
- Project Management Interview Questions
- Release Train Engineer Interview Questions
- Agile Coach Interview Questions
- Scrum Interview Questions
- IT Project Manager Interview Questions
- Cloud & Data
- Azure Databricks Interview Questions
- AWS architect Interview Questions
- Cloud Computing Interview Questions
- AWS Interview Questions
- Kubernetes Interview Questions
- Web Development
- CSS3 Free Course with Certificates
- Basics of Spring Core and MVC
- Javascript Free Course with Certificate
- React Free Course with Certificate
- Node JS Free Certification Course
- Data Science
- Python Machine Learning Course
- Python for Data Science Free Course
- NLP Free Course with Certificate
- Data Analysis Using SQL
What is Chaos Engineering
By Kanav Preet
Updated on Jul 10, 2025 | 8 min read | 11.23K+ views
Share:
Table of Contents
View all
The 4th industrial revolution has swept the world. In just under a decade, our lives have become completely dependent on technology. The world has become a smaller place due to the internet and day by day we see an increase in the number of industries that are switching to the online platform. But this is still a new technology and emerging and developed economies are still trying to perfect the infrastructure and ecosystem which is needed to run these businesses online. This uncertainty makes failure more prevalent.
We generally came across headlines "Customers report difficulty in accessing bank mobile and online", "Bank Website down, not working" , "Service Unavailable" and such unpredictability is occurring on a regular frequency.
These outages/failures are often in complex and distributed systems, where often, several things fail at the same time, thereby compounding the problem. Finding the bugs and fixing them takes a couple of minutes to hours depending on system architecture, causing not only loss of revenue to the company but also loss of customer trust.
The system is built to handle individual failures, but in big chaotic systems, failure of systems or processes may lead to severe outages. The term Microservice Death Star, refers to an architecture that is poorly designed, has highly interdependent complex systems that are slow, inflexible and can blow up and lead to failure.
Last Few Days to Save Up To 90% on Career Transformation
Ends December 1 – Don't Miss Out!
Structure of microservices at Amazon
In the old world, our system was more simplistic due to monolithic architecture. It was easy to debug errors and consequently fix them. Code changes were shipped once a quarter, or half-yearly. But today, architecture has changed a lot with migration to the cloud where innovation and speed of execution have become part for our system. The system is changing not in order of weeks and days but in order of minutes and hours.
Usage of cloud-based and microservice architecture has provided us with a lot of advantages but come with complexity and chaos which can cause failure. It is an engineer’s responsibility to make the system as reliable as it can be.
Netflix's Way of Dealing with the system has taught us a better approach and has given birth to a new discipline "Chaos Engineering". Let's discuss more about it below.
Chaos Engineering and its Need:
As Defined by a Netflix Engineer:
"Chaos engineering is the discipline of experimenting on a software system in production to build confidence in the system's capability to withstand turbulent and unexpected conditions"
Reference Link.
Chaos engineering is the process of exposing a software system by introducing disruptive events, such as server outages or API throttling. In this process, we introduce failure scenarios, faults, to test the system’s capability of surviving against unstable and unexpected conditions.
It also helps teams to simulate real-world conditions needed to uncover the hidden issues, monitoring blind spots, and performance bottlenecks that are difficult to find in distributed systems. This method is quite effective in preventing downtime or production outages before their occurrence.
The Need for Chaos Engineering:
How does it benefit?
- Implementing Chaos Engineering improves the resilience of a system.
- By designing and executing Chaos Engineering experiments, we get to know about weaknesses in the system that could lead to outages, which in turn can lose us customers. This helps improve incident response.
- It helps us to improve the understanding of the risk of the system by exposing threats to the system.
Principles of Chaos Engineering:
The term Chaos Engineering was designed by Engineers at Netflix. Chaos Engineering Experiments are designed based on the following four principles:
- Define system’s normal behaviour: First, the steady state of the system is defined, thereby defining some measurable outputs which can indicate the system’s normal behaviour.
- Creating Hypothesis: During an experiment, we need a hypothesis for comparing to a stable control group, and the same applies here too. If there is a reasonable expectation for a particular action according to which we will change the steady state of a system, then the first thing to do is to fix the system so that we accommodate for the action that will potentially have that effect on the system.
- Apply real-world events: Design and create experiments by introducing real-world events like terminating servers, network failures, latency, dependency failure, memory malfunction, etc.
- Observe Results: In this, we will be comparing steady-state metrics with the system after introducing disturbance. For monitoring we can use cloudwatch, Kibana, splunk etc or any other tool which is already part of the system architecture. If there will be a difference in results, it can be used to identify future incidents, and improvements can be made. Otherwise, if there is no difference, it can improve a higher degree of trust and confidence about application among team members.
Difference Between Chaos Engineering And Testing :
When we develop an application, we pass it through various tests that include Unit Tests, Integration Tests, and System Tests.
With Unit testing, we write a unit test case and check the expected behaviour of a component that is independent of all external components whereas Integration testing checks the interaction of individual and inter-dependant components. But even extensive testing does not provide us with a guaranteed error-free system because this testing examines only pre-defined and single scenarios. The results don't cover new information about the application, system behaviour, performance, and properties. This uncertainty increases with the use of microservice architectures, where the system grows with passing time.
Whereas in chaos, it generates a wide range and unpredictable outcome for experimenting on a distributed architecture to build confidence in the system’s capability and withstand turbulent conditions in production. Chaos Testing is a deliberate introduction of failure and faulty scenarios into our system to understand how the system will react and what could be its side effects. This type of testing is an effective method to prevent/minimize outages before they impact the system and ultimately the business.
Chaos Engineering Examples
There are many chaos experiments that we can inject and test our system with, which mainly depend on our goals and system architecture.
Below is a list of the most common chaos tests:
- Simulating the failure of a micro-component and dependency.
- Simulating a high CPU load and sudden increase in traffic.
- Simulating failure of entire AZ(Availability Zone) or region.
- Injecting latency and byzantine failures in services.
- Exhausting memory on instances(cloud services) and allowing fault injection.
- Causing Host Failure.
List of Tools Developed by Netflix:
The Netflix Team has created a suite of tools that support chaos engineering principles and named it the Simian Army. The tools constantly test the reliability, security, or resiliency of its Amazon Web Services infrastructure.
- Chaos Monkey: It is a tool that is used to test the resilience of the system. It works by disabling one system of production and testing how other remaining systems respond to the outage. It is designed to test system stability by enforcing failures and later on checking the response of the system.
The name "Chaos Monkey" is explained in the book Chaos Monkeys by Antonio Garcia Martinez
"Imagine a monkey entering a 'data centre', these 'farms' of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices, and returns everything that passes by the hand [i.e. flings excrement]. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy."
Reference link.
- Latency Monkey: This is useful in testing fault tolerance of service by creating communication delays to provoke outages in the network.
- Doctor Monkey: It checks the health status as well as other components related to health of the system i.e. CPU load to detect unhealthy instances and eventually fixing the instance.
- Conformity Monkey: It finds the instance that doesn't adhere to best practices against a set of rules and sends an email notification to the owner of the instance.
- Janitor Monkey: Ensures cloud service is working free of unused resources and clutter. Disposes of any waste.
- Security Monkey: It is an extension of Conformity Monkey. It finds security violations or vulnerabilities, such as improperly configured AWS security groups, and terminates the offending instances.
- Chaos Gorilla: It is similar to Chaos Monkey, but drops full Availability Zone while testing.
Chaos Engineering and DevOps:
When it comes to DevOps and running SDLC, implementing chaos principles in the system helps in understanding system ability against failure, which later on helps in reducing incidents in production.
There are scenarios, where we quickly need to deploy the software in an environment, for all those cases we can perform chaos engineering in distributed, continuous-changing, and complex development methodologies to find unexpected failures.
Advantages:
- Insights received after running chaos testing can lead to a reduction in production incidents for the future.
- Through Chaos Engineering, the team can verify the system's behaviour on failure so that accordingly it takes action.
- Chaos Engineering helps in the testing response of the team to the incident. Also, helps in testing if the raised alert has been notified to the correct team.
- On a high level, Chaos Engineering provides us an advantage by overall system availability. Chaos Experiments make the system more resilient to failures.
- Production outages can lead to huge losses to companies depending on the usage of the system, therefore chaos engineering helps in the prevention of large losses in revenue.
- It helps in improving the confidence and engagement of team members for carrying out disaster recovery methods and makes applications highly reliable.
Disadvantages:
- Implementing Chaos Monkey for a large-scale system and experimenting can lead to an increase in cost.
- Carelessness or Incorrect steps in formation and implementation can impact the application, thereby hampering the customer.
- While implementing the project, it doesn't provide any Interface to track and monitor. It runs through scripts and configuration files.
- It doesn't support all kinds of deployment.
Conclusion:
In the present world of Software Development Lifecycle, chaos engineering has become a magnificent tool which can help organizations to not only improve resiliency, flexibility, and velocity of the system, but also helps in operating distributed system. Along with these benefits, it has also provided us with remediation of the issue before it impacts the system. Implementation of Chaos Engineering is important and should be adopted for better outcomes.
In the above article, we have shared a brief about chaos engineering and demonstrated how it can provide new insights to the system.
Hope this article has provided you with valuable insights about chaos engineering. This is an extensive field and there is a lot more to learn about it.
8 articles published
Kanav is working as SRE in leading fintech firm having experience in CICD Pipeline, Cloud, Automation, Build Release and Deployment. She is passionate about leveraging technology to build innovative ...
Get Free Consultation
By submitting, I accept the T&C and
Privacy Policy
Preparing to hone DevOps Interview Questions?
