/Chaos Engineering

That Time We Unplugged A Data Center To Test Our Disaster Readiness

- Krishelle Hardson-Hurley, Ross Delinger, Tong Pham tl;dr: "One way of communicating our preparedness to our customers is through a metric called Recovery Time Objective (RTO). RTO measures the amount of time we promise it will take to recover from a catastrophic event." Dropbox reduced its RTO by "more than an order of magnitude," as discussed in this post.

featured in #313


Building Resilient Services At Prime Video With Chaos Engineering

- Varun Jewalikar, Adrian Hornsby tl;dr: "A simple approach for fault injection in systems utilizing Amazon EC2 and ECS, and its integration with a load-testing suite to validate the countermeasures put in place to prevent dependency and resource exhaustion failures."

featured in #201


Chaos Engineering Traps

- Nora Jones tl;dr: A guide on how to approach Chaos Engineering, the increasingly common practice of simulating unexpected real-world conditions on distributed systems to test for vulnerabilities, along with the common traps. Click this tweet if paywalled.

featured in #138