Demystifying Cascading Failures and Building Resilient Systems
As a seasoned IT professional, I’ve had the privilege of delving into AWS’s post-event summaries (PES) of major outages. These reports not only demonstrate AWS’s open engineering culture but also provide invaluable insights for building more resilient systems. In this article, we’ll examine four instructive AWS outages and draw out key strategies for designing fault-tolerant, highly available infrastructure.
The Dangers of Dependency Management: Lessons from the 2011 Amazon EC2/EBS Outage
The April 21, 2011, Amazon EC2/EBS event in the US East Region is a prime example of how poorly managed dependencies and cascading failures can wreak havoc on a system. A seemingly simple network configuration change, intended to upgrade the capacity of the primary network, triggered a chain of events that ultimately impacted the entire US East Region.
The trouble began when the change shifted traffic onto a lower-capacity backup network, causing many Elastic Block Store (EBS) nodes to lose connectivity with their replicas. When connectivity returned, these nodes all searched for free space to re-mirror their data, quickly exhausting the EBS cluster’s spare capacity. This “remirroring storm” left many EBS volumes “stuck,” unable to process read and write operations.
The blast radius was initially confined to a single Availability Zone, with about 13% of that zone’s volumes stuck. However, the outage soon spread to the entire region because the EBS control plane, responsible for handling API requests, depended on the degraded EBS cluster. The load from the remirroring storm overwhelmed the control plane, making it intermittently unavailable and affecting users across the region.
This incident underscores the importance of building resilient systems and carefully managing dependencies. The regional EBS control plane’s dependence on the degraded cluster in a single Availability Zone and the absence of back-off mechanisms in the re-mirroring process were critical factors in the cascading failure.
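AWS hasn’t published the re-mirroring logic itself, but the back-off lesson applies to any retry loop competing for shared capacity. Below is a minimal sketch of exponential backoff with full jitter in Python; the request_remirror function and its parameters are hypothetical stand-ins for whatever operation is being retried.

```python
import random
import time


class CapacityExhausted(Exception):
    """Raised by the (hypothetical) operation when no spare capacity is available."""


def request_remirror(volume_id: str) -> bool:
    # Hypothetical stand-in for "find a node with free space and start re-mirroring".
    # A real implementation would call the storage cluster's placement service.
    raise CapacityExhausted(f"no spare capacity for {volume_id}")


def retry_with_backoff(volume_id: str, max_attempts: int = 5,
                       base: float = 0.5, cap: float = 30.0) -> bool:
    """Retry an operation with exponential backoff and full jitter.

    Spreading retries out in time keeps thousands of failed nodes from
    hammering the cluster at the same moment (the 'remirroring storm' pattern).
    """
    for attempt in range(max_attempts):
        try:
            return request_remirror(volume_id)
        except CapacityExhausted:
            # Sleep a random duration between 0 and the capped exponential delay.
            time.sleep(random.uniform(0, min(cap, base * (2 ** attempt))))
    return False  # Give up and surface the failure instead of retrying forever.


# retry_with_backoff("vol-123")  # hypothetical volume id
```

The key design choice is the jitter: without it, every node retries on the same schedule and the original load spike simply repeats.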
Localized Outage, Regional Impact: Lessons from the 2012 AWS Services Disruption
The June 29, 2012, AWS services event in the US East Region shows how a localized power outage can trigger a region-wide service disruption through complex dependencies. A severe electrical storm knocked out power to a single data center, impacting only a single-digit percentage of the AWS resources in the US East Region.
Initially, the power outage was confined to the affected Availability Zone. However, it soon degraded the service control planes that manage resources across the entire region. While these control planes aren’t required for the ongoing use of already-running resources, their degradation hindered customers’ ability to respond to the outage, for example by using the AWS console to move resources to other Availability Zones.
This event underscores the critical importance of designing fault-tolerant systems, even when hosting applications in the cloud. The potential for cascading failures when a small percentage of infrastructure is impacted, combined with the interconnected nature of services and dependencies on control planes, can significantly amplify the impact of localized outages.
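One mitigation this suggests is static stability: keep standby capacity running in a second Availability Zone ahead of time, so that failing over only requires data-plane actions such as shifting traffic, rather than calls to a control plane that may itself be degraded during the incident. The Python sketch below illustrates the idea with hypothetical endpoints; it is not AWS’s mechanism.

```python
import urllib.request

# Hypothetical, pre-provisioned endpoints in two Availability Zones.
# Both are already running before any incident, so failing over needs
# no control-plane or console actions.
PRIMARY = "https://app-az1.example.com/healthz"
STANDBY = "https://app-az2.example.com/healthz"


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers its health check in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers connection errors, HTTP errors, and timeouts
        return False


def choose_endpoint() -> str:
    """Pick a target using only data-plane checks.

    Note what is absent: no provisioning API calls, no console workflow.
    Because the standby already exists, the decision still works when
    regional control planes are degraded.
    """
    return PRIMARY if is_healthy(PRIMARY) else STANDBY


if __name__ == "__main__":
    print("routing traffic to:", choose_endpoint())
```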
Centralized Services as Single Points of Failure: Lessons from the 2014 Amazon SimpleDB Disruption
The Amazon SimpleDB service disruption on June 13, 2014, in the US East Region demonstrates how a seemingly minor issue can escalate into a significant disruption due to dependencies on a centralized service. A power outage in a single data center caused multiple storage nodes to become unavailable, leading to a spike in load on the internal lock service, which manages node responsibility for data and metadata.
While the lock service is replicated across multiple data centers, the sudden load spike was too high for the system to handle. Initially, the impact was confined to the storage nodes in the affected data center. However, the increased load on the centralized lock service, crucial for all SimpleDB operations, caused cascading failures that affected the entire service.
This outage highlights the critical importance of addressing single points of failure when designing systems. The lock service, though intended only for coordination, effectively became a single point of failure because every SimpleDB operation depends on it. A more distributed or load-balanced approach to lock management could have mitigated the impact of the simultaneous node failures.
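The internal design of the SimpleDB lock service isn’t public, but a common way to avoid concentrating coordination load on one component is consistent hashing, which spreads key ownership across many nodes and reassigns only a small share when a node is lost. A minimal sketch with made-up coordinator names:

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class ConsistentHashRing:
    """Map keys to nodes so losing one node only reassigns that node's share."""

    def __init__(self, nodes, vnodes: int = 100):
        self._ring = []  # sorted (hash, node) pairs
        for node in nodes:
            for i in range(vnodes):  # virtual nodes smooth out the distribution
                self._ring.append((_hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    def owner(self, key: str) -> str:
        """Return the node responsible for coordinating this key."""
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


# Hypothetical lock coordinators spread across data centers.
ring = ConsistentHashRing(["lock-a", "lock-b", "lock-c", "lock-d"])
print(ring.owner("domain/items/42"))
```

Each key has exactly one owner, but no single node owns them all, so a burst of reassignments after a data-center loss lands on all surviving coordinators rather than on one central service.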
Capacity Additions and Unexpected Behaviors: Lessons from the 2020 Amazon Kinesis Outage
The Amazon Kinesis event in the US East Region on November 25, 2020, is a perfect example of how adding capacity can unexpectedly trigger a cascade of failures due to unforeseen dependencies and resource limitations. A small capacity addition to the front-end fleet of the Kinesis service led to unexpected behavior in the front-end servers responsible for routing requests.
The capacity addition caused all of the servers in the front-end fleet to exceed the maximum number of threads allowed by an operating system configuration: each front-end server creates operating system threads for every other server in the fleet, which is how servers learn about new members of the cluster. This ultimately caused cache construction failures and prevented servers from routing requests to the back-end clusters.
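The specific limits involved aren’t spelled out beyond the summary, but the arithmetic behind this failure mode can be checked before a scale-up. The sketch below compares the per-process thread count a thread-per-peer design needs at a planned fleet size against the operating system’s limit; the base thread count and fleet size are illustrative guesses, and the relevant limit on a real host may be a kernel or container setting instead.

```python
import resource


def threads_needed(fleet_size: int, base_threads: int = 200) -> int:
    """A thread-per-peer design needs one thread for every other server in the
    fleet, plus whatever the process already uses (base_threads is a guess)."""
    return base_threads + (fleet_size - 1)


def preflight_check(planned_fleet_size: int) -> None:
    # RLIMIT_NPROC caps threads/processes per user on Unix-like systems; treat
    # this as one of several checks, not the only limit that matters.
    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    needed = threads_needed(planned_fleet_size)
    if soft != resource.RLIM_INFINITY and needed >= soft:
        raise RuntimeError(
            f"the planned fleet needs ~{needed} threads per server, over the "
            f"soft limit of {soft}; raise the limit or rethink the design first"
        )
    print(f"ok: ~{needed} threads per server fits under the limit ({soft})")


preflight_check(planned_fleet_size=5000)  # illustrative fleet size
```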
Although the initial trigger was a capacity addition intended to enhance performance, the resulting issue affected the entire Kinesis service in the US East Region. The dependency of many AWS services on Kinesis amplified the impact significantly.
This incident highlights the importance of thoroughly understanding dependencies, resource limitations, and potential failure points when making changes to a system, even those intended to improve capacity or performance. Robust testing and monitoring are crucial to identify and mitigate such unexpected behaviors.
Embracing Complexity, Pursuing Simplicity
The AWS outage reports reveal a fundamental truth about system design: complexity and interdependency can be both our greatest strengths and our most significant vulnerabilities. As we build and scale our systems, we must strive for simplicity, resilience, and a thorough understanding of our dependencies.
To ensure our systems are prepared for the unexpected, it’s crucial to internalize and act upon the lessons these incidents teach us:
- Manage Dependencies Carefully: Understand the critical dependencies in your system and design robust, loosely coupled architectures that can withstand the failure of individual components.
- Anticipate Cascading Failures: Recognize the potential for small issues to escalate into large-scale outages due to interconnected services and shared dependencies. Implement fail-safe mechanisms and back-off strategies to prevent cascading failures (see the circuit-breaker sketch after this list).
- Address Single Points of Failure: Identify and eliminate centralized services or components that could become single points of failure. Favor distributed, load-balanced, and redundant designs.
- Thoroughly Test and Monitor Changes: Rigorously test system changes, even those intended to enhance capacity or performance, to uncover unexpected behaviors and potential failure points. Implement comprehensive monitoring to quickly detect and respond to anomalies.
- Foster a Culture of Continuous Learning: Embrace the lessons from outage reports and post-incident reviews. Continuously learn from past mistakes and apply those insights to enhance the resilience of your systems.
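As referenced in the second point above, a circuit breaker is one concrete fail-safe that keeps a degraded dependency from dragging its callers down with it: after repeated failures, callers fail fast for a cool-down period instead of piling more load onto a struggling service. A minimal sketch, not tied to any particular library:

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a cool-down period instead of piling on.

    States: 'closed' (calls flow normally) and 'open' (calls fail fast); once the
    cool-down elapses, the next call is let through as a trial.
    """

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = 0.0
        self.state = "closed"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                raise RuntimeError("circuit open: failing fast instead of overloading the dependency")
            self.state = "closed"  # cool-down elapsed; allow a trial call
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success resets the count
        return result


# Usage sketch: wrap calls to a downstream service that may be degraded.
breaker = CircuitBreaker()
# breaker.call(some_client.describe_volumes)  # hypothetical downstream call
```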
The path to reliable systems is paved with continuous learning and adaptation. Let’s embrace these lessons and push the boundaries of what our systems can achieve, ensuring that we are always prepared for the unexpected.
To explore more in-depth networking support and IT solutions, visit ITFix.