Amazon Comes Clean About the Great Cloud Outage

Amazon has posted an essay-length explanation of the cloud outage that took some of the Web's most popular services offline last week. In summary, it appears that human error during a system upgrade caused network traffic for the Elastic Block Store (EBS) in the U.S. East Region to be sent onto a redundant backup network that couldn't handle the load, overloading it and jamming up the system. 

At the end of a long battle to restore services, Amazon says it managed to recover most data, but 0.07 percent "could not be restored for customers in a consistent state". A rather miserly 10-day usage credit is being given to affected users, who should check their Amazon Web Services (AWS) control panel to see if they qualify. No doubt several users are also consulting the AWS terms and conditions right now, if not lawyers.  

A software bug played a part, too. Although unlikely to surface in normal EBS usage, the bug became a substantial problem because of the sheer volume of failures occurring. Amazon also says its warning systems were not "fine-grained enough" to spot secondary issues that arose while other, louder alarm bells were already ringing. 

Amazon calls the outage a "re-mirroring storm." EBS is essentially the storage component of the Elastic Compute Cloud (EC2), which lets users hire computing capacity in Amazon's cloud service.

EBS runs over two networks: a primary one and a secondary network that's slower and used for backup and intercommunication. The service itself is organized into clusters of nodes, with each node acting as a separate storage unit. 

Each node's data is always kept in two copies, to preserve data integrity; re-establishing a lost copy is called re-mirroring. Crucially, if a node is unable to find a partner node to back up to, it gets stuck and keeps searching until spare capacity turns up. Similarly, new nodes also need to create a partner copy before they're valid, and will get stuck until they succeed.  
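Amazon's postmortem doesn't include code, but the behaviour described above can be sketched roughly as follows. Everything here (the Cluster and Node classes, find_spare_capacity, the retry cap) is a hypothetical illustration, not Amazon's implementation: a node that loses its mirror suspends I/O and keeps hunting for spare capacity until it finds some.

    import time


    class Cluster:
        """Toy stand-in for an EBS cluster, tracking spare node capacity."""

        def __init__(self, spare_nodes):
            self.spare_nodes = list(spare_nodes)

        def find_spare_capacity(self):
            # Return a free node to mirror onto, or None if the cluster is full.
            return self.spare_nodes.pop() if self.spare_nodes else None


    class Node:
        """One storage node holding a copy of a volume's data."""

        def __init__(self, cluster):
            self.cluster = cluster
            self.partner = None        # the node holding the second copy
            self.io_suspended = False

        def lose_partner(self):
            # Called when the mirror copy becomes unreachable.
            self.partner = None
            self.io_suspended = True   # reads and writes stop immediately
            self.re_mirror()

        def re_mirror(self, retry_delay=0.1, max_tries=50):
            # Hunt for spare capacity; this normally resolves in milliseconds.
            # The real system keeps retrying indefinitely; the cap here just
            # lets the sketch terminate if no capacity ever appears.
            for _ in range(max_tries):
                candidate = self.cluster.find_spare_capacity()
                if candidate is not None:
                    self.partner = candidate
                    self.io_suspended = False
                    return
                time.sleep(retry_delay)

In a healthy cluster the search for spare capacity almost always succeeds on the first attempt; the outage was precisely the case where, for many nodes, it never did.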

It appears that during a routine system upgrade, all network traffic for the U.S. East Region was accidentally sent to the secondary network. Being slower and of lower capacity, the secondary network couldn't handle the traffic. The error was spotted and the change rolled back, but by that point the secondary network had largely filled up--leaving some nodes on the primary network unable to re-mirror successfully. A node that cannot re-mirror stops all data access until it has sorted out a backup, a process that ordinarily takes milliseconds but--it would transpire--would now take days, as Amazon engineers fought to fix the system.
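To see why shifting the region's traffic onto the secondary network was so damaging, a back-of-the-envelope calculation helps. The capacities and load below are invented purely for illustration; Amazon has not published the real figures.

    PRIMARY_GBPS = 10.0        # assumed capacity of the primary network
    SECONDARY_GBPS = 1.0       # assumed capacity of the slower secondary network
    offered_load_gbps = 8.0    # traffic that normally rides the primary network

    utilization = offered_load_gbps / SECONDARY_GBPS
    print(f"secondary network utilization: {utilization:.0%}")   # prints 800%
    # Anything over 100% means traffic backs up, so nodes cut off from their
    # partners cannot finish re-mirroring and their volumes stay frozen.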

Because of the re-mirroring storm that had arisen, it also became difficult to create new nodes, something that happens routinely during everyday EC2 usage. In fact, so many node-creation requests piled up unserviced that the EBS control system became partially unavailable.  

As a result, even more nodes attempted to re-mirror, the situation worsened, and the EBS control system was hit once again.
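The dynamic is a classic positive-feedback loop, which the toy model below tries to capture. The round structure, the arrival rate, and the control-plane capacity are all invented numbers for illustration, not figures from Amazon's report.

    def simulate(rounds=6, new_requests_per_round=200, control_plane_capacity=1000):
        """Track the backlog of re-mirroring and node-creation requests."""
        backlog = 0
        for r in range(1, rounds + 1):
            # Stuck nodes retry every round, and normal EC2 usage keeps adding
            # fresh node-creation requests on top of them.
            backlog += new_requests_per_round
            overloaded = backlog > control_plane_capacity
            print(f"round {r}: backlog={backlog}, overloaded={overloaded}")
            # With no spare capacity in the cluster, nothing in the backlog
            # clears, so the queue only grows and starves unrelated API calls.


    simulate()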

Fixing the problem was tricky because EBS was configured not to trust any node it thought had failed. The Amazon engineers therefore had to physically locate and connect new storage in order to create enough new nodes to meet demand--around 13 percent of existing volumes' worth, likely a huge amount of storage. They had also reconfigured the system to prevent further failures, which made bringing the new hardware online all the more difficult.