Amazon Web Services is no stranger to scale. Gartner estimated earlier this year that it has ten times the IaaS capacity of its next 14 competitors in the market – combined. But it seems that AWS wasn’t quite prepared for the impact that a new feature in its DynamoDB database would have, which contributed to a significant outage on Sunday in its US-East region.
First, a little background on the company’s NoSQL database: Amazon launched DynamoDB in 2012 as a low-latency, highly scalable service. Compared with a traditional SQL database, it’s meant to offer higher throughput and more consistent performance.
Late last year AWS added a new feature to DynamoDB named Global Secondary Indexes (GSI). When data is initially loaded into DynamoDB there is a key and a value associated with that key. GSI allows users to index the same data under different keys; it’s a handy tool for quickly reorganizing data and running new queries.
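To make the idea concrete, here is a minimal sketch in plain Python of what a secondary index does conceptually. It is not DynamoDB's implementation or API; the table contents, the `build_secondary_index` helper and the attribute names are all hypothetical, chosen only to show the same items being looked up by a different key.

```python
from collections import defaultdict

# Hypothetical table: each item is stored under a primary key (an order ID).
orders = {
    "o-1": {"customer": "alice", "total": 30},
    "o-2": {"customer": "bob", "total": 12},
    "o-3": {"customer": "alice", "total": 7},
}


def build_secondary_index(table, attribute):
    """Map each value of `attribute` to the primary keys of items that have it."""
    index = defaultdict(list)
    for primary_key, item in table.items():
        if attribute in item:
            index[item[attribute]].append(primary_key)
    return dict(index)


# Index the same data by customer instead of by order ID.
by_customer = build_secondary_index(orders, "customer")

# A query on the alternate key no longer needs to scan the whole table.
alice_orders = [orders[key] for key in by_customer["alice"]]
```

Every extra index like this is one more structure the service has to describe and keep track of, which is why the feature adds to the metadata load.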
When operating normally, DynamoDB relies on a complex system in which a metadata service tracks the various customer tables and partitions. Those partitions are housed on storage servers that constantly check the metadata to ensure they are up to date. In the early morning hours on Sunday, a network disruption prevented the storage servers from getting the metadata they needed to confirm they were up to date.
Here’s where the GSIs come in. GSIs have substantially increased the amount of metadata that DynamoDB handles because customers have multiple different configurations of data. The storage servers were requesting metadata, but a network error degraded the system’s ability to serve that information. A bottleneck cascaded into an outage. Error rates in the DynamoDB system rose to 55%. AWS attempted to resolve the issue by adding storage capacity, but it didn’t work. AWS basically had to shut down data requests to reset the system and add capacity. Doing so allowed error rates to fall back to 0.15-0.25%.
In response, AWS says it’s significantly increasing the capacity of its metadata and storage services. New performance monitoring controls will be installed to (hopefully) catch issues like this sooner, if not before they happen. AWS is also exploring how to geographically distribute the service to better isolate future problems.
What it means for AWS users
Be prepared – that’s the big takeaway from all this. BMC SVP and Chief Architect Bill Platt says customers should monitor any cloud service they’re using by the second to detect failures. Plans should be in place to automatically shift workloads to healthy instances when service disruptions are found. “Speed to reaction has never been more critical,” he says.
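As a rough illustration of that advice, the sketch below tracks a rolling error rate per endpoint and picks the first one that still looks healthy. It is a minimal sketch, not something from Platt or AWS: the `ErrorRateMonitor` class, the `pick_endpoint` helper, the 5% threshold, the window size and the region names are all assumptions made for the example.

```python
from collections import deque


class ErrorRateMonitor:
    """Rolling error rate over the last `window` requests (hypothetical helper)."""

    def __init__(self, window=100, threshold=0.05):
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold           # assumed acceptable error rate

    def record(self, ok):
        self.results.append(bool(ok))

    def error_rate(self):
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def healthy(self):
        return self.error_rate() <= self.threshold


def pick_endpoint(monitors, preferred_order):
    """Return the first endpoint whose error rate is within its threshold."""
    for name in preferred_order:
        if monitors[name].healthy():
            return name
    return None  # nothing healthy: the caller must degrade gracefully


# Illustrative data: the primary fails half its requests, the backup none.
monitors = {
    "us-east-1": ErrorRateMonitor(window=20),
    "us-west-2": ErrorRateMonitor(window=20),
}
for i in range(20):
    monitors["us-east-1"].record(i % 2 == 0)
    monitors["us-west-2"].record(True)

active = pick_endpoint(monitors, ["us-east-1", "us-west-2"])
```

In practice the outcomes fed to `record` would come from real requests or health checks, and shifting a workload also requires that its data is already available in the second location.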
“We apologize for the impact to affected customers,” AWS officials wrote. “While we are proud of the last three years of availability on DynamoDB (it’s effectively been 100%), we know how critical this service is to customers, both because many use it for mission-critical operations and because AWS services also rely on it. For us, availability is the most important feature of DynamoDB, and we will do everything we can to learn from the event and to avoid a recurrence in the future.”
One more thing
If it looks like a duck, walks like a duck and quacks like a duck… then it’s a duck. Sunday’s situation looked like an outage, impacted customers like an outage and brought down many AWS services like an outage… so it’s an outage, right? Not according to AWS. Perhaps it’s just semantics, but AWS is not calling it an outage, instead referring to it as a “service event” and a “disruption.” Can’t we call it what it was: An outage?
By Brandon Butler