Internal documents show how Amazon scrambled to fix Prime Day glitches
Eugene Kim | @eugenekim222
Published 2:43 PM ET Thu, 19 July 2018 | Updated 6:16 PM ET Thu, 19 July 2018 | CNBC.com
- Amazon wasn't able to handle the traffic surge and failed to secure enough servers to meet demand on Prime Day, according to experts who reviewed internal documents obtained by CNBC.
- That led to a cascading series of failures, including a slowdown in Sable, its internal computation and storage service, and in other services that depend on it, such as Prime, authentication and video playback.
- Amazon immediately launched a scaled-down "fallback" front page to reduce the workload and temporarily cut off all international traffic.
Amazon failed to secure enough servers to handle the traffic surge on Prime Day, causing it to launch a scaled-down backup front page and temporarily kill off all international traffic, according to internal Amazon documents obtained by CNBC.
And that took place within 15 minutes of the start of Prime Day — one of Amazon's biggest sales days every year.
The e-commerce giant also had to add servers manually to meet the traffic demand, indicating its auto-scaling feature may have failed to work properly leading up to the crash, according to external experts who reviewed the documents. “Currently out of capacity for scaling,” one of the updates said about the status of Amazon’s servers, roughly an hour after Prime Day’s launch. “Looking at scavenging hardware.”
A breakdown in an internal system called Sable, which Amazon uses to provide computation and storage services to its retail and digital businesses, caused a series of glitches across other services that depend on it, including Prime, authentication and video playback, the documents show.
Other teams, including Alexa, Prime Now and Twitch, also reported problems, while some warehouses said they weren’t even able to scan products or pack orders for a period of time.
The documents give a rare look into how Amazon responded to the higher-than-expected traffic surge on Prime Day, which caused glitches across the site for hours. They also illustrate the difficulty Amazon faced in dealing with the demand, despite its deep experience running a massive-scale website and one of the largest cloud computing platforms in the world.
“More people came in than Amazon could handle,” Matthew Caesar, a computer science professor at the University of Illinois and co-founder of cybersecurity firm Veriflow, said after CNBC shared the details of the documents. “And Amazon couldn’t use all the resources they had available because there was a bug or some other issue with their software."
Although the outage lasted for hours on Prime Day, the impact on overall sales was minimal. Amazon said it was the “biggest shopping event” in company history, with over 100 million products purchased by Prime members during the 36-hour event. Half a dozen sellers who spoke to CNBC also said they were happy with this year’s Prime Day sales, even after dealing with the downtime.
Amazon hasn’t said much publicly about the outage. It issued a single statement two hours after the site crash, succinctly saying “some customers are having difficulty shopping” and that it was working to “resolve the issue quickly.”
In an internal email seen by CNBC, Jeff Wilke, Amazon’s CEO of worldwide retail, noted that his team was “disappointed” about the site issues and said the company was already working on ways to prevent it from happening again. Then he highlighted all the ways that Prime Day was a success.
“Tech teams are already working to improve our architecture, and I’m confident we’ll deliver an even better experience next year,” he wrote in the email.
Amazon declined to comment.
The first hour
Amazon, based in Seattle, Washington, started seeing glitches across its site as soon as Prime Day launched at noon local time on Monday. In response, Amazon turned to its backup plans and made rapid changes during the first hour of the event.
Updates made at 12 p.m. say Amazon switched the front page to a simpler “fallback” page as it saw a growing number of errors. Amazon’s front page on Prime Day looked oddly simple and rather poorly designed, Caesar noted, saying the simplified page was likely put up to reduce the load on Amazon’s servers.
By 12:15 p.m., Amazon decided to temporarily cut off all international traffic to “reduce pressure” on its Sable system, and by 12:37 p.m., it reopened the default front page to only 25 percent of traffic. At 12:40 p.m., Amazon made changes that improved Sable’s performance, but just two minutes later it was back to considering blocking approximately 5 percent of “unrecognized traffic to U.S.,” according to one of the documents.
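The documents do not describe the mechanism Amazon used to split traffic between the full and fallback pages, but the figures reported (25 percent of traffic on the default front page, roughly 5 percent of “unrecognized” traffic considered for blocking) suggest some form of percentage-based admission. The Python sketch below is a generic illustration of that idea, hashing a request identifier into a stable bucket so that a fixed slice of visitors sees the full page; the function names and percentages are assumptions, not Amazon's actual code.

```python
import hashlib

# Hypothetical split based on the figures in the documents:
# 25% of traffic gets the full front page, the rest a lightweight fallback.
FULL_PAGE_PERCENT = 25

def bucket_for(request_id: str) -> int:
    """Map a request/session ID to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    return int(digest, 16) % 100

def choose_front_page(request_id: str) -> str:
    """Admit a fixed slice of traffic to the full page, send the rest to fallback."""
    if bucket_for(request_id) < FULL_PAGE_PERCENT:
        return "full_front_page"
    return "fallback_front_page"

if __name__ == "__main__":
    pages = [choose_front_page(f"session-{i}") for i in range(10_000)]
    print(pages.count("full_front_page") / len(pages))  # roughly 0.25
```

Because the bucket is derived from a hash of the session ID rather than a random draw, a given visitor keeps seeing the same version of the page while the percentage is dialed up or down.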
Even after making these changes, Amazon’s site “error rate” continued to worsen until about 1:05 p.m., before drastically improving at 1:10 p.m., an internal site performance chart shows. Some parts of Amazon saw order rates that were “significantly higher than expected” by a factor of two, one of the updates said. One person familiar with the matter described the office scene as “chaotic” and said at one point more than 300 people tuned in to an emergency conference call.
“They are obviously scrambling, on short notice, to restore services,” said Henning Schulzrinne, a computer science professor at Columbia University and the former CTO of the Federal Communications Commission, after CNBC shared details of the documents. “These problems tend to feed on themselves — people retry loading, making the problem worse, or services complete partially. So shutting off services is often the better, but obviously bad, option.”
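Schulzrinne’s point about retries feeding on themselves describes a well-known failure mode: when a service slows down, clients that retry immediately multiply the load on it. The documents don’t say how Amazon’s own clients handle retries; a common generic mitigation is exponential backoff with jitter, sketched below in Python with hypothetical function names.

```python
import random
import time

def call_with_backoff(request_fn, max_attempts=5, base_delay=0.5, cap=30.0):
    """Retry a flaky call with exponential backoff and full jitter,
    so many clients don't hammer a struggling service in lockstep."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential cap.
            delay = min(cap, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

if __name__ == "__main__":
    # Example: a request that fails half the time, standing in for an
    # overloaded backend service.
    def flaky():
        if random.random() < 0.5:
            raise RuntimeError("service overloaded")
        return "ok"

    print(call_with_backoff(flaky))
```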
Internal system "Sable" on red alert
Amazon chose not to shut off its site. Instead, it manually added servers to gradually improve site performance, according to the documents. One person wrote in a status update that he was adding 50 to 150 “hosts,” or virtual servers, because of the extra traffic.
Caesar says the root cause of the problem may have to do with a failure in Amazon’s auto-scaling feature, which automatically detects traffic fluctuations and adjusts server capacity accordingly. The fact that Amazon cut off international traffic first rather than immediately increasing the number of servers, and that it added server power manually instead of automatically, points to a breakdown in auto-scaling, a critical component when dealing with unexpected traffic spikes, he said.
“If their auto-scaling was working, things would have scaled automatically and they wouldn't have had this level of outage,” Caesar said. “There was probably an implementation or configuration error in their automatic scaling systems.”
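Amazon’s internal retail scaling tooling isn’t public, and the documents only suggest that automatic scaling stalled. As a rough illustration of what Caesar is describing, the sketch below uses the public AWS EC2 Auto Scaling API via boto3: a target-tracking policy that adds instances automatically as average CPU rises, followed by the kind of manual capacity override operators fall back on when automation fails. The group name and numbers are hypothetical.

```python
import boto3

autoscaling = boto3.client("autoscaling")

GROUP = "retail-frontend-asg"  # hypothetical Auto Scaling group name

# Target-tracking policy: keep average CPU near 50%, adding or removing
# instances automatically as load changes.
autoscaling.put_scaling_policy(
    AutoScalingGroupName=GROUP,
    PolicyName="keep-cpu-near-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)

# Manual override of the kind described in the documents: if automatic
# scaling stalls, an operator can force the desired host count up by hand.
autoscaling.set_desired_capacity(
    AutoScalingGroupName=GROUP,
    DesiredCapacity=150,
    HonorCooldown=False,
)
```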
Due to the lack of server power, Amazon saw extra pressure on Sable, an internal storage and computational system that plays a critical role in running multiple services across the site, according to documents seen by CNBC. Sable is used by 400 teams across Amazon and handled a total of 5.623 trillion service requests, or 63.5 million requests per second, during last year’s Prime Day, according to an internal document.
Sable was given a “red” emergency alert in one of the status updates, made a little past 1 p.m., which said it’s “running hot” and “cannot scale.” It also said other services, such as Prime, authentication and video playback, were being “impacted by Sable.”
“We are experiencing failures mostly related to Sable,” one of the updates said.
Carl Kesselman, a computer science professor at USC, said Amazon’s response to the outage was rather impressive because in many cases the site would have crashed entirely under those circumstances.
“Amazon is operating at a scale we haven’t operated before,” he said. “It’s not clear there’s a bad guy or an obvious screw-up. It’s just we’re in uncharted territory, and it’s amazing it didn’t just fall over.”
This year’s Prime Day was the first one run by Neil Lindsay, Amazon’s VP of worldwide marketing and Prime. Lindsay took over the Prime team after the former lead, Greg Greeley,
left the company for Airbnb earlier this year.