

Microsoft Azure Update Causes Global Service Outage


November 24, 2014

Microsoft Azure is a cloud computing platform that runs on Microsoft data centers spread around the globe. On Tuesday, Nov. 18, Azure experienced an outage of nearly half a day that affected customers in several regions across multiple continents.

According to a post on the official Azure blog by Jason Zander, corporate vice president for Azure, the service outage began Tuesday at 12:51 a.m. and lasted until 11:45 a.m. He describes a performance update that went awry and ultimately affected virtual machines, data storage and search, customer websites, and the Visual Studio Online platform. Although testing took place before the update was rolled out, intended to ensure that no problems would occur, the update caused an infinite loop in the storage blob front ends that resulted in the widespread service disruption.

“During the rollout we discovered an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting,” Zander said, referring to the pre-update test procedure Microsoft calls “flighting.”

“The net result was an inability for the front ends to take on further traffic, which in turn caused other services built on top to experience issues. Once we detected this issue, the change was rolled back promptly, but a restart of the storage front ends was required in order to fully undo the update,” he continued.

Below the text of Zander's apology and brief explanation of the outage, the blog post provides further details of the operation. Microsoft said it intended to deploy an update that would improve performance and reduce the CPU footprint of the Azure Table front ends. Pre-tests went as expected, and the update worked as desired on several production clusters. When the update was applied to the entire production service, however, a bug in the blob front ends surfaced and caused them to enter an infinite loop in which they could not accept traffic.

This affected the following regions: Central U.S., East U.S., East U.S. 2, North Central U.S., South Central U.S., West U.S., North Europe, West Europe, East Asia, Southeast Asia, Japan East, and Japan West. The reason so many regions were affected is that the Microsoft team applied the update to all servers at once instead of proceeding in increments. The blog post notes this error:

“Unfortunately the issue was wide spread, since the update was made across most regions in a short period of time due to operational error, instead of following the standard protocol of applying production changes in incremental batches,” it says.

This detail was picked up in the comments readers left below the blog post. They point out that the outage could have been far less severe if standard protocol had been followed and all servers had not been scheduled for the update at the same time.

From this point forward, Microsoft says it intends to enforce that standard protocol. It will also fix the bug in the blob front ends before attempting to apply the update a second time.
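For readers unfamiliar with the practice the blog post calls "applying production changes in incremental batches," the following is a minimal sketch of a staged rollout. It is not Microsoft's deployment tooling; the batch size, health check, and rollback hook are illustrative assumptions.

```python
# Hypothetical sketch of a staged (incremental) rollout -- not Azure's actual tooling.
# The idea: update a small batch of clusters, verify their health, and only then
# widen the rollout. A failure triggers a rollback instead of touching the
# remaining clusters, which limits the blast radius of an undetected bug.

from typing import Callable, Sequence

def staged_rollout(
    clusters: Sequence[str],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    batch_size: int = 2,
) -> bool:
    """Apply an update in small batches, halting and rolling back on failure."""
    updated = []
    for i in range(0, len(clusters), batch_size):
        batch = list(clusters[i:i + batch_size])
        for cluster in batch:
            apply_update(cluster)
            updated.append(cluster)
        # Verify the freshly updated batch before widening the rollout.
        if not all(is_healthy(cluster) for cluster in batch):
            for cluster in reversed(updated):
                rollback(cluster)
            return False  # remaining clusters were never touched
    return True
```

Had the change been deployed this way, the looping front ends would likely have shown up as unhealthy after the first batch, confining the disruption to a handful of clusters rather than most regions.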

A report at Healthcare IT News references a Ponemon Institute figure suggesting that outages can cost businesses $7,900 per minute of downtime. Although the Ponemon study focuses on healthcare providers, firms in other markets may incur similar costs. Spread across the globe, as Microsoft's disruption this past week was, the lost revenue from downtime could reach into the billions. Fortunately, service outages of this length and depth are rare. Even so, customers will not be able to get those hours back.
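A rough back-of-the-envelope calculation, assuming the Ponemon per-minute figure applies to a single affected business for the full outage window Zander describes, gives a sense of the scale involved:

```python
# Illustrative estimate only; both inputs come from figures cited in this article.
outage_minutes = (11 * 60 + 45) - (0 * 60 + 51)  # 12:51 a.m. to 11:45 a.m. = 654 minutes
cost_per_minute = 7_900                          # Ponemon Institute figure
per_business = outage_minutes * cost_per_minute
print(f"{outage_minutes} min x ${cost_per_minute:,}/min = ${per_business:,}")
# Roughly $5.2 million for one business; multiplied across thousands of affected
# customers worldwide, aggregate losses could plausibly reach into the billions.
```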




Edited by Maurice Nagle
