You may have heard about the recent storage outage that impacted some instances in the US East region back on March 16th. A root cause analysis of the outage is posted here.
March 16th US East Storage Outage
Customer impact: A subset of customers using Storage in the East US region may have experienced errors and timeouts while accessing their storage account in a single Storage scale unit
You might be asking, “What is a single Storage scale unit”. Well, you can think of it as a single storage cluster, or single SAN, or however you want to think about it. I don’t think Azure publishes their exact infrastructure, but you can probably assume that behind the scenes they are using Scale Out File Servers for backend storage.
So the question is, how could I have survived this outage with minimal downtime? If you read further down that root cause analysis you come across this little nugget.
Virtual Machines using Managed Disks in an Availability Set would have maintained availability during this incident.
What’s Managed Disks you ask? Well, just on February 8th Corey Sanders announced the GA of Managed Disks. You can read all about Managed Disks here. https://azure.microsoft.com/en-us/services/managed-disks/
The reason why Managed Disks would have helped in this outage is that by leveraging an Availability Set combined with Managed Disks you ensure that each of the instances in your Availability Set are connected to a different “Storage scale unit”. So in this particular case, only one of your cluster nodes would have failed, leaving the remaining nodes to take over the workload.
Prior to Managed Disks being available (anything deployed before 2/8/2016), there was no way to ensure that the storage attached to your servers resided on different Storage scale units. Sure, you could use different storage accounts for each instances, but in reality that did not guarantee that those Storage Accounts provisioned storage on different Storage scale units.
So while an Availability Set ensured that your instances reside in different Fault Domains and Update Domains to ensure the availability of the instance itself, the additional storage attached to each instance really represented a single point of failure. Although the storage itself is highly resilient, with three copies of your data and geo-redundant options available, in this case with a power failure the entire Storage scale unit went down along with all the servers attached to it.
So long story short…migrate to Managed Disk as soon as possible in order to help minimize downtime
And if you really want to minimize downtime you should consider Hybrid Cloud Deployments that span cloud providers or on-prem to cloud!