Azure Outage Post-Mortem Part 3

My previous blog posts, Azure Outage Post-Mortem – Part 1 and Azure Outage Post-Mortem Part 2,made some assumptions based upon limited information coming from blog posts and twitter. I just attended a session at Ignite which gave a little more clarity as to what actually happened. Sometime tomorrow you should be able to view the session for yourself.

BRK3075 – Preparing for the unexpected: Anatomy of an Azure outage

The official Root Cause Analysis they said will be published soon, but in the meantime here are some tidbits of information gleaned from the session.

The outage was NOT caused by a lightning strike as previously reported. Instead, due to the nature of the storm there were electrical storm sags and swells, which locked out a chiller plant in the 1st datacenter. During this first outage they were able to recover the chiller quickly with no noticeable impact. Shortly thereafter, there was a second outage at a second datacenter which was not recovered properly, which began an unfortunate series of events.

During this 2nd outage, Microsoft states that “Engineers didn’t triage alerts correctly – chiller plant recovery was not prioritized”. There were numerous alerts being triggered at this time, and unfortunately the chiller being offline did not receive the priority it should have. The RCA as to why that happened is still being investigated.

Microsoft states that of course redundant chiller systems are in place. However, the cooling systems were not set to automatically failover. Recently installed new equipment had not been fully tested, so it was set to manual mode until testing had been completed.

After 45 minutes the ambient cooling failed, hardware shutdown, air handlers shut down because they thought there was a fire, and staff had been evacuated due to the false fire alarm. During this time temperature in the data center was increasing and some hardware was not shut down properly, causing damage to some storage and networking.

After manually resetting the chillers and opening the air handlers the temperature began to return to normal. It took about 3 hours and 29 minutes before they had a complete picture of the status of the datacenter.

The biggest issue was there was damage to storage. Microsoft’s primary concern is data protection, so short of the enter datacenter sinking into a sinkhole or a meteor strike taking out the datacenter, Microsoft will work to recover data to ensure no data loss. This of course took some time, which extend the overall length of the outage. The good news is that no customer data was lost, the bad news is that it seemed like it took 24-48 hours for things to return to normal, based upon what I read on Twitter from customers complaining about the prolonged outage.

Everyone expected that this outage would impact customers hosted in the South Central Region, but what they did not expect was that the outage would have an impact outside of that region. In the session, Microsoft discusses some of the extended reach of the outage.

Azure Service Manager (ASM) – This controls Azure “Classic” resources, AKA, pre-ARM resources. Anyone relying on ASM could have been impacted. It wasn’t clear to me why this happened, but it appears that South Central Region hosts some important components of that service which became unavailable.

Visual Studio Team Service (VSTS) – Again, it appears that many resources that support this service are hosted in the South Central Region. This outage is described in great detail by Buck Hodges (@tfsbuck), Director of Engineering, Azure DevOps this blog post.

Postmortem: VSTS 4 September 2018

Azure Active Directory (AAD) – When the South Central region failed, AAD did what it was designed to due and started directing authentication requests to other regions. As the East Coast started to wake up and online, authentication traffic started picking up. Now normally AAD would handle this increase in traffic through autoscaling, but the autoscaling has a dependency on ASM, which of course was offline. Without the ability to autoscale, AAD was not able to handle the increase in authentication requests. Exasperating the situation was a bug in Office clients which made them have very aggressive retry logic, and no backoff logic. This additional authentication traffic eventually brought AAD to its knees.

They ran out of time to discuss this further during the Ignite session, but one feature that they will be introducing will be giving users the ability to failover Storage Accounts manually in the future. So in the case where recovery time objective (RTO) is more important than (RPO) the user will have the ability to recover their asynchronously replicated geo-redundant storage in an alternate data center should Microsoft experience another extended outage in the future.

Until that time, you will have to rely on other replication solutions such as SIOS DataKeeper Azure Site Recovery, or application specific replication solutions which give you the ability to replicate data across regions and put the ability to enact your disaster recovery plan in your control.

 

 

Azure Outage Post-Mortem Part 3

Azure Outage Post-Mortem Part 2

My previous blog post says that Cloud-to-Cloud or Hybrid-Cloud would give you the most isolation from just about any issue a CSP could encounter. However, in this particular failure had Availability Zones been available in the South Central region most of the downtime caused by this natural disaster could have been avoided. Microsoft published a Preliminary RCA of the September 4th South Central Outage.

The most important part of that whole summary is as follows…

“Despite onsite redundancies, there are scenarios in which a datacenter cooling failure can impact customer workloads in the affected datacenter.”

What does that mean to you? If your applications all run in the same datacenter you are susceptible to the same type of outage in the future. In Microsoft’s defense, this really shouldn’t be news to you as this has always been true whether you run in Azure, AWS, Google or even your own datacenter. Failure to plan ahead with data replication to a different datacenter and a plan in place to quickly recover your applications in those datacenters in the event of a disaster is simply a lack of planning on your part.

While Microsoft doesn’t publish exact Availability Zone locations, if you believe this map published here you could guess that they are probably anywhere from a 2-10 miles apart from each other.

Azure Datacenters.png

In all but the most extreme cases, replicating data across Availability Zones should be sufficient for data protection. Some applications such as SQL Server have built in replication technology, but for a broad range of applications, operating systems and data types you will want to investigate block level replication SANless cluster solutions. SANless cluster solutions have traditionally been used for multisite clusters, but the same technology can also be used in the cloud across Availability Zones, Regions, or Hybrid-Cloud for high availability and disaster recovery.

Implementing a SANless cluster that spans Availability Zones, whether it is Azure, AWS or Google, is a pretty simple process given the right tools. Here are a few resources to help get you started.

Step-by-Step: Configuring a File Server Cluster in Azure that Spans Availability Zones

How to Build a SANless SQL Server Failover Cluster Instance in Google Cloud Platform

MS SQL Server v.Next on Linux with Replication and High Availability #Azure #Cloud #Linux

Deploying Microsoft SQL Server 2014 Failover Clusters in #Azure Resource Manager (ARM)

SANless SQL Server Clusters in AWS

SANless Linux Cluster in AWS Quick Start

If you are in Azure you may also want to consider Azure Site Recovery (ASR). ASR lets you replicate the entire VM from one Azure region to another region. ASR will replicate your VMs in real-time and allow you to do a non-disruptive DR test whenever you like. It supports most versions of Windows and Linux and is relatively easy to set up.

You can also create replication jobs that have “Multi-VM Consistency”, meaning that servers that must be recovered from the exact same point in time can be put together in this consistency group and they will have the exact same recovery point. What this means is if you wanted to build a SANless cluster with DataKeeper in a single region for high availability you have two options for DR. One is you could extend your SANless cluster to a node in a different region, or else you could simply use ASR to replicate both nodes in a consistency group.

asr

The trade off with ASR is that the RPO and RTO is not as good as you will get with a SANless multi-site cluster, but it is easy to configure and works with just about any application. Just be careful, if your application exceeds 10 MBps in disk write activity on a regular basis ASR will not be able to keep up. Also, clusters based on Storage Spaces Direct cannot be replicated with ASR and in general lack a good DR strategy when used in Azure.

For a while after Managed Disks were released ASR did not fully support them until about a year later. Full support for Managed Disks was a big hurdle for many people looking to use ASR. Fortunately since about February of 2018 ASR fully supports Managed Disks. However, there is another problem that was just introduced.

With the introduction of Availability Zones ASR is once again caught behind the times as they currently don’t support VMs that have been deployed in Availability Zones.

2018-09-25_00-10-24
Support matrix for replicating from one Azure region to another

I went ahead and tried it anyway. I seemed to be able to configure replication and was able to do a test failover.

ASR-and-AZ
I used ASR to replicate SQL1 and SQL3 from Central to East US 2 and did a test failover. Other than not placing the VMs in AZs in East US 2 it seems to work.

I’m hoping to find out more about this limitation at the Ignite conference. I don’t think this limitation is as critical as the Managed Disk limitation was, just because Availability Zones aren’t widely available yet. So hopefully ASR will pick up support for Availability Zones as other regions light up Availability Zones and they are more widely adopted.

 

 

Azure Outage Post-Mortem Part 2

Azure Outage Post-Mortem – Part 1

The first official Post-Mortems are starting to come out of Microsoft in regards to the Azure Outage that happened last week. While this first post-mortem addresses the Azure DevOps outage specifically (previously known as Visual Studio Team Service, or VSTS), it gives us some additional insight into the breadth and depth of the outage, confirms the cause of the outage, and gives us some insight into the challenges Microsoft faced in getting things back online quickly. It also hints at some some features/functionality Microsoft may consider pursuing to handle this situation better in the future.

As I mentioned in my previous article, features such as the new Availability Zones being rolled out in Azure, might have minimized the impact of this outage. In the post-mortem, Microsoft confirms what I previously said.

The primary solution we are pursuing to improve handling datacenter failures is Availability Zones, and we are exploring the feasibility of asynchronous replication.

Until Availability Zones are rolled out across more regions the only disaster recovery options you have are cross-region, hybrid-cloud or even cross-cloud asynchronous replication. Software based #SANless clustering solutions available today will enable such configurations, providing a very robust RTO and RPO, even when replicating great distances.

When you use SaaS/PaaS solutions you are really depending on the Cloud Service Provider (CSPs) to have an iron clad HA/DR solution in place. In this case, it seems as if a pretty significant deficiency was exposed and we can only hope that it leads all CSPs to take a hard look at their SaaS/PaaS offerings and address any HA/DR gaps that might exist. Until then, it is incumbent upon the consumer to understand the risks and do what they can to mitigate the risks of extended outages, or just choose not to use PaaS/SaaS until the risks are addressed.

The post-mortem really gets to the root of the issue…what do you value more, RTO or RPO?

I fundamentally do not want to decide for customers whether or not to accept data loss. I’ve had customers tell me they would take data loss to get a large team productive again quickly, and other customers have told me they do not want any data loss and would wait on recovery for however long that took.

It will be impossible for a CSP to make that decision for a customer. I can’t see a CSP ever deciding to lose customer data, unless the original data is just completely lost and unrecoverable. In that case, a near real-time async replica is about as good as you are going to get in terms of RPO in an unexpected failure.

However, was this outage really unexpected and without warning? Modern satellite imagery and improvements in weather forecasting probably gave fair warning that there was going to be significant weather related events in the area.

With hurricane Florence bearing down on the Southeast US as I write this post, I certainly hope if your data center is in the path of the hurricane you are taking proactive measures to gracefully move your workloads out of the impacted region. The benefit of a proactive disaster recovery vs a reactive disaster recovery are numerous, including no data loss, ample time to address unexpected issues, and managing human resources such that employees can worry about taking care of their families, rather than spending the night at a keyboard trying to put the pieces back together again.

Again, enacting a proactive disaster recovery would be a hard decision for a CSP to make on behalf of all their customers, as planned migrations across regions will incur some amount of downtime. This decision will have to be put in the hands of the customer.

Slide 2.png
Hurricane Florence Satellite Image taken from the new GOES-16 Satellite, courtesy of Tropical Tidbits

So what can you do to protect your business critical applications and data? As I discussed in my previous article, cross-region, cross-cloud or hybrid-cloud models with software based #SANless cluster solutions are going to go a long way to address your HA/DR concerns, with an excellent RTO and RPO for cloud based IaaS deployments. Instead of application specific solutions, software based, block level volume replication solutions such SIOS DataKeeper and SIOS Protection Suite replicate all data, providing a data protection solution for both Linux and Windows platforms.

My oldest son just started his undergrad degree in Meteorology at Rutgers University. Can you imagine a day when artificial intelligence (AI) and machine learning (ML) will be used to consume weather related data from NOAA to trigger a planned disaster recovery migration, two days before the storm strikes? I think I just found a perfect topic for his Master’s thesis. Or better yet, have him and his smart friends at the WeatherWatcher LLC get funding for a tech startup that applies AI and ML to weather related data to control proactive disaster recovery events.

I think we are just at the cusp of  IT analytics solutions that apply advanced machine-learning technology to cut the time and effort you need to ensure delivery of your critical application services. SIOS iQ is one of the solutions leading the way in that field.

Batten down the hatches and get ready, Hurricane season is just starting and we are already in for a wild ride. If you would like to discuss your HA/DR strategy reach out to me on Twitter @daveberm.

Azure Outage Post-Mortem – Part 1

SQL Server 2017 on Linux Availability Group Split Brain Problem

On July 18th, 2018 Microsoft published this support article with some guidance to help avoid Split Brain when using Availability Groups with SQL Server on Linux.

https://support.microsoft.com/en-us/help/4341219/split-brain-occurs-after-failover-when-using-alwayson-ags-with-externa

Running SQL Server on Linux can have some advantages, including cost savings on the OS if running in Azure. Run the numbers yourself, as the number of cores go up your cost savings year over year can be substantial, considering you are licensing at least two servers for every cluster pair.

https://azure.microsoft.com/en-us/pricing/calculator/

However, why bother saving money if the technology is not rock solid? One of the biggest issues I see with running SQL Server on Linux is the lack of a cohesive HA/DR story. On Windows, Microsoft owns the whole HA stack and SQL Server relies heavily on Windows Server Failover Clustering to support both Availability Groups and Failover Cluster Instances. This has been running well for many years and has a long track record of success stories.

When moving to Linux, Microsoft no longer owns the HA stack at the OS level and depending upon your distro of Linux, you are left trying to piece together open source solutions like Pacemaker, trying to get things to cooperate with SQL Server Availability Groups.

While you may eventually get it to work, I would much rather look to a 3rd party high availability solution like the SIOS Protection Suite for Linux (SPS-L), giving you a tried and true HA solution for your business critical applications running on Linux.

Azure-Linux-SQLServer.png
SQL Server on Linux Cluster in Azure

SPS-L has been protecting business critical applications running on Linux since 1999. It is a full HA/DR solution that monitors and recovers the entire application stack as well as the physical servers and network to ensure your business critical applications are highly available while also maintaining a 3rd copy for disaster recover in a remote datacenter or different geographic region of the cloud.

The other benefit of SPS-L is that it doesn’t require the Enterprise Edition of SQL Server, so there can be a significant cost savings advantage on SQL Server licenses as well. If you consider SQL Server Standard Edition costs $1859 per core vs $7128 per core for SQL Server Enterprise Edition, the cost savings advantage can be significant, depending upon how many cores you need to license.

Below is a video demonstration of SPS-L protecting SQL Server running on Linux in the Azure Cloud. The demonstration shows a SQL Server Standard Edition Cluster being manually failed over between nodes in different Azure Fault Domains as well as SPS-L responding to an unexpected failure.

 

 

SQL Server 2017 on Linux Availability Group Split Brain Problem