Lightning Never Strikes Twice: Surviving the #Azure Cloud Outage

Yesterday morning I opened my Twitter feed to find that many people were impacted by an Azure outage. When I tried to access the resource page that described the outage and the resources currently impacted, even that page was unavailable. @AzureSupport was providing updates via Twitter.

The original update from @AzureSupport came in at 7:12 AM EDT

Azure Outage 2

Looking back on the Twitter feed it seems as if the problem initially began an hour or two before that.

Azure Support 10

It quickly became apparent that the outage had a wider impact than just the SOUTH CENTRAL US region as originally reported. It seems that services that relied on Azure Active Directory could have been impacted as well, and customers trying to provision new subscriptions were having issues.

Azure 11

And 24 hours later the problem has not been completely resolved, according to the last update this morning…

Azure Outage 1

So what could you have done to minimize the impact of this outage? No one can blame Microsoft for a natural disaster such as a lightning strike. But at the end of the day if your only disaster recovery plan is to call, tweet and email Microsoft until the issue is resolved, you just received a rude awakening. IT IS UP TO YOU to ensure you have covered all the bases when it comes to your disaster recovery plan.

While the dust is still settling on exactly what was impacted and what customers could have done to minimize the downtime, here are some of my initial thoughts.

Availability Sets (Fault Domains/Update Domains) – In this scenario, even if you built Failover Clusters, or leveraged Azure Load Balancers and Availability Sets, it seems the entire region went offline so you still would have been out of luck. While it is still recommended to leverage Availability Sets, especially for planned downtime, in this case you still would have been offline.

Availability Zones – While not available in the SOUTH CENTRAL US region yet, it seems that the concept of Availability Zones being rolled out in Azure could have minimized the impact of the outage. Assuming the lightning strike only impacted one datacenter, the other datacenter in the other Availability Zone should have remained operational. However, the outages of the other non-regional services such as Azure Active Directory (AAD) seem to have impacted multiple regions, so I don’t think Availability Zones would have isolated you completely.

Global Load Balancers, Cross Region Failover Clusters, etc. – Whether you are building SANLess clusters that cross regions, or using global load balancers to spread the load across multiple regions, you may have minimized the impact of the outage in SOUTH CENTRAL US, but you may have still been susceptible to the AAD outage.

Hybrid-Cloud, Cross Cloud – About the only way you could guarantee resiliency in a cloud wide failure scenario such as the one Azure just experienced is to have a DR plan that includes having realtime replication of data to a target outside of your primary cloud provider and a plan in place to bring applications online quickly in this other location. These two locations should be entirely independent and should not rely on services from your primary location to be available, such as AAD. The DR location could be another cloud provider, in this case AWS or Google Cloud Platform seem like logical alternatives, or it could be your own datacenter, but that kind of defeats the purpose of running in the cloud in the first place.

Software as a Service – While Software as a Service such as Azure Active Directory (AAD), Azure SQL Database (Database-as-a-Service) or one of the many SaaS offerings from any of the cloud providers can seem enticing, you really need to plan for the worst case scenario. Because you are trusting a business critical application to a single vendor, you may have very little control over DR options that include recovery OUTSIDE of the current cloud service provider. I don’t have any words of wisdom here other than to investigate your DR options before implementing any SaaS service, and if recovery outside of the cloud is not an option then think long and hard before you sign up for that service. At a minimum, make the business stakeholders aware that if the cloud service provider has a really bad day and that service is offline, there may be nothing you can do about it other than call and complain.

I think in the very near future you will start to hear more and more about cross cloud availability and people leveraging solutions like SIOS DataKeeper to build robust HA and DR strategies that cross cloud providers. Truly cross cloud or hybrid cloud models are the only way to truly insulate yourself from most conceivable cloud outages.

If you were impacted by this latest outage I’d love to hear from you. Tell me what went down, how long you were down, and what you did to recover. What are you planning to do so that in the future your experience is better?

“Incomplete Communication with Cluster” with local Storage Space for SQL Server cluster

When building a SANless SQL Server cluster with SIOS DataKeeper, or when configuring Always On Availability Groups for SQL Server, you may consider striping together multiple disks in a Simple Storage Space (RAID 0) for performance. This is very commonly done in the cloud, where each instance is typically backed by hardware resiliency, so RAID 0 is not really all that risky.

For instance, I had a recent customer in AWS that wanted to max out his IOPS to 80,000, the maximum IOPS currently available to a single instance. Now keep in mind, only the largest EBS optimized instance sizes support 80,000 IOPS, so you want to make sure you know what maximum IOPS your particular instance size supports.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html

In this case we had a c5.18xlarge instance, which does support 80,000 IOPS. However, any individual EBS Provisioned IOPS volume only supports up to 32,000 IOPS. The only way to achieve 80,000 IOPS when writing to any single volume is to stripe three of these volumes together in a Simple Storage Space.
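The arithmetic behind "three volumes" can be sketched in a few lines of Python. This is only a back-of-the-envelope check using the two limits quoted above; the function name is made up for illustration, and it assumes the stripe spreads I/O evenly across volumes.

```python
import math

def volumes_needed(target_iops, per_volume_iops):
    """Smallest number of equally sized volumes whose combined IOPS reaches the target."""
    return math.ceil(target_iops / per_volume_iops)

INSTANCE_LIMIT = 80_000   # instance IOPS ceiling quoted above
VOLUME_LIMIT = 32_000     # per-volume Provisioned IOPS ceiling quoted above

count = volumes_needed(INSTANCE_LIMIT, VOLUME_LIMIT)
# IOPS to provision on each volume so the stripe as a whole reaches the target
per_volume = math.ceil(INSTANCE_LIMIT / count)

print(count, "volumes, about", per_volume, "IOPS each")
```

Two volumes top out at 64,000 IOPS, which is why the third volume is needed even though each one then runs below its own limit.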

Herein lies the rub: if you try to do that in an existing cluster, things are going to go haywire pretty fast. Fellow MVP Joey D’Antoni recently blogged about the issue and it appears to still be an issue in the Windows Server 2019 preview.

Just as Joey suggests, I always advise my customers to build out the nodes and any Storage Spaces BEFORE they start the clustering process. This makes the process go much smoother. It also gives the customer some time to benchmark the server’s performance before they add any replication, to ensure everything is working as expected.

 

 

iCal events still appearing on WordPress Upcoming Events Widget even after deleted

I sync an iCal calendar feed to a WordPress site using the Upcoming Events Widget. I do this primarily because it is the only calendar widget available with the free version of WordPress this non-profit is using.

The issue I had is that I deleted an event last week, yet it still shows in my WordPress site through this sidebar widget. I would think it was a refresh issue, but edits to other events at the same time have already sync’d to the WordPress site.

This event was deleted from my iCal calendar, yet it still appears in my Upcoming Events Widget

I even tried deleting the Widget and adding a new one and the new widget also shows the events that I deleted.

I was beginning to get a bit frustrated, and of course I had band parents breathing down my neck questioning my ability to manage the band parent association website. Okay, not really, but they did point out the discrepancy on the calendar and I assumed that they just thought I was an idiot.

After a bit of Googling turned up nothing, it occurred to me that this was a recurring event. When I deleted that event, I had deleted only that particular occurrence, not the whole series of events.

Sure enough, after I deleted the whole series of recurring events the event disappeared from the Upcoming Events Sidebar widget about an hour later. Long story short, the Upcoming Events Widget in WordPress does not handle exceptions to recurring events reliably.
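For anyone curious why a single deleted occurrence can linger, here is a rough Python sketch of how recurrence exceptions are meant to work. In the iCalendar format (RFC 5545), deleting one occurrence typically leaves the series in place and records the skipped date as an EXDATE. I am only assuming the widget ignores that exception list; I have not looked at its code, and the function and dates below are made up for illustration.

```python
from datetime import date, timedelta

def expand_weekly(start, count, exdates=()):
    """Expand a simple weekly recurrence, skipping any excluded dates."""
    excluded = set(exdates)
    occurrences = [start + timedelta(weeks=i) for i in range(count)]
    return [d for d in occurrences if d not in excluded]

series_start = date(2018, 9, 3)   # hypothetical weekly rehearsal
deleted = [date(2018, 9, 10)]     # the single occurrence removed in the calendar

# A consumer that honors the exception list drops the deleted date...
print(expand_weekly(series_start, 4, deleted))
# ...while one that ignores it still shows all four occurrences.
print(expand_weekly(series_start, 4))
```

Deleting the whole series sidesteps the problem because there is no exception left for the consumer to misread.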

Your student could be the next Doogie Howser of Cloud Computing with free training and cloud computing resources

Students with any interest in Information Technology or Computer Science are going to be joining a world dominated by Cloud Computing. And of course the major cloud service providers (CSPs) would all love to see young people embrace their cloud platform to host the next big thing like Facebook, Instagram or Snapchat. The top three CSPs all have free offerings for students, hoping to win their hearts and minds.

But before you jump right in to cloud computing, the novice student might want to start with some basic fundamentals of computer programming at one of the many free online resources, including Khan Academy.

Microsoft is offering free Azure services for students. There are two different offerings. The first is targeted at high school students ages 13+ and the second is geared towards college students 18+.

Microsoft Azure for Students Starter Offer is for those high school students that are interested in building applications in the cloud. While there are not as many free services or credits as are offered at the college level, there is certainly enough available for free for the self starter to get some real hands-on experience with cutting edge technology. How cool would it be for your high school to start a Cloud Computing Club, or to integrate this offering into some of the IT classes students may already be taking?

Azure for Students is targeted at the college level student and has many more features available for free. Any student in computer science or information technology should definitely get some hands on experience with these cutting edge cloud technologies and this is the perfect way to do it with no additional out of pocket expense.

A good way to get introduced to the Azure Cloud is to start with some free online training courses Microsoft delivers in partnership with Pluralsight.

AWS Educate. Not to be outdone, AWS also offers some free cloud services to students and educators. These seem to be in terms of free cloud credits, which if managed properly can go a long way. AWS also delivers an educational program that can be combined with an AP class in Computer Science if your high school wants to participate.

 

Google Cloud Platform (GCP) also has education grants available. These seem to be the most restrictive of the three, as they are available only to computer science majors at accredited universities.

GCP also offers training, but from what I can find there are no free training offerings. If you want some hands-on training you will have to register for some classes. The plus side of this is that these classes all seem to be instructor led, either online or in an actual classroom. The downside is I don’t think a lot of 13 year olds are going to shell out any money to start developing on GCP when there are other free training opportunities available on AWS or Azure.

For the ambitious young student, the resources are certainly there for you to be the next Doogie Howser of Cloud Computing.

 

Help! I can’t connect to my SQL Server multi-subnet failover cluster

I get that kind of call or email from customers all the time. I have a generic response as follows…

This has everything you need to know.

They don’t go into great detail about what to do if your connection does not support multisubnetfailover=true. If your connection does NOT support that parameter, then set registerallprovidersip to false and cleanup DNS. That procedure is described best here.
I figure I get this question often enough I probably should just flesh out my response a bit, hence the reason for this post.
In general people just aren’t aware of how multi-subnet failover clusters work. Multi-subnet failover clustering support was added in Windows Server 2008 with the addition of “OR” logic when defining cluster resource dependencies. This allows a Cluster Name resource to be dependent upon IP Address x.x.x.x OR IP Address y.y.y.y.
x.x.x.x would be a cluster IP address resource valid in Subnet A and y.y.y.y would be a cluster IP address resource valid in Subnet B. Only one address will be online at any given time: whichever address is valid for the subnet where the resource is currently running.
Microsoft SQL Server started supporting this concept with SQL Server 2012, both for failover cluster instances (FCI) using third-party SANless clustering solutions like SIOS DataKeeper and for SQL Server Always On Availability Groups.
By default, if you create a SQL Server multi-subnet failover cluster, the cluster should be automatically configured optimally, including setting up the two IP addresses, adding two A records to DNS and setting RegisterAllProvidersIP to true. However, on the client end you need to tell it that you are connecting to a multi-subnet failover cluster; otherwise the connection may time out or fail.

Configuring the client

Configuring the client is done by adding multisubnetfailover=true to the connection string. This Microsoft documentation is a great resource, but if you just search for multisubnetfailover=true you will find a lot of information about that setting.
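As a rough illustration, here is a small Python helper that adds the setting to an existing connection string. It is only a sketch: the helper name, server name and database name are made up, it uses the SqlClient-style spelling MultiSubnetFailover=True (some drivers expect a different value such as Yes, so check your driver's documentation), and the naive split on semicolons would not cope with values that themselves contain semicolons.

```python
def with_multisubnet_failover(conn_str):
    """Return conn_str with MultiSubnetFailover=True set exactly once."""
    parts = [p.strip() for p in conn_str.split(";") if p.strip()]
    # Drop any existing setting so the keyword is not duplicated
    kept = [
        p for p in parts
        if p.split("=", 1)[0].strip().lower() != "multisubnetfailover"
    ]
    kept.append("MultiSubnetFailover=True")
    return ";".join(kept) + ";"

# Hypothetical listener name and database, for illustration only
original = "Server=tcp:sqlcluster.example.com,1433;Database=AppDb;Integrated Security=SSPI"
print(with_multisubnet_failover(original))
```

With that keyword in place, a supporting client tries the IP addresses registered for the cluster name in parallel rather than waiting for the offline one to time out.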
However, not every application will support adding that to the connection string. If you find yourself in that situation, you should ask your application vendor to add support for it or show you how to do it.
All is not lost in the meantime. You will want to change the behavior of the cluster so that upon failover DNS is updated and the single A record associated with the cluster client access point points to the new IP address. This is in lieu of having two A records in DNS, one with each cluster IP address, which is the default behavior in a multi-subnet cluster.
This article references SharePoint, which you can ignore; the rest of the article describes the process you should follow pretty well.
The highlights of that article are as follows…
Get-ClusterResource "[Network Name]" | Set-ClusterParameter RegisterAllProvidersIP 0
After restarting the cluster-name-object (basically restarting the role) & cleaning up all “A” records manually (clean-up isn’t done automatically) we can see our old A-records are still in DNS so we’ll need to delete those manually.
In addition to those steps I’d advise you to reduce the HostRecordTTL as described in this article.
The highlight of that article is as follows.
PS C:\> Get-ClusterResource -Name cluster1FS | Set-ClusterParameter -Name HostRecordTTL -Value 300
With a Value of 300 you could potentially be waiting up to 5 minutes for your clients to reconnect after a failover, or even longer if you have a large Active Directory infrastructure and AD replication takes some time to update all the DNS servers across your infrastructure.
You are going to want to figure out what the optimal TTL is to facilitate quick client reconnections without overburdening your DNS servers with a bunch of DNS lookup requests.
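To make that trade-off concrete, here is a simple Python sketch. It treats the worst-case reconnect delay as the TTL plus however long DNS replication takes, and it assumes a client re-resolves the name as soon as its cached record expires. The 180 second replication delay and the function names are made-up values for illustration, not measurements.

```python
def worst_case_reconnect_seconds(host_record_ttl, dns_replication_delay=0):
    """Upper bound on how long a client may keep using the stale A record."""
    return host_record_ttl + dns_replication_delay

def lookups_per_hour(host_record_ttl):
    """Lookups per client per hour if it re-resolves whenever the record expires."""
    return 3600 / host_record_ttl

# Compare a few candidate TTLs against a hypothetical 180 second replication delay
for ttl in (300, 60, 30):
    delay = worst_case_reconnect_seconds(ttl, dns_replication_delay=180)
    print(f"TTL {ttl}s: up to {delay}s to reconnect, "
          f"{lookups_per_hour(ttl):.0f} lookups/hour per client")
```

Lowering the TTL shortens the wait, but the lookup rate climbs in direct proportion, and no TTL setting helps with the replication portion of the delay.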
This type of configuration is common in disaster recovery configurations where your DR site is in a different subnet. It is also very common in HA deployments in AWS because different Availability Zones are in different subnets.
Let me know if you have any questions. You can always reach me on Twitter @daveberm
First impression of the new Gmail Interface

As a Mac user I have been a little frustrated with using Gmail as there really is no Gmail user experience on the Mac as good as running Outlook 2016 on Windows 10 with G Suite Sync for Microsoft Outlook. I had high hopes for Outlook 2016 for Mac and I was an early adopter once they lit up the preview of Gmail integration that allowed you to sync both calendars and email, but it fell short in one major area.

There were a few rough edges in the preview, but for the most part it was working as advertised. However, the one big stumbling block for me is there is no way to limit the amount of email that Outlook will download and store locally. This is a very basic feature that Outlook on Windows users get when they use G Suite Sync to sync the Gmail account to Outlook on Windows.

When they implement that feature I’ll give it another try, but for now Outlook really isn’t an option for an email hoarder like me, especially given the relatively small size of the flash drives in the MacBook Pro. If you are reading this and they have added that feature, drop me a note and I’ll give it another try.

For a while I was keeping a Windows VM around just to use Outlook. After a while I realized that this was just a little ridiculous and I decided to cut the cord and go cold turkey…no more Outlook. I forced myself to embrace the Google web interface and just lived without the seamless integration of Outlook Contacts, Calendar and Email.

Honestly I hated it. My inbox never seemed to get empty, and all my tools that integrated WebEx, Calendars and email just weren’t there, or maybe I just didn’t know how to use it properly, or know what Chrome extensions to use. But I suffered through and sucked it up.

Then I stumbled upon Inbox by Google, I think by accidentally typing inbox.google.com rather than mail.google.com. My first impression was “wow, this looks really nice”. I didn’t know much about it so I did a little reading and quickly downloaded the iOS version. Obviously if you know anything about Inbox you know it has actually been publicly available since May of 2015.

If I think way back to 2014 I think I may have had a friend send me an invite to the private preview and I may have even looked at it. But at that time I was a tried and true Windows 7 and Office 2010 user, so some new web interface to check Gmail really didn’t interest me.

But since I have been torturing myself with the regular Gmail interface, I have come to find a brand new appreciation for Inbox. The initial setup was a little arduous as I had a long history of emails that I had to “Sweep” into the “Done” pile, but I persevered and was rewarded with the first clean inbox I have seen in years.

I have to tell you the feeling of a clean inbox really made my day. I felt like a weight was lifted off my shoulders and that I was free to focus on work that really mattered.

Of course anyone can click Select All and then Delete and get the same result in just about any email program. What was different about Inbox was that of course there were some recent emails I couldn’t just simply sweep into the done pile. For those emails you just hit the snooze button. Wow, who doesn’t like to hit the snooze button?

So the snooze button allows me to easily snooze an email till tomorrow morning or next Monday morning. Typically my emails fall into one of those categories, but if I want that email to pop up at a specific time I can pick a custom date and time as well.

The other brilliant option that I absolutely love is the ability to “put a pin” in an email. Once you pin an email you can’t accidentally sweep it into the done pile until you unpin it. So basically my email work flow is:

  • quickly scan my Inbox
  • ignore the 95% junk email
  • put a pin in anything that looks important
  • sweep everything else into the done pile with a single click
  • look at my pinned emails, snooze the stuff that can wait
  • deal with only the most important emails first

I’ve only been using Inbox for a few weeks, but I am hooked and I am loving my clean inbox.

But of course this really only addresses my email problem. There isn’t any integration with Google Calendar, so I still don’t have the seamless workflow between email and calendaring that I had with Outlook on Windows.

Enter the new Gmail interface…

Admittedly, I am a little late to the game here. Apparently it became available to regular gmail.com user accounts back on April 25th. I don’t use a regular Gmail account, so I was just generally unaware of the change until I just stumbled upon it a few hours ago.

The Accidental Google Admin

As any IT person will tell you, once people know you are tech savvy you tend to get roped into a lot of things. In my case it is the local marching band parent organization. Just recently I found myself being recruited to take over their social media accounts, WordPress website and Mailchimp email database. Although they owned their own domain for the website, they were not using it for email addresses.

Long story short, I got them signed up with a free G Suite for Nonprofits account since they had already embraced Google docs for some document sharing. It was pretty easy to get them up and running. Of course one of the first things I tried was to connect to my account with Google Inbox. The message I got back was pretty clear…the Administrator needs to enable Google Inbox first.

Seeing as I’m the Google Admin, I put my Google search skills to work and quickly found the page in the Google Admin site where I could allow Google Inbox.

As I enabled Inbox my eyes wandered down to the next setting…New Gmail.

“What on earth is New Gmail?”

I decided to throw caution to the wind and went ahead and selected “Allow my users access to the new Gmail UI”.

Apparently this feature was just GA’d for G Suite users, so my timing was fortunate.

As I opened mail.google.com and logged in with my marching band account I was glad to see at the top of the Settings menu I was able to “Try the new Band Parents Email”. Without hesitating I clicked it and made the switch.

Now I’ve only been using the new Gmail for a few hours but I think I’m going to be very happy with it. I was super happy to see that there is a SNOOZE button I can use. They don’t appear to have the pin and sweep option, but you can simulate that functionality by putting a star next to important emails and then Select Unstarred and then just delete the rest. It’s still easier in Inbox, and this isn’t really new functionality for Gmail. Hopefully Google will adopt the pin and sweep method of clearing emails in the new Gmail interface soon.

But the key feature I really like is the tighter calendar integration. The one thing I love about Outlook is that from an email message you can schedule a meeting and everyone addressed in the email will automatically be part of the meeting invite, which opens up in the calendaring section of Outlook.

At first glance I thought this feature was still missing. However, after a little Google searching I discovered that it is in there.

In fact, it had been there in the old interface as well, but I wasn’t seeing it because I had enabled a Vertical Split so I could view my email like I was accustomed to in Outlook. In the new Gmail this is a standard feature you can enable. In the old Gmail it was in Google Labs.

What I have come to find out though is that if you enable a Vertical or Horizontal Split you lose the ability to create an event directly from an email. That is really disappointing because in my mind those are two really key features. Hopefully someone will figure out how to get those features working together.

One feature that looks really interesting is Nudging. I think I will have to use it for a few days to experience it first hand, but as described Gmail will remind you about emails that may have fallen through the cracks and also about sent emails that you haven’t received a response to yet.

I’m really most curious about that second use case as I’m sure many of you are. How is Google going to know what emails I expect to receive responses from? I’m assuming there is a bit of AI going on behind the scenes, but I’m sure I’ll learn more in the next week or two and will come back with some updates.

In the meantime, get your Google admin to enable the new Gmail and read about the rest of the new features here: https://support.google.com/mail/answer/7677724

 

 

SQL Server 2017 on Linux Availability Group Split Brain Problem

On July 18th, 2018 Microsoft published this support article with some guidance to help avoid Split Brain when using Availability Groups with SQL Server on Linux.

https://support.microsoft.com/en-us/help/4341219/split-brain-occurs-after-failover-when-using-alwayson-ags-with-externa

Running SQL Server on Linux can have some advantages, including cost savings on the OS if running in Azure. Run the numbers yourself: as the number of cores goes up, your cost savings year over year can be substantial, considering you are licensing at least two servers for every cluster pair.

https://azure.microsoft.com/en-us/pricing/calculator/

However, why bother saving money if the technology is not rock solid? One of the biggest issues I see with running SQL Server on Linux is the lack of a cohesive HA/DR story. On Windows, Microsoft owns the whole HA stack and SQL Server relies heavily on Windows Server Failover Clustering to support both Availability Groups and Failover Cluster Instances. This has been running well for many years and has a long track record of success stories.

When moving to Linux, Microsoft no longer owns the HA stack at the OS level and depending upon your distro of Linux, you are left trying to piece together open source solutions like Pacemaker, trying to get things to cooperate with SQL Server Availability Groups.

While you may eventually get it to work, I would much rather look to a 3rd party high availability solution like the SIOS Protection Suite for Linux (SPS-L), giving you a tried and true HA solution for your business critical applications running on Linux.

SQL Server on Linux Cluster in Azure

SPS-L has been protecting business critical applications running on Linux since 1999. It is a full HA/DR solution that monitors and recovers the entire application stack as well as the physical servers and network to ensure your business critical applications are highly available, while also maintaining a 3rd copy for disaster recovery in a remote datacenter or different geographic region of the cloud.

The other benefit of SPS-L is that it doesn’t require the Enterprise Edition of SQL Server, so there can be a significant cost savings advantage on SQL Server licenses as well. If you consider SQL Server Standard Edition costs $1859 per core vs $7128 per core for SQL Server Enterprise Edition, the cost savings advantage can be significant, depending upon how many cores you need to license.
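Here is a quick Python sketch of that math using the two per-core list prices quoted above. It assumes every core on both nodes of a two-node cluster is licensed and ignores licensing minimums, discounts and any passive-node benefits, so treat the output as a ballpark rather than a quote. The function name and the 8-core example are made up for illustration.

```python
STANDARD_PER_CORE = 1859     # list price per core quoted above
ENTERPRISE_PER_CORE = 7128   # list price per core quoted above

def license_savings(cores_per_node, nodes=2):
    """Difference in license cost between Enterprise and Standard Edition."""
    licensed_cores = cores_per_node * nodes
    return (ENTERPRISE_PER_CORE - STANDARD_PER_CORE) * licensed_cores

# Hypothetical two-node cluster with 8 cores per node
print(license_savings(8))
```

The difference works out to $5,269 per licensed core, so it scales quickly as the core count goes up.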

Below is a video demonstration of SPS-L protecting SQL Server running on Linux in the Azure Cloud. The demonstration shows a SQL Server Standard Edition Cluster being manually failed over between nodes in different Azure Fault Domains as well as SPS-L responding to an unexpected failure.

 

 
