TICK TOCK…6 MONTHS UNTIL SQL SERVER 2008/2008 R2 SUPPORT EXPIRES UNLESS YOU TAKE ACTION

If you are still running SQL Server 2008/2008 R2, you have probably heard by now that extended support ends on July 9, 2019. However, realizing that a significant number of customers on this platform will not be able to upgrade to a newer version of SQL Server before that deadline, Microsoft has offered two options that provide extended security updates for an additional three years.

The first option requires the annual purchase of “Extended Security Updates”. Extended Security Updates cost 75% of the full license cost annually and also require that the customer is on active Software Assurance, which is typically 25% of the license cost annually. So effectively, to receive Extended Security Updates you are paying the equivalent of new SQL Server licenses annually for three years, or until you migrate off SQL Server 2008/2008 R2.

However, there is a second option. Microsoft has announced that if you move your SQL Server 2008/2008 R2 instances to Azure, you will receive the Extended Security Updates at no additional charge. There are of course the hourly infrastructure charges you will incur in Azure, plus either the cost of pay-as-you-go SQL Server instances or the Software Assurance charges if you want to bring your existing SQL Server licenses to Azure. But that cost includes the added benefit of running in a state-of-the-art cloud environment, which opens up opportunities for enhanced performance and HA/DR scenarios that you may not have had available on premises.

Azure offers many different options in terms of CPU, memory, and storage configurations. If you are looking for a server or storage upgrade, or your existing on-premises infrastructure is reaching a refresh cycle, now is the perfect time to dip your toes into the Azure cloud and upgrade your performance and availability while extending the life of your SQL Server 2008/2008 R2 deployment.

In terms of high availability and disaster recovery configurations, Azure offers up to a 99.99% SLA. To qualify for the SLA you must leverage their infrastructure appropriately, and even then the SLA only covers “dial tone” to the instance. It is up to you to ensure SQL Server is highly available, which is traditionally done by building a SQL Server Failover Cluster Instance (FCI). Azure has the infrastructure in place to support a SQL Server FCI, but due to the lack of cluster-aware shared storage in the cloud, you will need to use SIOS DataKeeper to build the FCI. I recently wrote a step-by-step guide to help you with the process: Step-by-Step: How to configure a SQL Server 2008 R2 Failover Cluster Instance in Azure

SIOS DataKeeper takes the place of the shared storage normally required by a SQL Server FCI and instead lets you leverage any NTFS-formatted volumes that are attached to each instance. SIOS keeps the volumes replicated between the instances and presents the storage to the cluster as a resource called a DataKeeper Volume. As far as the cluster is concerned, the DataKeeper Volume looks like a shared disk, but instead of controlling SCSI reservations (disk locking), it controls the mirror direction, ensuring writes occur on the active server and are synchronously or asynchronously replicated to the other cluster nodes. The end-user experience is exactly the same as a traditional shared-storage cluster, but under the covers the cluster is leveraging locally attached storage instead of shared storage.

In Azure your cluster nodes can run in different racks (Fault Domains), datacenters (Availability Zones), or even different geographic regions. SIOS DataKeeper supports all three options: Fault Domain, Availability Zone, or cross-region replication, covering both HA and DR requirements. Similar configurations are also possible in AWS and Google Cloud.

[Figure: Typical 2-node SQL Server FCI configuration in Azure with SIOS DataKeeper]

With Azure Site Recovery (ASR) you can replicate standalone or clustered instances of SQL Server between region pairs, without the headache and expense of managing your own disaster recovery site. And of course, SQL Server seldom lives alone, so when you move your SQL Server instance to Azure you will probably want to move your application servers as well, to take advantage of the same performance and availability upgrades. Combining SIOS DataKeeper for HA and ASR for DR provides a cost-effective HA and DR strategy that would have been impossible, or extremely expensive, to implement on premises with SAN replication and your own DR site.

[Figure: Common configuration leveraging SIOS DataKeeper for HA and Azure Site Recovery for DR]

While it only takes a few minutes to spin up a SQL Server instance in Azure, I wouldn’t wait until the last minute to do your migration. Please take the next few months to become familiar with Azure, start doing some testing, and then plan to migrate your workloads well before the July 9, 2019 expiration date. Running SQL Server after that date leaves you susceptible to any new security threats and also puts you out of compliance. Your boss, and more importantly your customers, will be glad to know that their data is still secure, available, and in compliance once you migrate your workload to Azure.


Step-by-Step: How to Trigger an Email Alert when a Specific Windows Service Starts or Stops on Windows Server 2016

Introduction

In my last post, Step-by-Step: How to Trigger an Email Alert from a Windows Event that Includes the Event Details using Windows Server 2016, I showed you how to send an email alert based upon specific Windows EventIDs being logged in a Windows event log. While that works great for most events, it is not ideal if you want to be notified when a specific Windows service starts or stops.

When a Windows service starts or stops, an EventID 7036 from the source “Service Control Manager” is logged in the Windows System log. We could simply set up a trigger to send an email whenever that EventID is logged, as I described in my previous post, but you probably don’t want to receive an email when EVERY Windows service starts or stops.

To get a little more specific, we will have to edit the XML data associated with the Windows event filter when we set up the trigger, so that it looks deeper into the event properties and filters on the EventData that is only shown in the XML View on the Details tab of a Windows event.

This work was verified on Windows Server 2016, but I suspect it should work on Windows Server 2012 R2 and Windows Server 2019 as well. If you get it working on any other platforms please comment and let us know if you had to change anything.

Step 1 – Write a Powershell Script

The first thing that you need to do is write a Powershell script that, when run, can send an email. While researching this I discovered many ways to accomplish the task, so what I’m about to show you is just one way; feel free to experiment and use what is right for your environment.

In my lab I do not run my own SMTP server, so I had to write a script that could leverage my Gmail account. You will see in my Powershell script that the password of the email account used to authenticate to the SMTP server is in plain text. If you are concerned that someone may have access to your script and discover your password, then you will want to encrypt your credentials. Gmail requires an SSL connection, so your password should be safe on the wire, just like with any other email client.
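For example, here is a minimal sketch of one way to do that using Export-Clixml, which encrypts the stored password with DPAPI so it can only be read by the same user account on the same machine. The file path is just an example, and you would need to create the file while logged on as the account the scheduled task runs under.

# Run this once, interactively, to store the SMTP credential encrypted on disk
Get-Credential | Export-Clixml -Path 'C:\Alerts\smtp.cred'

# Then, in the alert script, load the stored credential instead of a plain text password
$cred = Import-Clixml -Path 'C:\Alerts\smtp.cred'
$SMTPClient.Credentials = $cred.GetNetworkCredential()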

Here is an example of a Powershell script that, when used in conjunction with Task Scheduler, will send an email alert automatically when a specified event is logged in the Windows event log. In my environment I saved this script to C:\Alerts\ServiceAlert.ps1

$filter="*[System[EventID=7036] and EventData[Data='SIOS DataKeeper']]"
$A = Get-WinEvent -LogName System -MaxEvents 1 -FilterXPath $filter
$Message = $A.Message
$EventID = $A.Id
$MachineName = $A.MachineName
$Source = $A.ProviderName


$EmailFrom = "sios@medfordband.com"
$EmailTo = "sios@medfordband.com"
$Subject ="Alert From $MachineName"
$Body = "EventID: $EventID`nSource: $Source`nMachineName: $MachineName `n$Message"
$SMTPServer = "smtp.gmail.com"
$SMTPClient = New-Object Net.Mail.SmtpClient($SmtpServer, 587)
$SMTPClient.EnableSsl = $true
$SMTPClient.Credentials = New-Object System.Net.NetworkCredential("sios@medfordband.com", "MySMTPP@55w0rd");
$SMTPClient.Send($EmailFrom, $EmailTo, $Subject, $Body)

An example of an email generated from that Powershell script looks like this.

[Screenshot: Service Alert Email]

You probably noticed that this Powershell script uses the Get-WinEvent cmdlet to grab the most recent event log entry matching the LogName, EventID and EventData specified. It then parses that event and assigns the EventID, Source, MachineName and Message to variables that will be used to compose the email. You will see that the LogName, EventID and EventData specified are the same as what you will specify when you set up the Scheduled Task in Step 2.

While EventID and LogName are probably familiar to you, EventData may not be. To see the EventData associated with a particular event you will need to open the event in Event Viewer, look at the Details tab, and then select XML View. From the XML View you can see all the data included with an event. Near the bottom of the XML you will see an array of data called <EventData>. Within it you will find additional event data stored as parameters. As shown below, “param1” holds the name of the service that either stopped or started.

[Screenshot: EventData shown in the XML View of an event]
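For reference, here is an abbreviated sketch of what that <EventData> section looks like for a 7036 event logged when the SIOS DataKeeper service stops. This is illustrative only; the exact layout may vary, and the trailing Binary element is omitted.

<EventData>
  <Data Name="param1">SIOS DataKeeper</Data>
  <Data Name="param2">stopped</Data>
</EventData>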

Step 2 – Set Up a Scheduled Task

In Task Scheduler, create a Task as shown in the following screenshots.

  1. Create Task
    Make sure the task is set to Run whether the user is logged on or not.
    [Screenshot: Create Task, General tab]
  2. On the Triggers tab choose New to create a Trigger that will begin the task “On an event”. In my example I will be creating a trigger that fires any time the SIOS DataKeeper service starts or stops, which is logged as EventID 7036 in the System log.
    [Screenshot: New Trigger, On an event]
    Create a custom event and New Event Filter as shown below…

    [Screenshot: New Event Filter]

    For my trigger, you can start by setting up a trigger that monitors EventID 7036, as I described in my previous article. However, the filter GUI does not allow us to specify the service name stored in Param1 of the EventData, as described earlier. In order to monitor just the specific service we are interested in, we will need to edit the XML directly, as shown below.

    [Screenshot: editing the event filter XML]
    If you would rather just cut to the chase, feel free to copy my XML below and replace ‘SIOS DataKeeper’ with the event data stored in param1 of the event you want to monitor.

    <QueryList>
    <Query Id="0" Path="System">
    <Select Path="System">*[System[(Level=4 or Level=0) and (EventID=7036)]] and *[EventData[Data[1]='SIOS DataKeeper']]</Select>
    </Query>
    </QueryList>
  3. Once the Event Trigger is configured, you will need to configure the Action that occurs when the task is triggered. In our case we are going to run the Powershell script that we created in Step 1.
    [Screenshots: Action configured to run the Powershell script]

  4. The default Condition parameters should be sufficient.
    [Screenshot: Conditions tab]
  5. And finally, on the Settings tab make sure you allow the task to be run on demand and to “Queue a new instance” if a task is already running.

    [Screenshot: Settings tab]

Step 3 (if necessary) – Fix the Microsoft-Windows-DistributedCOM Event ID: 10016 Error

In theory, if you did everything correctly you should now start receiving emails any time one of the events you are monitoring gets logged in the event log.  However, I ran into a weird permission issue on one of my servers that I had to address before everything worked. I’m not sure if you will run into this issue, but just in case here is the fix.

In my case, when I manually triggered the event, or if I ran the Powershell script directly, everything worked as expected and I received an email. However, if one of the EventIDs being monitored was logged in the event log, it would not result in an email being sent. The only clue I had was Event ID 10016, which was logged in my System event log each time I expected the Task Trigger to detect a logged event.

Log Name: System
Source: Microsoft-Windows-DistributedCOM
Date: 10/27/2018 5:59:47 PM
Event ID: 10016
Task Category: None
Level: Error
Keywords: Classic
User: DATAKEEPER\dave
Computer: sql1.datakeeper.local
Description:
The application-specific permission settings do not grant Local Activation permission for the COM Server application with CLSID 
{D63B10C5-BB46-4990-A94F-E40B9D520160}
and APPID 
{9CA88EE3-ACB7-47C8-AFC4-AB702511C276}
to the user DATAKEEPER\dave SID (S-1-5-21-25339xxxxx-208xxx580-6xxx06984-500) from address LocalHost (Using LRPC) running in the application container Unavailable SID (Unavailable). This security permission can be modified using the Component Services administrative tool.

Many of the Google search results for that error indicate that the error is benign and include instructions on how to suppress the error instead of fixing it. However, I was pretty sure this error was the cause of my failure to send an email alert from a Scheduled Task that was triggered by a monitored event log entry, so I needed to fix it.

After much searching, I stumbled upon this newsgroup discussion.  The response from Marc Whittlesey pointed me in the right direction. This is what he wrote…

There are 2 registry keys you have to set permissions before you go to the DCOM Configuration in Component services: CLSID key and APPID key.

I suggest you to follow some steps to fix issue:

1. Press Windows + R keys and type regedit and press Enter.
2. Go to HKEY_Classes_Root\CLSID\*CLSID*.
3. Right click on it then select permission.
4. Click Advance and change the owner to administrator. Also click the box that will appear below the owner line.
5. Apply full control.
6. Close the tab then go to HKEY_LocalMachine\Software\Classes\AppID\*APPID*.
7. Right click on it then select permission.
8. Click Advance and change the owner to administrators.
9. Click the box that will appear below the owner line.
10. Click Apply and grant full control to Administrators.
11. Close all tabs and go to Administrative tool.
12. Open component services.
13. Click Computer, click my computer, and then click DCOM.
14. Look for the corresponding service that appears on the error viewer.
15. Right click on it then click properties.
16. Click security tab then click Add User, Add System then apply.
17. Tick the Activate local box.

So use the relevant keys here and the DCOM Config should give you access to the greyed out areas:
CLSID {D63B10C5-BB46-4990-A94F-E40B9D520160}

APPID {9CA88EE3-ACB7-47C8-AFC4-AB702511C276}

I was able to follow Steps 1-15 pretty much verbatim. However, when I got to Step 16 I really couldn’t tell exactly what he wanted me to do. At first I granted the DATAKEEPER\dave user account Full Control to the RuntimeBroker, but that didn’t fix things. Eventually I just selected “Use Default” on all three permissions and that fixed the issue.

[Screenshot: RuntimeBroker properties]
I’m not sure how or why this happened, but I figured I better write it all down in case it happens again because it took me a while to figure it out.

Step 4 – Automating the Deployment

If you need to enable the same alerts on multiple systems you can simply export your Task to an XML file and Import it on your other systems.

[Screenshot: exporting and importing the Task]

Or even better yet, automate the Import as part of your build process through a Powershell script after making your XML file available on a file share as shown in the following example.

PS C:\> Register-ScheduledTask -Xml (Get-Content '\\myfileshare\tasks\DataKeeperAlerts.xml' | Out-String) -TaskName "DataKeeper Service Alerts" -User datakeeper\dave -Password MyDomainP@55W0rd -Force
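If you would rather script the export side as well, a minimal sketch using Export-ScheduledTask (assuming the task name and file share from the examples above) would look something like this.

PS C:\> Export-ScheduledTask -TaskName "DataKeeper Service Alerts" | Out-File '\\myfileshare\tasks\DataKeeperAlerts.xml'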


In Summary

Hopefully what I have provided will give you everything you need to start receiving alert notification emails on whichever Windows Services keep you up at night.

This concludes my series on configuring email alerts. In this series I covered configuring alerts based on Perfmon counters, event log entries, and, in this article, Windows service start and stop events. Of course, you can extend the Powershell scripts described in these articles to do more than just send emails. Many alerts or unexpected service stoppages require some remediation, so why not script out the recovery steps and let the triggered task take care of the issue for you?
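For example, a minimal sketch of a remediation step you could add to the top of the alert script, assuming the service display name used earlier in this article, could be as simple as this.

# Try to restart the stopped service before (or instead of) sending the alert email
Start-Service -DisplayName "SIOS DataKeeper" -ErrorAction SilentlyContinue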

Personally, I recommend that you invest in SCOM, SolarWinds, or some other enterprise management system, but if that is not in the cards where you work, then these articles can help in a pinch.


Step-by-Step: How to Trigger an Email Alert from a Windows Event that Includes the Event Details using Windows Server 2016

Introduction

Setting up an email alert is as simple as creating a Windows task that is triggered by an event. You then specify the action that will occur when that task is triggered. Since Microsoft has decided to deprecate the “Send an e-mail” option, the only choice we have is to start a program. In our case that program will be a Powershell script that collects the event log information and parses it so that we can send an email that includes the important event details.
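When you get to that step in Task Scheduler, the action is typically configured to run powershell.exe with the script passed as an argument. A sketch of those settings, using the script path from later in this post, looks like this.

Program/script: powershell.exe
Add arguments:  -NoProfile -ExecutionPolicy Bypass -File "C:\Alerts\DataKeeper.ps1"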

This work was verified on Windows Server 2016, but I suspect it should work on Windows Server 2012 R2 and Windows Server 2019 as well. If you get it working on any other platforms please comment and let us know if you had to change anything.

Step 1 – Write a Powershell Script

The first thing that you need to do is write a Powershell script that, when run, can send an email. While researching this I discovered many ways to accomplish the task, so what I’m about to show you is just one way; feel free to experiment and use what is right for your environment.

In my lab I do not run my own SMTP server, so I had to write a script that could leverage my Gmail account. You will see in my Powershell script that the password of the email account used to authenticate to the SMTP server is in plain text. If you are concerned that someone may have access to your script and discover your password, then you will want to encrypt your credentials. Gmail requires an SSL connection, so your password should be safe on the wire, just like with any other email client.

Here is an example of a Powershell script that, when used in conjunction with Task Scheduler, will send an email alert automatically when a specified event is logged in the Windows event log. In my environment I saved this script to C:\Alerts\DataKeeper.ps1

$EventId = 16,20,23,150,219,220

$A = Get-WinEvent -MaxEvents 1  -FilterHashTable @{Logname = "System" ; ID = $EventId}
$Message = $A.Message
$EventID = $A.Id
$MachineName = $A.MachineName
$Source = $A.ProviderName


$EmailFrom = "sios@medfordband.com"
$EmailTo = "sios@medfordband.com"
$Subject ="Alert From $MachineName"
$Body = "EventID: $EventID`nSource: $Source`nMachineName: $MachineName `nMessage: $Message"
$SMTPServer = "smtp.gmail.com"
$SMTPClient = New-Object Net.Mail.SmtpClient($SmtpServer, 587)
$SMTPClient.EnableSsl = $true
$SMTPClient.Credentials = New-Object System.Net.NetworkCredential("sios@medfordband.com", "mySMTPP@55w0rd");
$SMTPClient.Send($EmailFrom, $EmailTo, $Subject, $Body)

An example of an email generated from that Powershell script looks like this.

[Screenshot: example alert email]

You probably noticed that this Powershell script uses the Get-WinEvent cmdlet to grab the most recent event log entry matching the LogName and EventIDs specified. It then parses that event and assigns the EventID, Source, MachineName and Message to variables that will be used to compose the email. You will see that the LogName, Source and EventIDs specified are the same as the ones you will specify when you set up the Scheduled Task in Step 2.

Step 2 – Set Up a Scheduled Task

In Task Scheduler, create a Task as shown in the following screenshots.

  1. Create Task
    Make sure the task is set to Run whether the user is logged on or not.
    [Screenshot: Create Task, General tab]

  2.  On the Triggers tab choose New to create a Trigger that will begin the task “On an Event”. In my example I will be creating an event that triggers any time DataKeeper (extmirr) logs an important event to the System log.
    [Screenshot: New Trigger, On an event]
    Create a custom event and New Event Filter as shown below…

    [Screenshot: New Event Filter]

    For my trigger I am triggering on commonly monitored SIOS DataKeeper (ExtMirr) EventIDs 16, 20, 23, 150, 219, and 220. You will need to set up your event filter to trigger on the specific events that you want to monitor. You can put multiple Triggers in the same Task if you want to be notified about events that come from different logs or sources. (A sketch of an equivalent filter in XML form follows this list.)

    [Screenshot: Create a New Event Filter]


  3. Once the Event Trigger is configured, you will need to configure the Action that occurs when the task is triggered. In our case we are going to run the Powershell script that we created in Step 1.
    [Screenshots: Action configured to run the Powershell script]

  4. The default Condition parameters should be sufficient.
    [Screenshot: Conditions tab]
  5. And finally, on the Settings tab make sure you allow the task to be run on demand and to “Queue a new instance” if a task is already running.

    [Screenshot: Settings tab]
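For reference, a filter equivalent to the one shown in the screenshots above, expressed in the XML form found on the Event Filter's XML tab, would look roughly like the sketch below. I am assuming the provider name ExtMirr here; adjust the provider and EventIDs to match the events you want to monitor.

<QueryList>
  <Query Id="0" Path="System">
    <Select Path="System">*[System[Provider[@Name='ExtMirr'] and (EventID=16 or EventID=20 or EventID=23 or EventID=150 or EventID=219 or EventID=220)]]</Select>
  </Query>
</QueryList>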

Step 3 (if necessary) – Fix the Microsoft-Windows-DistributedCOM Event ID: 10016 Error

In theory, if you did everything correctly you should now start receiving emails any time one of the events you are monitoring gets logged in the event log.  However, I ran into a weird permission issue on one of my servers that I had to address before everything worked. I’m not sure if you will run into this issue, but just in case here is the fix.

In my case, when I manually triggered the event, or if I ran the Powershell script directly, everything worked as expected and I received an email. However, if one of the EventIDs being monitored was logged in the event log, it would not result in an email being sent. The only clue I had was Event ID 10016, which was logged in my System event log each time I expected the Task Trigger to detect a logged event.

Log Name: System
Source: Microsoft-Windows-DistributedCOM
Date: 10/27/2018 5:59:47 PM
Event ID: 10016
Task Category: None
Level: Error
Keywords: Classic
User: DATAKEEPER\dave
Computer: sql1.datakeeper.local
Description:
The application-specific permission settings do not grant Local Activation permission for the COM Server application with CLSID 
{D63B10C5-BB46-4990-A94F-E40B9D520160}
and APPID 
{9CA88EE3-ACB7-47C8-AFC4-AB702511C276}
to the user DATAKEEPER\dave SID (S-1-5-21-25339xxxxx-208xxx580-6xxx06984-500) from address LocalHost (Using LRPC) running in the application container Unavailable SID (Unavailable). This security permission can be modified using the Component Services administrative tool.

Many of the Google search results for that error indicate that the error is benign and include instructions on how to suppress the error instead of fixing it. However, I was pretty sure this error was the cause of my failure to send an email alert from a Scheduled Task that was triggered by a monitored event log entry, so I needed to fix it.

After much searching, I stumbled upon this newsgroup discussion.  The response from Marc Whittlesey pointed me in the right direction. This is what he wrote…

There are 2 registry keys you have to set permissions before you go to the DCOM Configuration in Component services: CLSID key and APPID key.

I suggest you to follow some steps to fix issue:

1. Press Windows + R keys and type regedit and press Enter.
2. Go to HKEY_Classes_Root\CLSID\*CLSID*.
3. Right click on it then select permission.
4. Click Advance and change the owner to administrator. Also click the box that will appear below the owner line.
5. Apply full control.
6. Close the tab then go to HKEY_LocalMachine\Software\Classes\AppID\*APPID*.
7. Right click on it then select permission.
8. Click Advance and change the owner to administrators.
9. Click the box that will appear below the owner line.
10. Click Apply and grant full control to Administrators.
11. Close all tabs and go to Administrative tool.
12. Open component services.
13. Click Computer, click my computer, and then click DCOM.
14. Look for the corresponding service that appears on the error viewer.
15. Right click on it then click properties.
16. Click security tab then click Add User, Add System then apply.
17. Tick the Activate local box.

So use the relevant keys here and the DCOM Config should give you access to the greyed out areas:
CLSID {D63B10C5-BB46-4990-A94F-E40B9D520160}

APPID {9CA88EE3-ACB7-47C8-AFC4-AB702511C276}

I was able to follow Steps 1-15 pretty much verbatim. However, when I got to Step 16 I really couldn’t tell exactly what he wanted me to do. At first I granted the DATAKEEPER\dave user account Full Control to the RuntimeBroker, but that didn’t fix things. Eventually I just selected “Use Default” on all three permissions and that fixed the issue.

[Screenshot: RuntimeBroker properties]
I’m not sure how or why this happened, but I figured I better write it all down in case it happens again because it took me a while to figure it out.

Step 4 – Automating the Deployment

If you need to enable the same alerts on multiple systems you can simply export your Task to an XML file and Import it on your other systems.

[Screenshot: exporting and importing the Task]

Or even better yet, automate the Import as part of your build process through a Powershell script after making your XML file available on a file share as shown in the following example.

PS C:\> Register-ScheduledTask -Xml (Get-Content '\\myfileshare\tasks\DataKeeperAlerts.xml' | Out-String) -TaskName "DataKeeperAlerts" -User datakeeper\dave -Password MyDomainP@55W0rd -Force


In Summary

Hopefully what I have provided will give you everything you need to start receiving alert notification emails on whichever Event Log entries keep you up at night.

In my next post I will show you how to be notified when a specified service either starts or stops. Of course, you could just monitor for EventID 7036 from the Service Control Manager, but that would notify you whenever ANY service starts or stops. We will need to dig a little deeper to make sure we get notified only when the services we care about start or stop.


Moving SQL Server 2008 and 2008 R2 clusters to #Azure for Extended Support

Earlier this year Microsoft announced extended support for SQL Server 2008 and 2008 R2 at no additional cost. However, the catch is that you must migrate your SQL Server installation to Azure in order to take advantage of the extended support. For all the details, check out https://www.microsoft.com/en-us/sql-server/sql-server-2008. If you choose not to move, your extended support ends on July 9th, 2019, just about 9 months from now.


Chances are, if you are still running SQL Server 2008/2008 R2, it’s simply because you never upgraded your application, so newer versions of SQL Server are not supported, or you simply decided not to fix what isn’t broken. Regardless of the reason, you have just bought yourself another three years of support, if you migrate to Azure.

Migrating workloads to Azure with Azure Site Recovery is a pretty well documented procedure, so the process should be fairly seamless for your standalone instances of SQL Server.

But what about those clustered instances of SQL Server? You certainly don’t want to give up availability when you move to Azure. Part of the beauty of Azure is that Microsoft has infrastructure you can only dream of. However, it is incumbent upon the user to configure their applications to take full advantage of that infrastructure to ensure their deployments are highly available.

With SQL Server 2008 and 2008 R2, high availability commonly means SQL Server Failover Clustering on either Windows Server 2008 R2 or Windows Server 2012 R2. If you are new to Azure, you will quickly discover that there is no native option that supports shared-storage clusters. Instead, you will need to look at a SANless cluster solution such as SIOS DataKeeper. Microsoft lists SIOS DataKeeper as the HA solution for SQL Server Failover Clustering in their documentation.

https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sql/virtual-machines-windows-sql-high-availability-dr

To facilitate a simple migration of your existing SQL Server 2008 or 2008 R2 cluster to Azure, here are the high-level steps you will need to take.

  • Replace the Physical Disk resources in your existing on-premises SQL Server cluster with DataKeeper Volume resources. Do the same for MSDTC resources if you use MSDTC.
  • Remove your Disk Witness and replace it with a File Share Witness
  • Use Azure Site Recovery to replicate your cluster nodes into Azure, making sure each replicated node resides in a different Fault Domain or in different Availability Zones in Azure
  • Recover your replicated cluster nodes in Azure
  • Replace the File Share Witness with a File Share hosted in Azure
  • Configure the Internal Load Balancer (ILB) in Azure for client redirection; this includes running a Powershell script on the cluster nodes to update the SQL Server cluster IP resource to listen for the ILB probe (see the sketch after this list)
  • Assuming the IP addresses and subnet of the SQL Server cluster instances changed as part of the migration, you will also need to do some cleanup of the cluster IP addresses and the DataKeeper job endpoints to reflect the new IP addresses
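For reference, the probe-port update mentioned above is typically done with a short Powershell script run on the cluster nodes, along the lines of the following sketch. The cluster network name, IP resource name, ILB address, and probe port below are placeholders you will need to replace with your own values.

Import-Module FailoverClusters

$ClusterNetworkName = "Cluster Network 1"          # network name shown in Failover Cluster Manager
$IPResourceName = "SQL IP Address 1 (sqlcluster)"  # the SQL Server cluster IP address resource
$ILBIP = "10.0.0.100"                              # the static IP address assigned to the ILB

# Have the cluster IP resource listen for the ILB health probe on port 59999
Get-ClusterResource $IPResourceName | Set-ClusterParameter -Multiple `
    @{Address=$ILBIP;ProbePort=59999;SubnetMask="255.255.255.255";Network=$ClusterNetworkName;EnableDhcp=0}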

I know I left out a lot of the details, but if you find yourself in the position of having to do a lift and shift of SQL Server to Azure, or any cloud for that matter, I’d be glad to get on the phone with you to answer any questions you may have. Keep in mind, the same steps apply for any version of SQL that you plan to migrate to Azure.

2/14/19 UPDATE: I published a detailed Step-by-Step Guide for Cluster SQL Server 2008 R2 on Azure


#Ignite2018 Session: Ensure application availability with cloud-based disaster recovery, Azure Site Recovery #SAP #BusinessContinuity

I’m a big fan of Azure Site Recovery for disaster recovery, and I was glad to attend the Ignite session today presented by Rochak Mittal and Ashish Gangwar.

BRK3304 – Architecting mission-critical, high-performance SAP workloads on Azure

In one of the architecture slides they showed how an entire SAP deployment could be protected by Azure Site Recovery (ASR) and recovered in the event of a disaster in just a few minutes. Using Azure Recovery Plans allows you to have explicit control over recovery, including creating dependencies on resources as well as invoking scripts within a VM to help facilitate the complete recovery.

It seems like yesterday, but it was back in May of 2014 when I first started assisting Microsoft with providing an HA solution for SAP ASCS in Azure. That solution involves using DataKeeper to build a SANless cluster for ASCS, and it still stands today as the only HA solution that also works with ASR for disaster recovery configurations such as the one shown in this demo at Ignite.

[Figure: Shared disks in Azure with SIOS DataKeeper]

If you need help planning your highly available SAP deployment in Azure, definitely reach out to me; I’d be glad to assist.



Azure Outage Post-Mortem Part 3

My previous blog posts, Azure Outage Post-Mortem – Part 1 and Azure Outage Post-Mortem Part 2, made some assumptions based upon limited information coming from blog posts and Twitter. I just attended a session at Ignite which gave a little more clarity as to what actually happened. Sometime tomorrow you should be able to view the session for yourself.

BRK3075 – Preparing for the unexpected: Anatomy of an Azure outage

They said the official Root Cause Analysis will be published soon, but in the meantime here are some tidbits of information gleaned from the session.

The outage was NOT caused by a lightning strike as previously reported. Instead, due to the nature of the storm, there were electrical sags and swells, which locked out a chiller plant in the first datacenter. During this first outage they were able to recover the chiller quickly with no noticeable impact. Shortly thereafter, there was a second outage at a second datacenter which was not recovered properly, and that began an unfortunate series of events.

During this 2nd outage, Microsoft states that “Engineers didn’t triage alerts correctly – chiller plant recovery was not prioritized”. There were numerous alerts being triggered at this time, and unfortunately the chiller being offline did not receive the priority it should have. The RCA as to why that happened is still being investigated.

Microsoft states that redundant chiller systems are of course in place. However, the cooling systems were not set to automatically fail over. Recently installed new equipment had not been fully tested, so it was set to manual mode until testing had been completed.

After 45 minutes the ambient cooling failed, hardware shut down, the air handlers shut down because they thought there was a fire, and staff had been evacuated due to the false fire alarm. During this time the temperature in the datacenter was increasing, and some hardware was not shut down properly, causing damage to some storage and networking equipment.

After manually resetting the chillers and opening the air handlers the temperature began to return to normal. It took about 3 hours and 29 minutes before they had a complete picture of the status of the datacenter.

The biggest issue was the damage to storage. Microsoft’s primary concern is data protection, so short of the entire datacenter sinking into a sinkhole or a meteor strike taking out the datacenter, Microsoft will work to recover data to ensure no data loss. This of course took some time, which extended the overall length of the outage. The good news is that no customer data was lost; the bad news is that it seemed to take 24-48 hours for things to return to normal, based upon what I read on Twitter from customers complaining about the prolonged outage.

Everyone expected that this outage would impact customers hosted in the South Central Region, but what they did not expect was that the outage would have an impact outside of that region. In the session, Microsoft discusses some of the extended reach of the outage.

Azure Service Manager (ASM) – This controls Azure “Classic”, aka pre-ARM, resources. Anyone relying on ASM could have been impacted. It wasn’t clear to me why this happened, but it appears that the South Central region hosts some important components of that service which became unavailable.

Visual Studio Team Services (VSTS) – Again, it appears that many resources that support this service are hosted in the South Central region. This outage is described in great detail by Buck Hodges (@tfsbuck), Director of Engineering, Azure DevOps, in this blog post:

Postmortem: VSTS 4 September 2018

Azure Active Directory (AAD) – When the South Central region failed, AAD did what it was designed to do and started directing authentication requests to other regions. As the East Coast started to wake up and come online, authentication traffic started picking up. Normally AAD would handle this increase in traffic through autoscaling, but autoscaling has a dependency on ASM, which of course was offline. Without the ability to autoscale, AAD was not able to handle the increase in authentication requests. Exacerbating the situation was a bug in Office clients which gave them very aggressive retry logic and no backoff logic. This additional authentication traffic eventually brought AAD to its knees.

They ran out of time to discuss this further during the Ignite session, but one feature that they will be introducing is the ability for users to fail over Storage Accounts manually. So in cases where the recovery time objective (RTO) is more important than the recovery point objective (RPO), the user will have the ability to recover their asynchronously replicated geo-redundant storage in an alternate datacenter should Microsoft experience another extended outage in the future.

Until that time, you will have to rely on other replication solutions such as SIOS DataKeeper, Azure Site Recovery, or application-specific replication solutions, which give you the ability to replicate data across regions and put the ability to enact your disaster recovery plan in your control.



Azure Outage Post-Mortem Part 2

My previous blog post said that cloud-to-cloud or hybrid-cloud configurations would give you the most isolation from just about any issue a CSP could encounter. However, in this particular failure, had Availability Zones been available in the South Central region, most of the downtime caused by this natural disaster could have been avoided. Microsoft published a Preliminary RCA of the September 4th South Central outage.

The most important part of that whole summary is as follows…

“Despite onsite redundancies, there are scenarios in which a datacenter cooling failure can impact customer workloads in the affected datacenter.”

What does that mean to you? If your applications all run in the same datacenter, you are susceptible to the same type of outage in the future. In Microsoft’s defense, this really shouldn’t be news to you; it has always been true whether you run in Azure, AWS, Google, or your own datacenter. Failing to replicate your data to a different datacenter, with a plan in place to quickly recover your applications there in the event of a disaster, is simply a lack of planning on your part.

While Microsoft doesn’t publish exact Availability Zone locations, if you believe the map published here, you could guess that they are probably anywhere from 2 to 10 miles apart from each other.

[Image: map of Azure datacenters]

In all but the most extreme cases, replicating data across Availability Zones should be sufficient for data protection. Some applications, such as SQL Server, have built-in replication technology, but for a broad range of applications, operating systems and data types you will want to investigate block-level replication SANless cluster solutions. SANless cluster solutions have traditionally been used for multisite clusters, but the same technology can also be used in the cloud across Availability Zones, Regions, or Hybrid-Cloud configurations for high availability and disaster recovery.

Implementing a SANless cluster that spans Availability Zones, whether it is Azure, AWS or Google, is a pretty simple process given the right tools. Here are a few resources to help get you started.

Step-by-Step: Configuring a File Server Cluster in Azure that Spans Availability Zones

How to Build a SANless SQL Server Failover Cluster Instance in Google Cloud Platform

MS SQL Server v.Next on Linux with Replication and High Availability #Azure #Cloud #Linux

Deploying Microsoft SQL Server 2014 Failover Clusters in #Azure Resource Manager (ARM)

SANless SQL Server Clusters in AWS

SANless Linux Cluster in AWS Quick Start

If you are in Azure you may also want to consider Azure Site Recovery (ASR). ASR lets you replicate the entire VM from one Azure region to another region. ASR will replicate your VMs in real-time and allow you to do a non-disruptive DR test whenever you like. It supports most versions of Windows and Linux and is relatively easy to set up.

You can also create replication jobs that have “Multi-VM Consistency”, meaning that servers which must be recovered from the exact same point in time can be put together in a consistency group and will share the exact same recovery point. This means that if you wanted to build a SANless cluster with DataKeeper in a single region for high availability, you have two options for DR: you could extend your SANless cluster to a node in a different region, or you could simply use ASR to replicate both nodes in a consistency group.


The trade-off with ASR is that the RPO and RTO are not as good as what you will get with a SANless multi-site cluster, but it is easy to configure and works with just about any application. Just be careful: if your application regularly exceeds 10 MBps of disk-write activity, ASR will not be able to keep up. Also, clusters based on Storage Spaces Direct cannot be replicated with ASR and in general lack a good DR strategy when used in Azure.

For about a year after Managed Disks were released, ASR did not fully support them, which was a big hurdle for many people looking to use ASR. Fortunately, since about February of 2018, ASR fully supports Managed Disks. However, another problem has just been introduced.

With the introduction of Availability Zones, ASR is once again caught behind the times: it currently does not support VMs that have been deployed in Availability Zones.

[Screenshot: Support matrix for replicating from one Azure region to another]

I went ahead and tried it anyway. I seemed to be able to configure replication and was able to do a test failover.

[Screenshot: ASR replication test] I used ASR to replicate SQL1 and SQL3 from Central US to East US 2 and did a test failover. Other than not placing the VMs in Availability Zones in East US 2, it seems to work.

I’m hoping to find out more about this limitation at the Ignite conference. I don’t think this limitation is as critical as the Managed Disk limitation was, just because Availability Zones aren’t widely available yet. So hopefully ASR will pick up support for Availability Zones as more regions light them up and they become more widely adopted.

