Step-by-Step: How to Trigger an Email Alert when a Specific Windows Service Starts or Stops on Windows Server 2016

Introduction

In my last post. Step-by-Step: How to Trigger an Email Alert from a Windows Event that Includes the Event Details using Windows Server 2016, I showed you how to send an email alert based upon specific Windows EventIDs being logged in a Windows Event Log. While that works great for most events it is not ideal if you want to be notified when a specific Windows Service starts or stops.

When a Windows Service starts or stops an EventID 7036 from the Source “Service Control Manager” is logged in the Windows System Log. Now we could simply set up a trigger to send an email whenever that EventID is logged as I described in my previous post, however you might not want to receive an email when EVERY Windows Service starts or stops.

To get a little more specific we will have to edit the XML data associated with the Windows Event Filter when we set up the trigger to look a little deeper at the Event Properties and filter on the EventData that is only shown when you view the XML View on the Details tab of a Windows Event.

This work was verified on Windows Server 2016, but I suspect it should work on Windows Server 2012 R2 and Windows Server 2019 as well. If you get it working on any other platforms please comment and let us know if you had to change anything.

Step 1 – Write a Powershell Script

The first thing that you need to do is write a Powershell script that when run can send an email. While researching this I discovered many ways to accomplish this task, so what I’m about to show you is just one way, but feel free to experiment and use what is right for your environment.

In my lab I do not run my own SMTP server, so I had to write a script that could leverage my Gmail account. You will see in my Powershell script the password to the email account that authenticates to the SMTP server is in plain text. If you are concerned that someone may have access to your script and discover your password then you will want to encrypt your credentials. Gmail requires and SSL connection so your password should be safe on the wire, just like any other email client.

Here is an example of a Powershell script that when used in conjunction with Task Scheduler which will send an email alert automatically when any specified Event is logged in the Windows Event Log. In my environment I saved this script to C:\Alerts\ServiceAlert.ps1

$filter="*[System[EventID=7036] and EventData[Data='SIOS DataKeeper']]"
$A = Get-WinEvent -LogName System -MaxEvents 1 -FilterXPath $filter
$Message = $A.Message
$EventID = $A.Id
$MachineName = $A.MachineName
$Source = $A.ProviderName


$EmailFrom = "sios@medfordband.com"
$EmailTo = "sios@medfordband.com"
$Subject ="Alert From $MachineName"
$Body = "EventID: $EventID`nSource: $Source`nMachineName: $MachineName `n$Message"
$SMTPServer = "smtp.gmail.com"
$SMTPClient = New-Object Net.Mail.SmtpClient($SmtpServer, 587)
$SMTPClient.EnableSsl = $true
$SMTPClient.Credentials = New-Object System.Net.NetworkCredential("sios@medfordband.com", "MySMTPP@55w0rd");
$SMTPClient.Send($EmailFrom, $EmailTo, $Subject, $Body)

An example of an email generated from that Powershell script looks like this.

Service Alert Email

You probably noticed that this Powershell script uses the Get-WinEvent cmdlet to grab the most recent Event Log entry based upon the LogName, EventID and EventData specified. It then parses that event and assigns the EventID, Source, MachineName and Message to variables that will be used to compose the email. You will see that the LogName, EventID and EventData specified is the same as what you will specify when you set up the Scheduled Task in Step 2.

While EventID, LogName are probably familiar to you, EventData may not be as familiar. To see the EventData associated with a particular Event you will need to open the Event in Event Viewer, look at the Details tab and then select XML view. From the XML view you can see all the data included with an event. Near the bottom of the XML you will see an array of data called <EventData>. Within there you will find additional Event Data stored as parameters. As show below, in the “param1” we will find the name of the Service being that either stopped or started.

Event Data

Step 2 – Set Up a Scheduled Task

In Task Scheduler Create a Task as show in the following screen shots.

  1. Create Task
    Create TaskMake sure the task is set to Run whether the user is logged on or not.
    Service - General
  2.  On the Triggers tab choose New to create a Trigger that will begin the task “On an Event”. In my example I will be creating an event that triggers any time DataKeeper (extmirr) logs an important event to the System log.
    Create Task 3
    Create a custom event and New Event Filter as shown below…

    Create Task - Trigger

    For my trigger you can start my setting up a trigger that monitors 7036 as I describe in my previous article. However, the Filter GUI interface does not allow us to specify the Service Name stored in Param1 of EventData as I described earlier. In order to monitor for just the specific service we are interested in we will need to edit the XML directly as shown below.

    Service - XML
    If you rather just skip straight to the chase feel free to copy my XML below and replace ‘SIOS DataKeeper’ with the event data stored in param1 of the Event you want to monitor.

    <QueryList>
    <Query Id="0" Path="System">
    <Select Path="System">*[System[(Level=4 or Level=0) and (EventID=7036)]] and *[EventData[Data[1]='SIOS DataKeeper']]</Select>
    </Query>
    </QueryList>
  3. Once the Event Trigger is configured, you will need to configure the Action that occurs when the event is run. In our case we are going to run the Powershell script that we created in Step 1.
    Actions - 2

    Service - Task

  4. The default Condition parameters should be sufficient.
    Conditions - 1
  5. And finally, on the Settings tab make sure you allow the task to be run on demand and to “Queue a new instance” if a task is already running.

    2018-10-28_00-17-27

Step 3 (if necessary) – Fix the Microsoft-Windows-DistributedCOM Event ID: 10016 Error

In theory, if you did everything correctly you should now start receiving emails any time one of the events you are monitoring gets logged in the event log.  However, I ran into a weird permission issue on one of my servers that I had to address before everything worked. I’m not sure if you will run into this issue, but just in case here is the fix.

In my case when I manually triggered the event, or if I ran the Powershell script directly, everything worked as expected and I received an email. However, if one of the EventIDs being monitored was logged into the event log it would not result in an email being sent. The only clue I had was the Event ID: 10016 that was logged in my Systems event log each time I expected the Task Trigger to detect a logged event.

Log Name: System
Source: Microsoft-Windows-DistributedCOM
Date: 10/27/2018 5:59:47 PM
Event ID: 10016
Task Category: None
Level: Error
Keywords: Classic
User: DATAKEEPER\dave
Computer: sql1.datakeeper.local
Description:
The application-specific permission settings do not grant Local Activation permission for the COM Server application with CLSID 
{D63B10C5-BB46-4990-A94F-E40B9D520160}
and APPID 
{9CA88EE3-ACB7-47C8-AFC4-AB702511C276}
to the user DATAKEEPER\dave SID (S-1-5-21-25339xxxxx-208xxx580-6xxx06984-500) from address LocalHost (Using LRPC) running in the application container Unavailable SID (Unavailable). This security permission can be modified using the Component Services administrative tool.

Many of the Google search results for that error indicate that the error is benign and include instructions on how to suppress the error instead of fixing it. However, I was pretty sure this error was the cause of my current failure to be able to send an email alert from a Scheduled Event that was triggered from a monitored Event Log entry, so I needed to fix it.

After much searching, I stumbled upon this newsgroup discussion.  The response from Marc Whittlesey pointed me in the right direction. This is what he wrote…

There are 2 registry keys you have to set permissions before you go to the DCOM Configuration in Component services: CLSID key and APPID key.

I suggest you to follow some steps to fix issue:

1. Press Windows + R keys and type regedit and press Enter.
2. Go to HKEY_Classes_Root\CLSID\*CLSID*.
3. Right click on it then select permission.
4. Click Advance and change the owner to administrator. Also click the box that will appear below the owner line.
5. Apply full control.
6. Close the tab then go to HKEY_LocalMachine\Software\Classes\AppID\*APPID*.
7. Right click on it then select permission.
8. Click Advance and change the owner to administrators.
9. Click the box that will appear below the owner line.
10. Click Apply and grant full control to Administrators.
11. Close all tabs and go to Administrative tool.
12. Open component services.
13. Click Computer, click my computer, and then click DCOM.
14. Look for the corresponding service that appears on the error viewer.
15. Right click on it then click properties.
16. Click security tab then click Add User, Add System then apply.
17. Tick the Activate local box.

So use the relevant keys here and the DCOM Config should give you access to the greyed out areas:
CLSID {D63B10C5-BB46-4990-A94F-E40B9D520160}

APPID {9CA88EE3-ACB7-47C8-AFC4-AB702511C276}

I was able to follow Steps 1-15 pretty much verbatim. However, when I got to Step 16 I really couldn’t tell exactly what he wanted me to do. At first I granted the DATAKEEPER\dave user account Full Control to the RuntimeBroker, but that didn’t fix things. Eventually I just selected “Use Default” on all three permissions and that fixed the issue.

RuntimeBroker
I’m not sure how or why this happened, but I figured I better write it all down in case it happens again because it took me a while to figure it out.

Step 4 – Automating the Deployment

If you need to enable the same alerts on multiple systems you can simply export your Task to an XML file and Import it on your other systems.

ExportImport

Or even better yet, automate the Import as part of your build process through a Powershell script after making your XML file available on a file share as shown in the following example.

PS C:\> Register-ScheduledTask -Xml (get-content '\\myfileshare\tasks\DataKeeperAlerts.xml' | out-string) -TaskName "DataKeeper Service Alerts" -User datakeeper\dave -Password MyDomainP@55W0rd –Force

 

In Summary

Hopefully what I have provided will give you everything you need to start receiving alert notification emails on whichever Windows Services keep you up at night.

This concludes my series on configuring email alerts. In this series I covered covered configuring alerts based on Perfmon counters, Event Log Entries and in this article Windows Service Start and Stop events. Of course you can extend these Powershell scripts described in these articles to do more than just send emails. Many alerts or unexpected service stoppages generally require some remediation, so why not just script out the recovery steps and let the triggered task take care of the issue for you?

Personally I recommend that you invest in SCOM , SolarWinds or some other Enterprise Management System, but if that is not in the cards where you work then these articles can help in a pinch.

Step-by-Step: How to Trigger an Email Alert when a Specific Windows Service Starts or Stops on Windows Server 2016

Step-by-Step: How to Trigger an Email Alert from a Windows Event that Includes the Event Details using Windows Server 2016

Introduction

Setting up an email alert is as simple as creating a Windows Task that is triggered by an Event. You then must specify the action that will occur when that Task is triggered. Since Microsoft has decided to deprecate the “Send an e-mail” option the only choice we have is to Start a Program. In our case that program will be a Powershell script that will collect the Event Log information and parse it so that we can send an email that includes important Log Event details.

This work was verified on Windows Server 2016, but I suspect it should work on Windows Server 2012 R2 and Windows Server 2019 as well. If you get it working on any other platforms please comment and let us know if you had to change anything.

Step 1 – Write a Powershell Script

The first thing that you need to do is write a Powershell script that when run can send an email. While researching this I discovered many ways to accomplish this task, so what I’m about to show you is just one way, but feel free to experiment and use what is right for your environment.

In my lab I do not run my own SMTP server, so I had to write a script that could leverage my Gmail account. You will see in my Powershell script the password to the email account that authenticates to the SMTP server is in plain text. If you are concerned that someone may have access to your script and discover your password then you will want to encrypt your credentials. Gmail requires and SSL connection so your password should be safe on the wire, just like any other email client.

Here is an example of a Powershell script that when used in conjunction with Task Scheduler which will send an email alert automatically when any specified Event is logged in the Windows Event Log. In my environment I saved this script to C:\Alerts\DataKeeper.ps1

$EventId = 16,20,23,150,219,220

$A = Get-WinEvent -MaxEvents 1  -FilterHashTable @{Logname = "System" ; ID = $EventId}
$Message = $A.Message
$EventID = $A.Id
$MachineName = $A.MachineName
$Source = $A.ProviderName


$EmailFrom = "sios@medfordband.com"
$EmailTo = "sios@medfordband.com"
$Subject ="Alert From $MachineName"
$Body = "EventID: $EventID`nSource: $Source`nMachineName: $MachineName `nMessage: $Message"
$SMTPServer = "smtp.gmail.com"
$SMTPClient = New-Object Net.Mail.SmtpClient($SmtpServer, 587)
$SMTPClient.EnableSsl = $true
$SMTPClient.Credentials = New-Object System.Net.NetworkCredential("sios@medfordband.com", "mySMTPP@55w0rd");
$SMTPClient.Send($EmailFrom, $EmailTo, $Subject, $Body)

An example of an email generated from that Powershell script looks like this.

Email1

You probably noticed that this Powershell script uses the Get-WinEvent cmdlet to grab the most recent Event Log entry based upon the LogName, Source and eventIDs specified. It then parses that event and assigns the EventID, Source, MachineName and Message to variables that will be used to compose the email. You will see that the LogName, Source and eventIDs specified are the same as the ones you will specify when you set up the Scheduled Task in Step 2.

Step 2 – Set Up a Scheduled Task

In Task Scheduler Create a Task as show in the following screen shots.

  1. Create Task
    Create Task

    Make sure the task is set to Run whether the user is logged on or not.
    DataKeeper Alerts

  2.  On the Triggers tab choose New to create a Trigger that will begin the task “On an Event”. In my example I will be creating an event that triggers any time DataKeeper (extmirr) logs an important event to the System log.
    Create Task 3
    Create a custom event and New Event Filter as shown below…

    Create Task - Trigger

    For my trigger I am triggering on commonly monitored SIOS DataKeeper (ExtMirr) EventIDs 16, 20, 23,150,219,220 . You will need to set up your event to trigger on the specific Events that you want to monitor. You can put multiple Triggers in the same Task if you want to be notified about events that come from different logs or sources.

    Edit Event Filter
    Create a New Event Filter

     

  3. Once the Event Trigger is configured, you will need to configure the Action that occurs when the event is run. In our case we are going to run the Powershell script that we created in Step 1.
    Actions - 2

    Edit - Actions

  4. The default Condition parameters should be sufficient.
    Conditions - 1
  5. And finally, on the Settings tab make sure you allow the task to be run on demand and to “Queue a new instance” if a task is already running.

    2018-10-28_00-17-27

Step 3 (if necessary) – Fix the Microsoft-Windows-DistributedCOM Event ID: 10016 Error

In theory, if you did everything correctly you should now start receiving emails any time one of the events you are monitoring gets logged in the event log.  However, I ran into a weird permission issue on one of my servers that I had to address before everything worked. I’m not sure if you will run into this issue, but just in case here is the fix.

In my case when I manually triggered the event, or if I ran the Powershell script directly, everything worked as expected and I received an email. However, if one of the EventIDs being monitored was logged into the event log it would not result in an email being sent. The only clue I had was the Event ID: 10016 that was logged in my Systems event log each time I expected the Task Trigger to detect a logged event.

Log Name: System
Source: Microsoft-Windows-DistributedCOM
Date: 10/27/2018 5:59:47 PM
Event ID: 10016
Task Category: None
Level: Error
Keywords: Classic
User: DATAKEEPER\dave
Computer: sql1.datakeeper.local
Description:
The application-specific permission settings do not grant Local Activation permission for the COM Server application with CLSID 
{D63B10C5-BB46-4990-A94F-E40B9D520160}
and APPID 
{9CA88EE3-ACB7-47C8-AFC4-AB702511C276}
to the user DATAKEEPER\dave SID (S-1-5-21-25339xxxxx-208xxx580-6xxx06984-500) from address LocalHost (Using LRPC) running in the application container Unavailable SID (Unavailable). This security permission can be modified using the Component Services administrative tool.

Many of the Google search results for that error indicate that the error is benign and include instructions on how to suppress the error instead of fixing it. However, I was pretty sure this error was the cause of my current failure to be able to send an email alert from a Scheduled Event that was triggered from a monitored Event Log entry, so I needed to fix it.

After much searching, I stumbled upon this newsgroup discussion.  The response from Marc Whittlesey pointed me in the right direction. This is what he wrote…

There are 2 registry keys you have to set permissions before you go to the DCOM Configuration in Component services: CLSID key and APPID key.

I suggest you to follow some steps to fix issue:

1. Press Windows + R keys and type regedit and press Enter.
2. Go to HKEY_Classes_Root\CLSID\*CLSID*.
3. Right click on it then select permission.
4. Click Advance and change the owner to administrator. Also click the box that will appear below the owner line.
5. Apply full control.
6. Close the tab then go to HKEY_LocalMachine\Software\Classes\AppID\*APPID*.
7. Right click on it then select permission.
8. Click Advance and change the owner to administrators.
9. Click the box that will appear below the owner line.
10. Click Apply and grant full control to Administrators.
11. Close all tabs and go to Administrative tool.
12. Open component services.
13. Click Computer, click my computer, and then click DCOM.
14. Look for the corresponding service that appears on the error viewer.
15. Right click on it then click properties.
16. Click security tab then click Add User, Add System then apply.
17. Tick the Activate local box.

So use the relevant keys here and the DCOM Config should give you access to the greyed out areas:
CLSID {D63B10C5-BB46-4990-A94F-E40B9D520160}

APPID {9CA88EE3-ACB7-47C8-AFC4-AB702511C276}

I was able to follow Steps 1-15 pretty much verbatim. However, when I got to Step 16 I really couldn’t tell exactly what he wanted me to do. At first I granted the DATAKEEPER\dave user account Full Control to the RuntimeBroker, but that didn’t fix things. Eventually I just selected “Use Default” on all three permissions and that fixed the issue.

RuntimeBroker
I’m not sure how or why this happened, but I figured I better write it all down in case it happens again because it took me a while to figure it out.

Step 4 – Automating the Deployment

If you need to enable the same alerts on multiple systems you can simply export your Task to an XML file and Import it on your other systems.

ExportImport

Or even better yet, automate the Import as part of your build process through a Powershell script after making your XML file available on a file share as shown in the following example.

PS C:\> Register-ScheduledTask -Xml (get-content '\\myfileshare\tasks\DataKeeperAlerts.xml' | out-string) -TaskName "DataKeeperAlerts" -User datakeeper\dave -Password MyDomainP@55W0rd –Force

 

In Summary

Hopefully what I have provided will give you everything you need to start receiving alert notification emails on whichever Event Log entries keep you up at night.

In my next post I will show you how to be notified when a specified Service either starts or stops. Of course you could just monitor for EventID 7036 from Service Control Monitor, but that would notify you whenever ANY service starts or stops. We will need to dig a little deeper to make sure we get notified only when the services we care about start or stop.

Step-by-Step: How to Trigger an Email Alert from a Windows Event that Includes the Event Details using Windows Server 2016

Azure Outage Post-Mortem Part 3

My previous blog posts, Azure Outage Post-Mortem – Part 1 and Azure Outage Post-Mortem Part 2,made some assumptions based upon limited information coming from blog posts and twitter. I just attended a session at Ignite which gave a little more clarity as to what actually happened. Sometime tomorrow you should be able to view the session for yourself.

BRK3075 – Preparing for the unexpected: Anatomy of an Azure outage

The official Root Cause Analysis they said will be published soon, but in the meantime here are some tidbits of information gleaned from the session.

The outage was NOT caused by a lightning strike as previously reported. Instead, due to the nature of the storm there were electrical storm sags and swells, which locked out a chiller plant in the 1st datacenter. During this first outage they were able to recover the chiller quickly with no noticeable impact. Shortly thereafter, there was a second outage at a second datacenter which was not recovered properly, which began an unfortunate series of events.

During this 2nd outage, Microsoft states that “Engineers didn’t triage alerts correctly – chiller plant recovery was not prioritized”. There were numerous alerts being triggered at this time, and unfortunately the chiller being offline did not receive the priority it should have. The RCA as to why that happened is still being investigated.

Microsoft states that of course redundant chiller systems are in place. However, the cooling systems were not set to automatically failover. Recently installed new equipment had not been fully tested, so it was set to manual mode until testing had been completed.

After 45 minutes the ambient cooling failed, hardware shutdown, air handlers shut down because they thought there was a fire, and staff had been evacuated due to the false fire alarm. During this time temperature in the data center was increasing and some hardware was not shut down properly, causing damage to some storage and networking.

After manually resetting the chillers and opening the air handlers the temperature began to return to normal. It took about 3 hours and 29 minutes before they had a complete picture of the status of the datacenter.

The biggest issue was there was damage to storage. Microsoft’s primary concern is data protection, so short of the enter datacenter sinking into a sinkhole or a meteor strike taking out the datacenter, Microsoft will work to recover data to ensure no data loss. This of course took some time, which extend the overall length of the outage. The good news is that no customer data was lost, the bad news is that it seemed like it took 24-48 hours for things to return to normal, based upon what I read on Twitter from customers complaining about the prolonged outage.

Everyone expected that this outage would impact customers hosted in the South Central Region, but what they did not expect was that the outage would have an impact outside of that region. In the session, Microsoft discusses some of the extended reach of the outage.

Azure Service Manager (ASM) – This controls Azure “Classic” resources, AKA, pre-ARM resources. Anyone relying on ASM could have been impacted. It wasn’t clear to me why this happened, but it appears that South Central Region hosts some important components of that service which became unavailable.

Visual Studio Team Service (VSTS) – Again, it appears that many resources that support this service are hosted in the South Central Region. This outage is described in great detail by Buck Hodges (@tfsbuck), Director of Engineering, Azure DevOps this blog post.

Postmortem: VSTS 4 September 2018

Azure Active Directory (AAD) – When the South Central region failed, AAD did what it was designed to due and started directing authentication requests to other regions. As the East Coast started to wake up and online, authentication traffic started picking up. Now normally AAD would handle this increase in traffic through autoscaling, but the autoscaling has a dependency on ASM, which of course was offline. Without the ability to autoscale, AAD was not able to handle the increase in authentication requests. Exasperating the situation was a bug in Office clients which made them have very aggressive retry logic, and no backoff logic. This additional authentication traffic eventually brought AAD to its knees.

They ran out of time to discuss this further during the Ignite session, but one feature that they will be introducing will be giving users the ability to failover Storage Accounts manually in the future. So in the case where recovery time objective (RTO) is more important than (RPO) the user will have the ability to recover their asynchronously replicated geo-redundant storage in an alternate data center should Microsoft experience another extended outage in the future.

Until that time, you will have to rely on other replication solutions such as SIOS DataKeeper Azure Site Recovery, or application specific replication solutions which give you the ability to replicate data across regions and put the ability to enact your disaster recovery plan in your control.

 

 

Azure Outage Post-Mortem Part 3

Azure Outage Post-Mortem Part 2

My previous blog post says that Cloud-to-Cloud or Hybrid-Cloud would give you the most isolation from just about any issue a CSP could encounter. However, in this particular failure had Availability Zones been available in the South Central region most of the downtime caused by this natural disaster could have been avoided. Microsoft published a Preliminary RCA of the September 4th South Central Outage.

The most important part of that whole summary is as follows…

“Despite onsite redundancies, there are scenarios in which a datacenter cooling failure can impact customer workloads in the affected datacenter.”

What does that mean to you? If your applications all run in the same datacenter you are susceptible to the same type of outage in the future. In Microsoft’s defense, this really shouldn’t be news to you as this has always been true whether you run in Azure, AWS, Google or even your own datacenter. Failure to plan ahead with data replication to a different datacenter and a plan in place to quickly recover your applications in those datacenters in the event of a disaster is simply a lack of planning on your part.

While Microsoft doesn’t publish exact Availability Zone locations, if you believe this map published here you could guess that they are probably anywhere from a 2-10 miles apart from each other.

Azure Datacenters.png

In all but the most extreme cases, replicating data across Availability Zones should be sufficient for data protection. Some applications such as SQL Server have built in replication technology, but for a broad range of applications, operating systems and data types you will want to investigate block level replication SANless cluster solutions. SANless cluster solutions have traditionally been used for multisite clusters, but the same technology can also be used in the cloud across Availability Zones, Regions, or Hybrid-Cloud for high availability and disaster recovery.

Implementing a SANless cluster that spans Availability Zones, whether it is Azure, AWS or Google, is a pretty simple process given the right tools. Here are a few resources to help get you started.

Step-by-Step: Configuring a File Server Cluster in Azure that Spans Availability Zones

How to Build a SANless SQL Server Failover Cluster Instance in Google Cloud Platform

MS SQL Server v.Next on Linux with Replication and High Availability #Azure #Cloud #Linux

Deploying Microsoft SQL Server 2014 Failover Clusters in #Azure Resource Manager (ARM)

SANless SQL Server Clusters in AWS

SANless Linux Cluster in AWS Quick Start

If you are in Azure you may also want to consider Azure Site Recovery (ASR). ASR lets you replicate the entire VM from one Azure region to another region. ASR will replicate your VMs in real-time and allow you to do a non-disruptive DR test whenever you like. It supports most versions of Windows and Linux and is relatively easy to set up.

You can also create replication jobs that have “Multi-VM Consistency”, meaning that servers that must be recovered from the exact same point in time can be put together in this consistency group and they will have the exact same recovery point. What this means is if you wanted to build a SANless cluster with DataKeeper in a single region for high availability you have two options for DR. One is you could extend your SANless cluster to a node in a different region, or else you could simply use ASR to replicate both nodes in a consistency group.

asr

The trade off with ASR is that the RPO and RTO is not as good as you will get with a SANless multi-site cluster, but it is easy to configure and works with just about any application. Just be careful, if your application exceeds 10 MBps in disk write activity on a regular basis ASR will not be able to keep up. Also, clusters based on Storage Spaces Direct cannot be replicated with ASR and in general lack a good DR strategy when used in Azure.

For a while after Managed Disks were released ASR did not fully support them until about a year later. Full support for Managed Disks was a big hurdle for many people looking to use ASR. Fortunately since about February of 2018 ASR fully supports Managed Disks. However, there is another problem that was just introduced.

With the introduction of Availability Zones ASR is once again caught behind the times as they currently don’t support VMs that have been deployed in Availability Zones.

2018-09-25_00-10-24
Support matrix for replicating from one Azure region to another

I went ahead and tried it anyway. I seemed to be able to configure replication and was able to do a test failover.

ASR-and-AZ
I used ASR to replicate SQL1 and SQL3 from Central to East US 2 and did a test failover. Other than not placing the VMs in AZs in East US 2 it seems to work.

I’m hoping to find out more about this limitation at the Ignite conference. I don’t think this limitation is as critical as the Managed Disk limitation was, just because Availability Zones aren’t widely available yet. So hopefully ASR will pick up support for Availability Zones as other regions light up Availability Zones and they are more widely adopted.

 

 

Azure Outage Post-Mortem Part 2

Lightning Never Strikes Twice: Surviving the #Azure Cloud Outage

Yesterday morning I opened my Twitter feed to find that many people were impacted by an Azure outage. When I tried to access the resource page that described the outage and the current resources impacted even that page was unavailable. @AzureSupport was providing updates via Twitter.

The original update from @AzureSupport came in at 7:12 AM EDT

Azure Outage 2

Looking back on the Twitter feed it seems as if the problem initially began an hour or two before that.

Azure Support 10

It quickly became apparent that the outages had a wider spread impact than just the SOUTH CENTRAL US region as originally reported. It seems as if services that relied on Azure Active Directory could have been impacted as well and customers trying to provision new subscriptions were having issues.

Azure 11

And 24 hours later the problem has not been completely resolved and it according to the last update this morning…

Azure Outage 1

Untitled design (6)

So what could you have done to minimize the impact of this outage? No one can blame Microsoft for a natural disaster such as a lightning strike. But at the end of the day if your only disaster recovery plan is to call, tweet and email Microsoft until the issue is resolved, you just received a rude awakening. IT IS UP TO YOU to ensure you have covered all the bases when it comes to your disaster recovery plan.

While the dust is still settling on exactly what was impacted and what customers could have done to minimize the downtime, here are some of my initial thoughts.

Availability Sets (Fault Domains/Update Domains) – In this scenario, even if you built Failover Clusters, or leveraged Azure Load Balancers and Availability Sets, it seems the entire region went offline so you still would have been out of luck. While it is still recommended to leverage Availability Sets, especially for planned downtime, in this case you still would have been offline.

Availability Zones – While not available in the SOUTH CENTRAL US region yet, it seems that the concept of Availability Zones being rolled out in Azure could have minimized the impact of the outage. Assuming the lightning strike only impacted one datacenter, the other datacenter in the other Availability Zone should have remained operational. However, the outages of the other non-regional services such as Azure Active Directory (AAD) seems to have impacted multiple regions, so I don’t think Availability Zones would have isolated you completely.

Global Load Balancers, Cross Region Failover Clusters, etc. – Whether you are building SANLess clusters that cross regions, or using global load balancers to spread the load across multiple regions, you may have minimized the impact of the outage in SOUTH CENTRAL US, but you may have still been susceptible to the AAD outage.

Hybrid-Cloud, Cross Cloud – About the only way you could guarantee resiliency in a cloud wide failure scenario such as the one Azure just experienced is to have a DR plan that includes having realtime replication of data to a target outside of your primary cloud provider and a plan in place to bring applications online quickly in this other location. These two locations should be entirely independent and should not rely on services from your primary location to be available, such as AAD. The DR location could be another cloud provider, in this case AWS or Google Cloud Platform seem like logical alternatives, or it could be your own datacenter, but that kind of defeats the purpose of running in the cloud in the first place.

Software as a Service – While Software as service such as Azure Active Directory (ADD), Azure SQL Database (Database-as-Service) or one of the many SaaS offerings from any of the cloud providers can seem enticing, you really need to plan for the worst case scenario. Because you are trusting a business critical application to a single vendor you may have very little control in terms of DR options that includes recovery OUTSIDE of the current cloud service provider. I don’t have any words of wisdom here other than investigate your DR options before implementing any SaaS service, and if recovery outside of the cloud is not an option than think long and hard before you sign-up for that service. Minimally make the business stake owners aware that if the cloud service provider has a really bad day and that service is offline there may be nothing you can do about it other than call and complain.

I think in the very near future you will start to hear more and more about cross cloud availability and people leveraging solutions like SIOS DataKeeper to build robust HA and DR strategies that cross cloud providers. Truly cross cloud or hybrid cloud models are the only way to truly insulate yourself from most conceivable cloud outages.

If you were impacted from this latest outage I’d love to hear from you. Tell me what went down, how long you were down, and what you did to recover. What are you planning to do so that in the future your experience is better?

Lightning Never Strikes Twice: Surviving the #Azure Cloud Outage

“Incomplete Communication with Cluster” with local Storage Space for SQL Server cluster

When building a SANless SQL Server cluster with SIOS DataKeeper, or when configuring Always On Availability Groups for SQL Server, you may consider striping together multiple disk in a Simple Storage Space (RAID 0) for performance. This is very commonly done in the cloud where each instance typically his backed by hardware resiliency, so RAID 0 is not really all that risky.

For instance, I had a recent customer in AWS that wanted to max out his IOPS to 80,000, the maximum IOPS currently available to a single instance. Now keep in mind, only the largest EBS optimized instance sizes supports 80,000 IOPS, so you want to make sure you know what maximum IOPS your particular instance size supports.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html

In this case we had ac5.18xlarge instance which does support 80,000 IOPS. However, any individual EBS Provisioned IOPS volume only supports up to 32,000 IOPS. The only way to achieve 80,000 IOPS when writing to any single volume is to strip three of these volumes together in a Simple Storage Space.

Herein lies the rub, if you try to do that in an existing cluster things are going to go haywire pretty fast. Fellow MVP Joey D’Antoni recently blogged about the issue and it appears to still be an issue in the Windows Server 2019 preview.

Just as Joey suggests, I always advise my customers to build out the nodes and any Storage Spaces BEFORE they start the clustering process. This makes the process go much smoother. It also allows the customer to have some time to benchmark the server’s performance before they add any replication, to  ensure everything is working as expected.

 

 

“Incomplete Communication with Cluster” with local Storage Space for SQL Server cluster

Your student could be the next Doogie Howser of Cloud Computing with free training and cloud computing resources

Students with any interest in Information Technology or Computer Science are going to be joining a world dominated by Cloud Computing. And of course the major cloud service providers (CSP) would all love to see the young people embrace their cloud platform to host the next big thing like Facebook, Instagram or SnapChat. The top three CSP all have free offerings for students, hoping to win their minds and hearts.

But before you jump right in to cloud computing, the novice student might want to start with some basic fundamentals of computer programming at one of the many free online resources, including Khan Academy.

feature_khanacademy.png

Microsoft is offering free Azure services for students. There are two different offerings. The first is targeted at high school students ages 13+ and the second is geared towards college students 18+.

microsoft-azure-1.png

Microsoft Azure for Students Starter Offer is for those high school students that are interested in building applications in the cloud. While there are not as many free services or credits as being offered at the college level, there is certainly enough available for free to really get some hands on experience with some cutting edge technology for the self starter. How cool would it be for your high school to start a Cloud Computing Club, or to integrate this offering into some of the IT classes they may already be taking.

Azure for Students is targeted at the college level student and has many more features available for free. Any student in computer science or information technology should definitely get some hands on experience with these cutting edge cloud technologies and this is the perfect way to do it with no additional out of pocket expense.

A good way to get introduced to the Azure Cloud is to start with some free online training courses Microsoft delivers in partnership with Pluralsight.

logo_aws-educate.812809f63186598d26a56d443d829afa390566d1.png

AWS Educate. Not to be outdone, AWS also offers some free cloud services to students and educators. These seem to be in terms of free cloud credits, which if managed properly can go a long way. AWS also delivers an educational program that can be combined with an AP class in Computer Science if your high school wants to participate.

 

Google-Cloud-Platform.png

Google Cloud Platform (GCP) also has education grants available for computer science majors at accredited universities. These seem to be the most restrictive of the three as they are available for Computer Science Majors only at accredited universities.

GCP does also offer training, but from what I can find I don’t see any free training offerings. If you want some hands on training you will have to register for some classes. The plus side of this is that these classes all seem to be instructor led, either online or in an actual classroom. The downside is I don’t think a lot of 13 year olds are going to shell out any money to start developing on the CGP when there are other free training opportunities available on AWS or Azure.

For the ambitious young student, the resources are certainly there for you to be the next Doogie Howser of Cloud Computing.

Doogie Howser_Cloud MD_.png

 

Your student could be the next Doogie Howser of Cloud Computing with free training and cloud computing resources

Help! I can’t connect to my SQL Server multi-subnet failover cluster

I get that kind of call or email from customers all the time. I have a generic response as follows…

This has everything you need to know.

They don’t go into great detail about what to do if your connection does not support multisubnetfailover=true. If your connection does NOT support that parameter, then set registerallprovidersip to false and cleanup DNS. That procedure is described best here.
I figure I get this question often enough I probably should just flesh out my response a bit, hence the reason for this post.
In general people just aren’t aware of how multi-subnet failover clusters work. Multi-subnet failover clustering support was added in Windows Server 2012 with the addition of the “OR” technology when defining cluster resource dependencies. This allowed people to allow a Cluster Name resource to be dependent upon IP Address x.x.x.x OR IP Address y.y.y.y.
x.x.x.x would be an a cluster IP resource valid in Subnet A and y.y.y.y would be a cluster IP address valid in Subnet B. Only one address will be online at any given time, whichever address was valid for the subnet the resource was currently running on.
Microsoft SQL Server started supporting this concept starting with SQL Server 2012 with both failover cluster instances (FCI) using 3-party SANless clustering solutions like SIOS DataKeeper and SQL Server Always On Availability Groups.
By default if you create a SQL Server multi-subnet failover cluster the cluster should be automatically configured optimally, including setting up the two IP addresses, adding two A records to DNS and setting the registerallprovidersIP to true. However, on the client end you need to tell it that you are connecting to a multi-subnet failover cluster, otherwise the connection won’t be made.

Configuring the client

Configuring the client is done by adding multisubnetfailover=true to the connection string. This Microsoft documentation is a great resource, but if you just search for multisubnetfailover=true you will find a lot of information about that setting.
However, not every application will support adding that to the connection string. If you find yourself in that situation you should ask your application vendor to add support for that or show you how to do it.
However, all is not lost if you find yourself in that situation. You will want to change the behavior of the cluster so that upon failover DNS is update so that the single A record associated with the cluster client access point is updated with the new IP address. This is in lieu of having two A records in DNS, one with each cluster IP address, which is the default behavior in an multi-subnet cluster.
This article reference SharePoint, you can ignore that, the rest of the article is pretty well written to describe the process you should follow.
The highlights of that article are as follows…
Get-ClusterResource “[Network Name]” | Set-ClusterParameter RegisterAllProvidersIP 0
After restarting the cluster-name-object (basically restarting the role) & cleaning up all “A” records manually (clean-up isn’t done automatically) we can see our old A-records are still in DNS so we’ll need to delete those manually.
In addition to those steps I’d advise you to reduce the TTL on the HostRecordTTL as described in this article.
The highlight of that article is as follows.
PS C:\> Get-ClusterResource -Name cluster1FS | Set-ClusterParameter -Name HostRecordTTL -Value 300
With a Value of 300 you could potentially be waiting up to 5 minutes for your clients to reconnect after a failover, or even longer if if have a large Active Directory infrastructure and AD replication takes some time to update all the DNS servers across your infrastructure.
You are going to want to figure out what the optimal TTL is to facilitate quick client reconnections without over burdening your DNS servers with a bunch of DNS Lookup requests.
This type of configuration is common in disaster recovery configurations where your DR site is in a different subnet. It is also very common in HA deployments in AWS because different Availability Zones are in different subnets.
Let me know if you have any questions. You can always reach me on Twitter @daveberm
Help! I can’t connect to my SQL Server multi-subnet failover cluster

STORAGE SPACES DIRECT (S2D) FOR SQL SERVER FAILOVER CLUSTER INSTANCES (FCI)?

With the introduction of Windows Server 2016 Datacenter Edition a new feature called Storage Spaces Direct (S2D) was introduced. At a very high level, this solution allows you to pool together locally attached storage and present it to the cluster as a CSV for use in a Scale Out File Server, which can then be accessed over SMB 3 and used to hold cluster data such as Hyper-V VMDK files. This can also be configured in a hyper-converged (HCI) fashion such that the application and data can all run on the same set of servers.  This is a grossly over-simplified description, but for details, you will want to look here.

Storage Spaces Direct Stack

Image taken from https://docs.microsoft.com/en-us/windows-server/storage/storage-spaces/storage-spaces-direct-overview

The main use case targeted is hyper-converged infrastructure for Hyper-V deployments. However, there are other use cases, including leveraging this SMB storage to store SQL Server Data to be used in a SQL Server Failover Cluster Instance

Why would anyone want to do that? Well, for starters you can now build a highly available 2-node SQL Server Failover Cluster Instance (FCI) with SQL Server Standard Edition, without the need for shared storage. Previously, if you wanted HA without a SAN you pretty much were driven to buy SQL Server Enterprise Edition and make use of Always On Availability Groups or purchase SIOS DataKeeper and leverage the 3rd party solution which lets you build SANless clusters with any version of Windows or SQL Server. SQL Server Enterprise Edition can really drive up the cost of your project, especially if you were only buying it for the Availability Groups feature.

In addition to the cost associated with Availability Groups, there are a number of other technical reasons why you might prefer a Failover Cluster over an AG. Application compatibility, instance vs. database level protection, large number of databases, DTC support, trained staff, etc., are just some of the technical reasons why you may want to stick with a Failover Cluster Instance.

Microsoft lists both the SIOS DataKeeper solution and the S2D solution as two of the supported solutions for SQL Server FCI in their documentation here.

s2d

https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sql/virtual-machines-windows-sql-high-availability-dr

When comparing the two solutions, you have to take into account that SIOS has been allowing you to build SANless Clusters since 1999, while the S2D solution is still in its infancy.  Having said that, there are bound to be some areas where S2D has some catching up to do, or simply features that they will never support simply due to the limitations with the technology.

Have a look at the following table for an overview of some of the things you should consider before you choose your SANless cluster solution.

2018-10-05_21-13-59

If we go through this chart, we see that SIOS DataKeeper clearly has some significant advantages. For one, DataKeeper supports a much wider range of platforms, going all the way back to Windows Server 2008 R2 and SQL Server 2008 R2. The S2D solution only supports the latest releases of Windows and SQL Server 2016/2017. S2D also requires the  Datacenter Edition of Windows, which can add significantly to the cost of your deployment. In addition, SIOS delivers the ONLY HA/DR solution for SQL Server on Linux that works both on-prem and in the cloud.

I’ve been talking to a lot of customers recently who are reporting some performance issues with S2D. When I tested S2D vs. DataKeeper about a year ago I didn’t see any significant differences in performance, but I did see S2D used about 2x the amount of CPU resources under the same load. This probably has to do with the high hardware requirements associated with S2D such as RDMA enabled networking and available Flash Storage, typically only available in the most expensive cloud based images.

“We recommend the I3 instance size because it satisfies the S2D hardware requirements and includes the largest and fastest instance store devices available.”

But beyond the cost and platform limitations, I think the most glaring gap comes when we start to consider that S2D does not support Availability Zones or disaster recovery configurations such as multi-site clusters or Azure Site Recovery (ASR). Allan Hirt, SQL Server Cluster guru and fellow Microsoft Cloud and Datacenter Management MVP, recently posted about this S2D limitation. In his article Revisiting Storage Spaces Direct and SQL Server FCIs  Allan points out that due to the lack of support for stretching S2D clusters across sites or including an S2D based cluster as a leg in an Always On Availability Group, the best option for DR in the S2D scenario is log shipping! This even includes replicating across Availability Zones in either Azure or AWS.

Microsoft does not make it clear in their documentation, but Microsoft’s own PM for High Availability and Storage makes it perfectly clear in the Microsoft forums.

AWS also documents S2D’s lack of Availability Zone support…

“Each cluster node must be deployed in a different subnet. This architecture will be deployed into a single availability zone because Microsoft does not currently support stretch cluster with Storage Spaces Direct. ” – AWS Documentation on S2D

Deploying S2D cluster nodes within the same Availability Zone defeats the purpose of failover clustering and the deployment does not qualify for the AWS 99.99% SLA. Even if you wanted to deploy S2D in a single Availability Zone the deployment becomes even more complicated because it is recommended that you deploy at least three cluster nodes and each node must reside in its own subnet due to some AWS networking restrictions that requires each cluster node reside in a different subnet. S2D was never designed to run in different subnets, which further complicates the solution in terms of client redirection.

In contrast, the SIOS DataKeeper solution fully supports Always On Availability Groups, and better yet – it can allow you to stretch your FCI across sites to give you the best HA/DR solution you could hope to achieve in terms of RTO/RPO. DataKeeper supports Availability Zones and DR configurations that cross cloud regions. In an Azure environment, DataKeeper also support Azure Site Recovery (ASR), giving you even more options for disaster recovery.

Further complicating any S2D deployment in AWS is the reliance on “local instance store” storage, AKA, non-persistent ephemeral disks.

“The best performance for storage can be achieved using I3 instances because they provide local instance store with NVMe and high network performance”

Reliance on ephemeral storage puts your data at risk any time a disk rebuilds, which can happen at any time, but always happens when an instance is stopped. If a disk is lost and a second disk is lost before the first disk rebuilds you are looking at complete data loss and a restore from backup. If someone accidentally stops all the nodes in your cluster your data will be lost! Even if you take care to only stop one node at a time if you are not paying attention and waiting for a disk to complete a rebuild after you stop the second node you will also experience complete data loss!

The rest of this chart is pretty self explanatory. It basically consist of a list hardware, storage and networking requirements that must be met before you can deploy an S2D cluster. An exhaustive list of S2D requirements is maintained here.  https://docs.microsoft.com/en-us/windows-server/storage/storage-spaces/storage-spaces-direct-hardware-requirements

The SIOS DataKeeper solution is much more lenient. It supports any locally attached storage and as long as the hardware passes cluster validation, it is a supported cluster configuration. The block level replication solution has been working great ever since 1 Gbps was considered a fast LAN and a T1 WAN connection was considered a luxury.

SANless clustering is particularly interesting for cloud deployments. The cloud does not offer traditional shared storage options for clusters. So for users in the middle of a “lift and shift” to the cloud that want to take their clusters with them they must look at alternate storage solutions. For cloud deployments, SIOS is certified for AzureAWS and Google and available in the relevant cloud marketplace. While there doesn’t appear to be anything blocking deployment of S2D based clusters in AWS or Google, there is a conspicuous lack of documentation or supportability statements from Microsoft for those platforms.

SIOS DataKeeper has been doing this since 1999. SIOS has heard all the feature requests, uncovered all the bugs, and has a rock solid solution for SANless clusters that is time tested and proven. While Microsoft S2D is a promising technology, as a 1st generation product I would wait until the dust settles and some of the feature gap closes before I would consider it for my business critical applications.

STORAGE SPACES DIRECT (S2D) FOR SQL SERVER FAILOVER CLUSTER INSTANCES (FCI)?

#Azure Storage Service Interruption…Time for “Plan B”

Yesterday evening Pacific Standard Time, Azure storage services experienced a service interruption across the United States, Europe and parts of Asia, which impacted multiple cloud services in these regions.

As part of a performance update to Azure Storage, an issue was discovered that resulted in reduced capacity across services utilizing Azure Storage, including Virtual Machines, Visual Studio Online, Websites, Search and other Microsoft services.

Read the whole report on the Azure blog. http://azure.microsoft.com/blog/2014/11/19/update-on-azure-storage-service-interruption/

So what does this outage mean to those thinking about a cloud deployment? Global “interruptions” of this magnitude certainly cannot occur on any regular basis for any cloud provider that intends to remain in the cloud business, whether they are Microsoft, Amazon, Google or other. However, as a cloud architect or person responsible for a cloud deployment, you have a responsibility to your customer to have a “Plan B” in your back pocket in case the worst case scenario actually happens.

What exactly is a “Plan B”? Plan B involves having a documented procedure for recovering data and services in an alternate location in the event of a wide spread outage that impacts a cloud provider’s ability to deliver their service, despite deploying what you thought was a highly resilient cloud deployment designed to keep running even in the event of localized outages within a region, availability zone or fault domain.

At a high level you should be concerned about three things: Data Recovery, Application Recovery, and Client Access. There are many ways to address these concerns, some more automated than others and some with a better Recovery Time Objective (RTO) and Recovery Point Objective (RPO) than others.

It was just last week that I blogged about how to create a multisite cluster that stretched between the AWS cloud and the Azure cloud. This type of configuration is just what is needed in the event of an outage of the magnitude that we just experienced yesterday in the Azure cloud. https://clusteringformeremortals.com/2014/11/18/cloud-resiliency-for-sqlserver-failover-clusters-aws-to-azure-multisite-cluster/

Figure 1 – Example of a Cloud-to-Cloud Multisite Cluster Configuration

Another alternative to the “cloud-to-cloud” replication model is of course utilizing your own datacenter as a disaster recovery site for your cloud deployment. The advantages of this is that you have physical ownership of your data, but of course now you are back in the business of managing a datacenter, which can negate some of the benefit of a pure cloud deployment.

Figure 2 – Hybrid Cloud Deployment Model

If you are not ready to go full on cloud, you can still make use of the cloud as a disaster recovery site. This is probably the easiest and most cost effective way to implement an offsite datacenter for disaster recovery and to start taking advantage of what the cloud has to offer without fully committing to moving all your workloads into the cloud.

Figure 3 – Using the Cloud as a Disaster Recovery Site

The illustrations shown above make use of the host based replication solution called DataKeeper Cluster Edition to build multisite SQL Server clusters. However, DataKeeper can be used to keep any data in sync, either between different cloud providers or in the hybrid cloud model.

Microsoft is not alone in dealing with cloud outages as outages have impacted Google, Microsoft, Amazon, DropBox and many others just this year alone. Having a “Plan B” in place is a must have anytime you are relying on any cloud service.

#Azure Storage Service Interruption…Time for “Plan B”