TICK TOCK…6 MONTHS UNTIL SQL SERVER 2008/2008 R2 SUPPORT EXPIRES UNLESS YOU TAKE ACTION

If you are still running SQL Server 2008/2008 R2, you have probably heard by now that as of July 9, 2019, you will no longer be supported. However, realizing that there are still a significant number of customers running on this platform that will not be able to upgrade to a newer version of SQL Server before that deadline, Microsoft has offered two options to provide extended security updates for an additional three years.

The first option requires the annual purchase of “Extended Security Updates”. Extended Security Updates cost 75% of the full license cost annually and also require that the customer is on active Software Assurance, which is typically 25% of the license cost annually. So effectively, to receive Extended Security Updates you are paying the equivalent of new SQL Server licenses every year for three years, or until you migrate off SQL Server 2008/2008 R2.

However, there is a second option. Microsoft has announced that if you move your SQL Server 2008/2008 R2 instances to Azure, you will receive the Extended Security Updates at no additional charge. There are, of course, the hourly infrastructure charges you will incur in Azure, plus either the cost of pay-as-you-go SQL Server instances or the Software Assurance charges if you want to bring your existing SQL licenses to Azure. But that cost includes the added benefit of running in a state-of-the-art cloud environment, which opens up opportunities for enhanced performance and HA/DR scenarios that you may not have had available on premises.

Azure offers many different options in terms of CPU, memory, and storage configurations. If you are looking for a server or storage upgrade, or your existing on-premises infrastructure is reaching a refresh cycle, now is the perfect time to dip your toes into the Azure cloud and upgrade your performance and availability at the same time as extending the life of your SQL Server 2008/2008 R2 deployment.

In terms of high availability and disaster recovery configurations, Azure offers up to a 99.99% SLA. To qualify for the SLA you must leverage their infrastructure appropriately, and even then, the SLA only covers “dial tone” to the instance. It is up to you to ensure SQL Server itself is highly available, which is traditionally done by building a SQL Server Failover Cluster Instance (FCI). Azure has the infrastructure in place which enables you to configure a SQL Server FCI, but due to the lack of cluster-aware shared storage in the cloud, you will need to use SIOS DataKeeper to build the FCI.

SIOS DataKeeper takes the place of the shared storage normally required by a SQL Server FCI and instead allows you to leverage any NTFS-formatted volumes attached to each instance. SIOS keeps the volumes replicated between the instances and presents the storage to the cluster as a resource called a DataKeeper Volume. As far as the cluster is concerned, the DataKeeper Volume looks like a shared disk, but instead of controlling SCSI reservations (disk locking), it controls the mirror direction, ensuring writes occur on the active server and are synchronously or asynchronously replicated to the other cluster nodes. The end-user experience is exactly the same as a traditional shared storage cluster, but under the covers the cluster is leveraging locally attached storage instead of shared storage.
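If you want to verify this from the cluster’s point of view once it is built, the standard FailoverClusters PowerShell module will show the DataKeeper Volume resources just like any other cluster resource. Here is a quick sanity check (the resource type name below follows the description above; verify it against your own installation):

    Import-Module FailoverClusters

    # List the DataKeeper Volume resources and where each one is currently online
    Get-ClusterResource |
        Where-Object { $_.ResourceType -eq "DataKeeper Volume" } |
        Format-Table Name, State, OwnerNode, OwnerGroup -AutoSize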

In Azure your cluster nodes can run in different racks (Fault Domains), datacenters (Availability Zones), or even different geographic regions. SIOS DataKeeper supports all three options: Fault Domain, Availability Zone, or cross-region replication, covering both HA and DR requirements. Similar configurations are also possible in AWS and Google Cloud.

[Image: Typical 2-node SQL Server FCI configuration in Azure with SIOS DataKeeper]

With Azure Site Recovery (ASR) you can replicate standalone or clustered instances of SQL Server between region pairs, without the headache and expense of managing your own disaster recovery site. And of course SQL Server seldom lives alone, so when you move your SQL Server instance to Azure you will probably want to move your application servers there as well, to take advantage of the same performance and availability upgrades. Combining SIOS DataKeeper for HA and ASR for DR provides a cost-effective HA and DR strategy that would have been impossible, or extremely expensive, to implement on premises with SAN replication and your own DR site.

[Image: Common configuration leveraging SIOS DataKeeper for HA and Azure Site Recovery for DR]

While it only takes a few minutes to spin up a SQL Server instance in Azure, I wouldn’t wait until the last minute to do your migration. Please take the next few months to become familiar with Azure, start doing some testing, and then plan to migrate your workloads well before the July 9, 2019 expiration date. Running SQL Server after that date leaves you susceptible to any new security threats and also puts you out of compliance. Your boss, and more importantly your customers, will be glad to know that their data is still secure, available, and in compliance once you migrate your workload to Azure.


Moving SQL Server 2008 and 2008 R2 clusters to #Azure for Extended Support

Earlier this year Microsoft announced extended support for SQL Server 2008 and 2008 R2 at no additional cost. However, the catch is that you must migrate your SQL Server installation to Azure in order to take advantage of the extended support. For all the details, check out https://www.microsoft.com/en-us/sql-server/sql-server-2008. If you choose not to move, your extended support ends on July 9th, 2019, just about 9 months from now.


Chances are, if you are still running SQL Server 2008 R2, it’s because you never upgraded your application, so newer versions of SQL Server are not supported, or you simply decided not to fix what isn’t broken. Regardless of the reason, you have just bought yourself another three years of support, provided you migrate to Azure.

Now, migrating workloads to Azure with Azure Site Recovery is a pretty well documented procedure, so that process should be fairly seamless for your standalone instances of SQL Server.

But what about those clustered instances of SQL Server? You certainly don’t want to give up availability when you move to Azure. Part of the beauty of Azure is that they have infrastructure you can only dream of. However, it is incumbent upon you to configure your applications to take full advantage of that infrastructure and ensure your deployments are highly available.

With SQL Server 2008 and 2008 R2, high availability commonly means SQL Server Failover Clustering on either Windows Server 2008 R2 or Windows Server 2012 R2. If you are new to Azure you will quickly discover that there is no native option that supports shared storage clusters. Instead, you will need to look at a SANless cluster solution such as SIOS DataKeeper. Microsoft lists SIOS DataKeeper as the HA solution for SQL Server Failover Clustering in their documentation.

https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sql/virtual-machines-windows-sql-high-availability-dr

To facilitate a simple migration of your existing SQL Server 2008 or 2008 R2 cluster to Azure, here are the high-level steps you will need to take.

  • Replace the Physical Disk resource in your existing on-premises SQL Server cluster with a DataKeeper Volume resource. Do the same for MSDTC resources if you use MSDTC.
  • Remove your Disk Witness and replace it with a File Share Witness
  • Use Azure Site Recovery to replicate your cluster nodes into Azure, making sure each replicated node resides in a different Fault Domain or in different Availability Zones in Azure
  • Recover your replicated cluster nodes in Azure
  • Replace the File Share Witness with a File Share hosted in Azure
  • Configure the Internal Load Balancer in Azure for client redirection; this includes running the PowerShell script on the local nodes to update the SQL cluster IP resource to listen for the ILB probe (see the sketch after this list)
  • Assuming the IP addresses and subnet of the SQL Server cluster instances changed as part of this migration, you will also need to do some cleanup of the cluster IP address and the DataKeeper job endpoints to reflect the new IP addresses
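For reference, the probe script mentioned above is the one Microsoft documents for SQL Server clusters behind an Azure Internal Load Balancer. Here is a sketch; the cluster network name, IP resource name, and ILB IP are placeholders you will need to pull from your own environment (Get-ClusterNetwork and Get-ClusterResource will show them), and the IP resource needs to be taken offline and back online for the change to take effect:

    Import-Module FailoverClusters

    # Placeholders - substitute your actual values
    $ClusterNetworkName = "Cluster Network 1"
    $IPResourceName     = "SQL IP Address 1 (SQLCluster)"
    $ILBIP              = "10.0.0.101"   # static IP assigned to the ILB front end

    # Point the cluster IP at the ILB address and listen on the ILB health probe port
    Get-ClusterResource $IPResourceName | Set-ClusterParameter -Multiple @{
        Address    = $ILBIP
        ProbePort  = 59999               # must match the ILB health probe
        SubnetMask = "255.255.255.255"
        Network    = $ClusterNetworkName
        EnableDhcp = 0
    }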

I know I left out a lot of the details, but if you find yourself in the position of having to do a lift and shift of SQL Server to Azure, or any cloud for that matter, I’d be glad to get on the phone with you to answer any questions you may have. Keep in mind, the same steps apply for any version of SQL that you plan to migrate to Azure.

 


Azure Outage Post-Mortem Part 3

My previous blog posts, Azure Outage Post-Mortem – Part 1 and Azure Outage Post-Mortem Part 2, made some assumptions based upon limited information coming from blog posts and Twitter. I just attended a session at Ignite which gave a little more clarity as to what actually happened. Sometime tomorrow you should be able to view the session for yourself.

BRK3075 – Preparing for the unexpected: Anatomy of an Azure outage

They said the official Root Cause Analysis will be published soon, but in the meantime here are some tidbits of information gleaned from the session.

The outage was NOT caused by a lightning strike as previously reported. Instead, the nature of the storm produced electrical sags and swells, which locked out a chiller plant in the first datacenter. During this first outage they were able to recover the chiller quickly, with no noticeable impact. Shortly thereafter there was a second outage at a second datacenter, which was not recovered properly and which began an unfortunate series of events.

During this second outage, Microsoft states that “Engineers didn’t triage alerts correctly – chiller plant recovery was not prioritized”. There were numerous alerts being triggered at the time, and unfortunately the chiller being offline did not receive the priority it should have. The RCA as to why that happened is still being investigated.

Microsoft states that redundant chiller systems are, of course, in place. However, the cooling systems were not set to fail over automatically: recently installed equipment had not been fully tested, so it was left in manual mode until testing was completed.

After 45 minutes the ambient cooling failed, hardware shut down, air handlers shut down because they thought there was a fire, and staff were evacuated due to the false fire alarm. During this time the temperature in the datacenter was increasing, and some hardware was not shut down properly, causing damage to some storage and networking equipment.

After manually resetting the chillers and opening the air handlers the temperature began to return to normal. It took about 3 hours and 29 minutes before they had a complete picture of the status of the datacenter.

The biggest issue was that there was damage to storage. Microsoft’s primary concern is data protection, so short of the entire datacenter sinking into a sinkhole or a meteor strike taking out the datacenter, Microsoft will work to recover data to ensure no data loss. This of course took some time, which extended the overall length of the outage. The good news is that no customer data was lost; the bad news is that it seemed to take 24-48 hours for things to return to normal, based upon what I read on Twitter from customers complaining about the prolonged outage.

Everyone expected that this outage would impact customers hosted in the South Central Region, but what they did not expect was that the outage would have an impact outside of that region. In the session, Microsoft discusses some of the extended reach of the outage.

Azure Service Manager (ASM) – This controls Azure “Classic” resources, AKA pre-ARM resources. Anyone relying on ASM could have been impacted. It wasn’t clear to me why this happened, but it appears that the South Central region hosts some important components of that service, which became unavailable.

Visual Studio Team Services (VSTS) – Again, it appears that many resources that support this service are hosted in the South Central region. This outage is described in great detail by Buck Hodges (@tfsbuck), Director of Engineering, Azure DevOps, in this blog post:

Postmortem: VSTS 4 September 2018

Azure Active Directory (AAD) – When the South Central region failed, AAD did what it was designed to do and started directing authentication requests to other regions. As the East Coast started to wake up and come online, authentication traffic picked up. Normally AAD would handle this increase through autoscaling, but autoscaling has a dependency on ASM, which of course was offline. Without the ability to autoscale, AAD could not handle the increase in authentication requests. Exacerbating the situation was a bug in Office clients that gave them very aggressive retry logic with no backoff. This additional authentication traffic eventually brought AAD to its knees.
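To illustrate what was missing, here is a minimal sketch of retry logic with exponential backoff and jitter, the pattern that prevents thousands of clients from hammering a struggling service in lockstep. The function and parameter names are mine for illustration, not anything from the actual Office clients:

    function Invoke-WithBackoff {
        param(
            [scriptblock]$Action,       # the call to retry, e.g. an authentication request
            [int]$MaxRetries  = 5,
            [int]$BaseDelayMs = 500
        )
        for ($attempt = 0; $attempt -le $MaxRetries; $attempt++) {
            try {
                return & $Action
            }
            catch {
                if ($attempt -eq $MaxRetries) { throw }
                # Exponential backoff (0.5s, 1s, 2s, 4s...) plus random jitter
                # so clients don't all retry at the same moment
                $delay = ($BaseDelayMs * [math]::Pow(2, $attempt)) + (Get-Random -Maximum 250)
                Start-Sleep -Milliseconds $delay
            }
        }
    }

    # Example: retry a flaky call instead of immediately hammering the endpoint
    Invoke-WithBackoff -Action { Invoke-RestMethod -Uri "https://login.example.com/token" }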

They ran out of time to discuss this further during the Ignite session, but one feature they will be introducing is the ability for users to fail over Storage Accounts manually. So in cases where the recovery time objective (RTO) is more important than the recovery point objective (RPO), the user will have the ability to recover their asynchronously replicated geo-redundant storage in an alternate datacenter should Microsoft experience another extended outage in the future.

Until that time, you will have to rely on other replication solutions such as SIOS DataKeeper, Azure Site Recovery, or application-specific replication solutions, which give you the ability to replicate data across regions and put the power to enact your disaster recovery plan in your own hands.

 

 


Azure Outage Post-Mortem Part 2

My previous blog post said that Cloud-to-Cloud or Hybrid-Cloud would give you the most isolation from just about any issue a CSP could encounter. However, in this particular failure, had Availability Zones been available in the South Central region, most of the downtime caused by this natural disaster could have been avoided. Microsoft has published a Preliminary RCA of the September 4th South Central outage.

The most important part of that whole summary is as follows…

“Despite onsite redundancies, there are scenarios in which a datacenter cooling failure can impact customer workloads in the affected datacenter.”

What does that mean to you? If your applications all run in the same datacenter, you are susceptible to the same type of outage in the future. In Microsoft’s defense, this really shouldn’t be news to you, as it has always been true whether you run in Azure, AWS, Google, or your own datacenter. If you have not replicated your data to a different datacenter, with a plan in place to quickly recover your applications there in the event of a disaster, that is simply a lack of planning on your part.

While Microsoft doesn’t publish exact Availability Zone locations, if you believe the map published here, you could guess that they are probably anywhere from 2 to 10 miles apart from each other.

[Image: map of Azure datacenter locations]

In all but the most extreme cases, replicating data across Availability Zones should be sufficient for data protection. Some applications, such as SQL Server, have built-in replication technology, but for a broad range of applications, operating systems, and data types you will want to investigate block-level replication SANless cluster solutions. SANless clusters have traditionally been used for multisite clusters, but the same technology can also be used in the cloud across Availability Zones, regions, or hybrid-cloud configurations for high availability and disaster recovery.

Implementing a SANless cluster that spans Availability Zones, whether it is Azure, AWS or Google, is a pretty simple process given the right tools. Here are a few resources to help get you started.

Step-by-Step: Configuring a File Server Cluster in Azure that Spans Availability Zones

How to Build a SANless SQL Server Failover Cluster Instance in Google Cloud Platform

MS SQL Server v.Next on Linux with Replication and High Availability #Azure #Cloud #Linux

Deploying Microsoft SQL Server 2014 Failover Clusters in #Azure Resource Manager (ARM)

SANless SQL Server Clusters in AWS

SANless Linux Cluster in AWS Quick Start

If you are in Azure you may also want to consider Azure Site Recovery (ASR). ASR lets you replicate the entire VM from one Azure region to another region. ASR will replicate your VMs in real-time and allow you to do a non-disruptive DR test whenever you like. It supports most versions of Windows and Linux and is relatively easy to set up.

You can also create replication jobs that have “Multi-VM Consistency”, meaning that servers that must be recovered from the exact same point in time can be placed together in a consistency group and will share the exact same recovery point. This means that if you build a SANless cluster with DataKeeper in a single region for high availability, you have two options for DR: extend the SANless cluster to a node in a different region, or simply use ASR to replicate both nodes in a consistency group.


The trade-off with ASR is that the RPO and RTO are not as good as what you get with a SANless multisite cluster, but it is easy to configure and works with just about any application. Just be careful: if your application regularly exceeds 10 MBps of disk write activity, ASR will not be able to keep up. Also, clusters based on Storage Spaces Direct cannot be replicated with ASR and in general lack a good DR strategy when used in Azure.
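If you aren’t sure whether your workload fits under that limit, one quick way to check is to sample the disk write counters on the volume you plan to replicate. A sketch, where the drive letter and sample window are placeholders:

    # Sample writes to the F: volume every 5 seconds for 5 minutes
    $samples = Get-Counter -Counter '\LogicalDisk(F:)\Disk Write Bytes/sec' `
        -SampleInterval 5 -MaxSamples 60

    $avgMBps = ($samples.CounterSamples.CookedValue |
        Measure-Object -Average).Average / 1MB

    "Average write throughput: {0:N2} MBps" -f $avgMBps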

After Managed Disks were released, it took about a year before ASR fully supported them, and that was a big hurdle for many people looking to use ASR. Fortunately, since about February of 2018, ASR fully supports Managed Disks. However, another gap has just been introduced.

With the introduction of Availability Zones, ASR is once again caught behind the times: it does not currently support VMs that have been deployed in Availability Zones.

[Image: Support matrix for replicating from one Azure region to another]

I went ahead and tried it anyway. I seemed to be able to configure replication and was able to do a test failover.

[Image: Using ASR to replicate SQL1 and SQL3 from Central US to East US 2, followed by a test failover. Other than not placing the VMs in Availability Zones in East US 2, it seems to work.]

I’m hoping to find out more about this limitation at the Ignite conference. I don’t think it is as critical as the Managed Disks gap was, simply because Availability Zones aren’t widely available yet. Hopefully ASR will pick up support for Availability Zones as more regions light them up and adoption grows.

 

 


Quick Start Guide: SQL Server Clusters on Windows Server 2008 R2 in Azure

Apparently Windows Server 2008 R2 lives on in the cloud, as I still get calls about it sporadically. Yes, Azure does support Windows Server 2008 R2 and older versions of SQL Server, including 2008 R2 and 2012. Of course, Always On Availability Groups weren’t introduced until SQL Server 2012, and even then you probably want to avoid Availability Groups due to some of the performance issues associated with that version.

If you find yourself needing to support older versions of SQL Server or Windows you will want to build SANless clusters based on SIOS DataKeeper as mentioned in the Azure documentation.

https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sql/virtual-machines-windows-sql-high-availability-dr

I have written many Quick Start Guides over the years, but sometimes I just want to give someone the 10,000-foot overview of the steps so they have a general idea before they sit down and roll up their sleeves to do an install. Since it is not every day that I’m dealing with Windows Server 2008 R2 clusters in Azure, I wanted to publish this overview to share with my customers.

In a nutshell here are the steps to cluster SQL Server (any version supported on Windows 2008 R2) in Azure.

  • Provision two cluster servers and a file share witness in the same Availability Set. This places all three quorum votes in different Fault and Update Domains.
  • There is a hotfix for SQL 2008 R2 clusters in Azure to enable the listener used by both AGs and FCIs. https://support.microsoft.com/en-us/help/2854082/update-enables-sql-server-availability-group-listeners-on-windows-serv
  • Install that and all other OS updates.
  • Provision the storage on each server.
  • Format NTFS and give drive letters.
  • Each cluster node needs identical storage.
  • Enable Failover Clustering and the .NET Framework 3.5 feature on each server
  • Add the servers to the domain
  • Create the basic cluster, but USE POWERSHELL and specify the cluster IP address. If you use the GUI to create the cluster, it will get confused and provision a duplicate IP address, and you will only be able to connect to the cluster from one of the nodes. Once connected, you can correct the problem by specifying a static IP address for the cluster IP resource.

    Here is an example of the PowerShell command to create the cluster:

    New-Cluster -Name cluster1 -Node sql1,sql2 -StaticAddress 10.0.0.101 -NoStorage
  • Add a File Share Witness to the cluster
  • Install DataKeeper on both cluster nodes
  • Create the DataKeeper Volume Resources and make sure they are Available Storage
  • Install SQL into the cluster as you normally would in a shared storage cluster.
  • Configure the Azure ILB and run the PowerShell script to update the SQL cluster IP resource to listen on the probe port (a rough sketch of the ILB setup follows this list)
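To give you a feel for the ILB piece, here is a rough sketch using the AzureRM PowerShell module. The resource group, names, IP, and region are placeholders; $subnet is assumed to hold the target subnet object (e.g., from Get-AzureRmVirtualNetwork); and the probe port must match the one you set on the cluster IP resource. You would still need to add the cluster VMs’ NICs to the backend pool afterwards:

    # Front-end IP: the static address your SQL clients will connect to
    $feIP = New-AzureRmLoadBalancerFrontendIpConfig -Name "SQL-FrontEnd" `
        -PrivateIpAddress "10.0.0.101" -SubnetId $subnet.Id

    $bePool = New-AzureRmLoadBalancerBackendAddressPoolConfig -Name "SQL-BackEnd"

    # Health probe that the cluster IP resource listens on
    $probe = New-AzureRmLoadBalancerProbeConfig -Name "SQL-Probe" -Protocol Tcp `
        -Port 59999 -IntervalInSeconds 5 -ProbeCount 2

    # Floating IP (Direct Server Return) so traffic reaches the cluster IP
    $rule = New-AzureRmLoadBalancerRuleConfig -Name "SQL-Rule" `
        -FrontendIpConfiguration $feIP -BackendAddressPool $bePool -Probe $probe `
        -Protocol Tcp -FrontendPort 1433 -BackendPort 1433 -EnableFloatingIP

    New-AzureRmLoadBalancer -ResourceGroupName "SQL-RG" -Name "SQL-ILB" `
        -Location "eastus2" -FrontendIpConfiguration $feIP `
        -BackendAddressPool $bePool -Probe $probe -LoadBalancingRule $rule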

All of this is fully documented on the SIOS documentation page, Deploying DataKeeper Cluster Edition in Azure

Let me know if this helped you or if you have any questions about high availability for SQL Server or disaster recovery in Azure, AWS or Google Cloud.


Lightning Never Strikes Twice: Surviving the #Azure Cloud Outage

Yesterday morning I opened my Twitter feed to find that many people were impacted by an Azure outage. When I tried to access the resource page that described the outage and the resources currently impacted, even that page was unavailable. @AzureSupport was providing updates via Twitter.

The original update from @AzureSupport came in at 7:12 AM EDT

[Image: @AzureSupport status update, 7:12 AM EDT]

Looking back on the Twitter feed it seems as if the problem initially began an hour or two before that.

[Image: earlier @AzureSupport updates on the Twitter timeline]

It quickly became apparent that the outage had a more widespread impact than just the SOUTH CENTRAL US region as originally reported. It seems services that relied on Azure Active Directory could have been impacted as well, and customers trying to provision new subscriptions were having issues.

[Image: @AzureSupport update on the wider impact]

And 24 hours later, the problem had still not been completely resolved, according to the latest update this morning…

[Image: @AzureSupport update from this morning]


So what could you have done to minimize the impact of this outage? No one can blame Microsoft for a natural disaster such as a lightning strike. But at the end of the day if your only disaster recovery plan is to call, tweet and email Microsoft until the issue is resolved, you just received a rude awakening. IT IS UP TO YOU to ensure you have covered all the bases when it comes to your disaster recovery plan.

While the dust is still settling on exactly what was impacted and what customers could have done to minimize the downtime, here are some of my initial thoughts.

Availability Sets (Fault Domains/Update Domains) – In this scenario, even if you had built failover clusters or leveraged Azure Load Balancers and Availability Sets, it seems the entire region went offline, so you still would have been out of luck. It is still recommended to leverage Availability Sets, especially for planned downtime, but in this case they would not have kept you online.

Availability Zones – While not available in the SOUTH CENTRAL US region yet, the Availability Zones being rolled out in Azure could have minimized the impact of the outage. Assuming the lightning strike only impacted one datacenter, the datacenters in the other Availability Zones should have remained operational. However, the outages of non-regional services such as Azure Active Directory (AAD) seem to have impacted multiple regions, so I don’t think Availability Zones would have isolated you completely.

Global Load Balancers, Cross-Region Failover Clusters, etc. – Whether you are building SANless clusters that cross regions or using global load balancers to spread the load across multiple regions, you may have minimized the impact of the outage in SOUTH CENTRAL US, but you may still have been susceptible to the AAD outage.

Hybrid-Cloud, Cross-Cloud – About the only way you could guarantee resiliency in a cloud-wide failure scenario such as the one Azure just experienced is to have a DR plan that includes real-time replication of data to a target outside of your primary cloud provider, and a plan in place to bring applications online quickly in that other location. The two locations should be entirely independent and should not rely on services from your primary location, such as AAD, being available. The DR location could be another cloud provider, in this case AWS or Google Cloud Platform seem like logical alternatives, or it could be your own datacenter, but that kind of defeats the purpose of running in the cloud in the first place.

Software as a Service – While SaaS offerings such as Azure Active Directory (AAD), Azure SQL Database (Database-as-a-Service), or any of the many SaaS products from the cloud providers can seem enticing, you really need to plan for the worst-case scenario. Because you are trusting a business-critical application to a single vendor, you may have very little control in terms of DR options that include recovery OUTSIDE of the current cloud service provider. I don’t have any words of wisdom here other than to investigate your DR options before implementing any SaaS service, and if recovery outside of the cloud is not an option, then think long and hard before you sign up for that service. Minimally, make the business stakeholders aware that if the cloud service provider has a really bad day and that service is offline, there may be nothing you can do about it other than call and complain.

I think in the very near future you will start to hear more and more about cross-cloud availability and people leveraging solutions like SIOS DataKeeper to build robust HA and DR strategies that span cloud providers. Cross-cloud or hybrid-cloud models are the only way to truly insulate yourself from most conceivable cloud outages.

If you were impacted from this latest outage I’d love to hear from you. Tell me what went down, how long you were down, and what you did to recover. What are you planning to do so that in the future your experience is better?


“Incomplete Communication with Cluster” with local Storage Space for SQL Server cluster

When building a SANless SQL Server cluster with SIOS DataKeeper, or when configuring Always On Availability Groups for SQL Server, you may consider striping together multiple disks in a Simple Storage Space (RAID 0) for performance. This is very commonly done in the cloud, where each instance typically is backed by hardware resiliency, so RAID 0 is not really all that risky.

For instance, I had a recent customer in AWS who wanted to max out IOPS at 80,000, the maximum currently available to a single instance. Keep in mind, only the largest EBS-optimized instance sizes support 80,000 IOPS, so you want to make sure you know the maximum IOPS your particular instance size supports.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html

In this case we had a c5.18xlarge instance, which does support 80,000 IOPS. However, any individual EBS Provisioned IOPS volume only supports up to 32,000 IOPS. The only way to achieve 80,000 IOPS when writing to a single volume is to stripe three of these volumes together in a Simple Storage Space (3 × 32,000 IOPS = 96,000 IOPS, comfortably above the 80,000 IOPS instance limit).

Herein lies the rub: if you try to do that in an existing cluster, things are going to go haywire pretty fast. Fellow MVP Joey D’Antoni recently blogged about the issue, and it appears to still be a problem in the Windows Server 2019 preview.

Just as Joey suggests, I always advise my customers to build out the nodes and any Storage Spaces BEFORE they start the clustering process. This makes the process go much more smoothly. It also gives the customer some time to benchmark the server’s performance before adding any replication, to ensure everything is working as expected.
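For what it’s worth, creating the Simple Storage Space itself is only a few lines of PowerShell once the EBS volumes are attached, as long as you do it before the cluster exists. A sketch, with the pool, disk, and drive-letter names as placeholders:

    # Pool every disk that is eligible for pooling
    $disks = Get-PhysicalDisk -CanPool $true
    New-StoragePool -FriendlyName "SQLPool" `
        -StorageSubSystemFriendlyName "Windows Storage*" -PhysicalDisks $disks

    # Simple (striped / RAID 0) virtual disk across all the pooled disks
    New-VirtualDisk -StoragePoolFriendlyName "SQLPool" -FriendlyName "SQLData" `
        -ResiliencySettingName Simple -UseMaximumSize -NumberOfColumns $disks.Count

    # Initialize, partition, and format with a 64K allocation unit for SQL Server
    Get-VirtualDisk -FriendlyName "SQLData" | Get-Disk |
        Initialize-Disk -PartitionStyle GPT -PassThru |
        New-Partition -DriveLetter F -UseMaximumSize |
        Format-Volume -FileSystem NTFS -AllocationUnitSize 65536 -Confirm:$false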

 

 
