How to use Azure Site Recovery (ASR) to replicate a Windows Server Failover Cluster (WSFC) that uses SIOS DataKeeper for cluster storage

Intro

So you have built a SQL Server Failover Cluster Instance (FCI), or maybe an SAP ASCS/ERS cluster, in Azure. Each node of the cluster resides in a different Availability Zone (AZ), or maybe you have strict latency requirements and are using Proximity Placement Groups (PPGs), so your nodes all reside in the same Availability Set. Regardless of the scenario, you now have a much higher level of availability for your business critical application than if you were just running a single instance.

Now that you have high availability (HA) covered, what are you going to do for disaster recovery? Regional disasters that take out multiple AZs are rare, but as recent history has shown us, Mother Nature can really pack a punch. You want to be prepared should an entire region go offline. 

Azure Site Recovery (ASR) is Microsoft’s disaster recovery-as-a-service (DRaaS) offering that allows you to replicate entire VMs from one region to another. It can also replicate virtual machines and physical servers from on-prem into Azure, but for the purpose of this blog post we will focus on the Azure Region-to-Region DR capabilities.

Setting up ASR

We are going to assume you have already built your cluster using SIOS DataKeeper. If not, here are some pointers to help get you started.

Failover Cluster Instances with SQL Server on Azure VMs

SIOS DataKeeper Cluster Edition for the SAP ASCS/SCS cluster share disk

We are also going to assume you are familiar with Azure Site Recovery. Instead of yet another guide on setting up ASR, I suggest you read the latest documentation from Microsoft. This article will focus instead on some things you may not have considered and the specific steps required to fix your cluster after a failover to a different subnet.

Paired Regions

Before you start down the DR path, you should be aware of the concept of Azure Paired Regions. Every Region in Azure has a preferred DR Region. If you want to learn more about Paired Regions, the documentation provides a great background. There are some really good benefits of using your paired region, but it’s really up to you to decide on what region you want to use to host your DR site.

Cloud Witness Location

When you originally built your cluster you had to choose a witness type for your quorum. You may have selected a File Share Witness or a Cloud Witness. Typically either of those witness types should reside in an AZ that is separate from your cluster nodes. 

However, when you consider that, in the event of a disaster, your entire cluster will be running in your DR region, there is a better option. You should use a cloud witness and place it in your DR region. By placing your cloud witness in your DR region, you provide resiliency not only against local AZ failures, but also against the failure of the entire region, when you would have to use ASR to recover your cluster in the DR region. Through the magic of Dynamic Quorum and Dynamic Witness, you can be sure that even if your DR region goes offline temporarily, it will not impact your production cluster.
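If you want to script this, configuring a cloud witness is a single PowerShell command run from any cluster node. This is just a sketch; the storage account name and access key below are placeholders for a storage account that you create in your DR region.

# Point the cluster quorum at a storage account that lives in the DR region
# "drwitnessstorage" and the access key are hypothetical values; substitute your own
Set-ClusterQuorum -CloudWitness -AccountName "drwitnessstorage" -AccessKey "<storage account access key>"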

Multi-VM Consistency

When using ASR to replicate a cluster, it is important to enable Multi-VM Consistency to ensure that each cluster node’s recovery point is from the same point in time. That ensures that the DataKeeper block level replication occurring between the VMs will be able to continue after recovery without requiring a complete resync.

Crash Consistent Recovery Points

Application consistent recovery points are not supported in replicated clusters. When configuring the ASR replication options do not enable application consistent recovery points.

Keep IP Address After Failover?

When using ASR to replicate to your DR site there is a way to keep the IP address of the VMs the same. Microsoft described it in the article entitled Retain IP addresses during failover. If you can keep the IP address the same after failover it will simplify the recovery process since you won’t have to fix any cluster IP addresses or DataKeeper mirror endpoints, which are based on IP addresses.

However, in my experience, I have never seen anyone actually follow the guidance above, so recovering a cluster in a different subnet will require a few additional steps after recovery before you can bring the cluster online.

Your First Failover Attempt

Recovery Plan

Because you are using Multi-VM Consistency, you have to failover your VMs using a Recovery Plan. The documentation provides pretty straightforward guidance on how to do that. A Recovery Plan groups the VMs you want to recover together to ensure they all failover together. You can even add multiple groups of VMs to the same Recovery Plan to ensure that your entire infrastructure fails over in an orderly fashion.

A Recovery Plan can also launch post-recovery scripts to help complete the recovery successfully. The steps I describe below can all be scripted as part of your Recovery Plan, thereby fully automating the entire recovery process. We will not be covering that in this blog post, but Microsoft documents the process.

Static IP Addresses

As part of the recovery process you want to make sure the new VMs have static IP addresses. You will have to adjust the interface properties in the Azure Portal so that the VM always uses the same address. If you want to add a public IP address to the interface you should do so at this time as well.
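If you prefer to script this step instead of using the portal, an Az PowerShell sketch like the following works. The NIC name, resource group, and IP address are hypothetical; substitute your own values.

# Pin the recovered VM's NIC to a static private IP address (hypothetical names and address)
$nic = Get-AzNetworkInterface -Name "sql1-dr-nic" -ResourceGroupName "DR-RG"
$nic.IpConfigurations[0].PrivateIpAllocationMethod = "Static"
$nic.IpConfigurations[0].PrivateIpAddress = "10.1.0.10"
$nic | Set-AzNetworkInterface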

Network Configuration

After the replicated VMs are successfully recovered in the DR site, the first thing you want to do is verify basic connectivity. Is the IP configuration correct? Are the instances using the right DNS server? Is name resolution functioning correctly? Can you ping the remote servers?

If there are any problems with network communications then the rest of the steps described below will be bound to fail. Don’t skip this step!
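A few quick PowerShell checks from each recovered node will cover most of these questions. The server name and ports below are examples only; use your own node names and the ports your DataKeeper mirrors are configured to use.

# Basic post-failover sanity checks (example names and ports)
Get-NetIPConfiguration                      # verify IP address, gateway and DNS server settings
Resolve-DnsName sql2.contoso.local          # confirm name resolution of the other cluster node
Test-NetConnection sql2 -Port 445           # confirm basic connectivity between nodes
Test-NetConnection sql2 -Port 9999          # replace with the port(s) your DataKeeper mirrors use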

Load Balancer

As you probably know, clusters in Azure require you to configure a load balancer for client connectivity to work. The load balancer does not fail over as part of the Recovery Plan. You need to build a new load balancer based on the cluster that now resides in this new vNet. You can do this manually or script this as part of your Recovery Plan to happen automatically.
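As a rough sketch of what that script might look like, the following Az PowerShell creates an internal Standard Load Balancer with an HA-ports rule. All of the names, the subnet, and the frontend IP are hypothetical, and the probe port must match the probe port you configure on your cluster IP resources.

# Create an internal Standard Load Balancer for the recovered cluster (hypothetical values)
$vnet = Get-AzVirtualNetwork -Name "dr-vnet" -ResourceGroupName "DR-RG"
$subnet = Get-AzVirtualNetworkSubnetConfig -Name "cluster-subnet" -VirtualNetwork $vnet
$fe = New-AzLoadBalancerFrontendIpConfig -Name "cluster-fe" -PrivateIpAddress "10.1.0.101" -SubnetId $subnet.Id
$be = New-AzLoadBalancerBackendAddressPoolConfig -Name "cluster-be"
$probe = New-AzLoadBalancerProbeConfig -Name "cluster-probe" -Protocol Tcp -Port 59999 -IntervalInSeconds 5 -ProbeCount 2
$rule = New-AzLoadBalancerRuleConfig -Name "cluster-rule" -FrontendIpConfiguration $fe -BackendAddressPool $be -Probe $probe -Protocol All -FrontendPort 0 -BackendPort 0 -EnableFloatingIP
New-AzLoadBalancer -Name "dr-cluster-ilb" -ResourceGroupName "DR-RG" -Location "westus2" -Sku Standard -FrontendIpConfiguration $fe -BackendAddressPool $be -Probe $probe -LoadBalancingRule $rule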

Network Security Groups 

Running in this new subnet also means that you have to specify what Network Security Group you want to apply to these instances. You have to make sure the instances are able to communicate across the required ports. Again, you can do this manually, but it would be better to script this as part of your Recovery Plan.
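A minimal Az PowerShell sketch of that, assuming an existing NSG named "dr-nsg" and a DR subnet of 10.1.0.0/24 (both hypothetical), might look like this:

# Allow cluster and DataKeeper traffic between the DR cluster nodes (hypothetical names and ranges)
$nsg = Get-AzNetworkSecurityGroup -Name "dr-nsg" -ResourceGroupName "DR-RG"
Add-AzNetworkSecurityRuleConfig -NetworkSecurityGroup $nsg -Name "Allow-Cluster-Traffic" `
    -Priority 200 -Direction Inbound -Access Allow -Protocol * `
    -SourceAddressPrefix "10.1.0.0/24" -SourcePortRange * `
    -DestinationAddressPrefix "10.1.0.0/24" -DestinationPortRange *
Set-AzNetworkSecurityGroup -NetworkSecurityGroup $nsg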

Fix the Cluster IP Addresses

If you are unable to make the changes described earlier to recover your instances in the same subnet, you will have to complete the following steps to update your cluster IP addresses and the DataKeeper addresses for use in the new subnet.

Every cluster has a core cluster IP address. If you launch the WSFC UI after a failover, you will see that it cannot connect to the cluster. This is because the IP address used by the cluster is not valid in the new subnet.

If you open the properties of that IP Address resource you can change the IP address to something that works in the new subnet. Make sure to update the Network and Subnet Mask as well.

Once you fix that IP Address you will have to do the same thing for any other cluster address that you use in your cluster resources.
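You can also make these changes with PowerShell instead of the WSFC UI. The resource name, address, subnet mask, and network name below are examples; adjust them to match your environment.

# Update a cluster IP address resource for the DR subnet (hypothetical values)
$IPResourceName = "Cluster IP Address"
Get-ClusterResource $IPResourceName | Set-ClusterParameter -Multiple @{Address="10.1.0.101";SubnetMask="255.255.255.0";Network="Cluster Network 1";EnableDhcp=0}
# The change takes effect the next time the resource is brought online
Stop-ClusterResource $IPResourceName
Start-ClusterResource "Cluster Name"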

Fix the DataKeeper Mirror Addresses

SIOS DataKeeper mirrors use IP addresses as mirror endpoints. These are stored in the mirror and mirror job. If you recover a DataKeeper based cluster in a different subnet, you will see that the mirror comes up in a Resync Pending state. You will also notice that the Source IP and the Target IP reflect the original subnet, not the subnet of the DR site.

Fixing this issue involves running a command from SIOS called CHANGEMIRRORENDPOINTS. The usage for CHANGEMIRRORENDPOINTS is as follows.

emcmd <NEW source IP> CHANGEMIRRORENDPOINTS <volume letter> <ORIGINAL target IP> <NEW source IP> <NEW target IP>
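For illustration, if the mirrored volume is E, the original target IP was 10.0.0.11, and the recovered source and target nodes now use 10.1.0.10 and 10.1.0.11 in the DR subnet (all hypothetical addresses), the command would look something like this:

emcmd 10.1.0.10 CHANGEMIRRORENDPOINTS E 10.0.0.11 10.1.0.10 10.1.0.11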

In our example, the command and output looked like this.

After the command runs, the DataKeeper GUI will be updated to reflect the new IP addresses as shown below. The mirror will also go to a mirroring state.

Conclusions

You have now successfully configured and tested disaster recovery of your business critical applications using a combination of SIOS DataKeeper for high availability and Azure Site Recovery for disaster recovery. If you have questions please leave me a comment or reach out to me on Twitter @daveberm


Performance of Azure Shared Disk with Zone Redundant Storage (ZRS)

On September 9th, 2021, Microsoft announced the general availability of Zone-Redundant Storage (ZRS) for Azure Disk Storage, including Azure Shared Disk.

What makes this interesting is that you can now build shared storage based failover cluster instances that span Availability Zones (AZ).  With cluster nodes residing in different AZs, users can now qualify for the 99.99% availability SLA. Prior to support for ZRS, Azure Shared Disks only supported Locally Redundant Storage (LRS), limiting cluster deployments to a single AZ, leaving users susceptible to outages should an AZ go offline.

There are however a few limitations to be aware of when deploying an Azure Shared Disk with ZRS.

  • Only supported with premium solid-state drives (SSD) and standard SSDs. Azure Ultra Disks are not supported.
  • Azure Shared Disks with ZRS are currently only available in the West US 2, West Europe, North Europe, and France Central regions.
  • Disk caching, both read and write, is not supported with Premium SSD Azure Shared Disks.
  • Disk bursting is not available for premium SSDs.
  • Azure Site Recovery support is not yet available.
  • Azure Backup is available through Azure Disk Backup only.
  • Only server-side encryption is supported, Azure Disk Encryption is not currently supported.

I also found an interesting note in the documentation.

“Except for more write latency, disks using ZRS are identical to disks using LRS, they have the same scale targets. Benchmark your disks to simulate the workload of your application and compare the latency between LRS and ZRS disks.”

While the documentation indicates that ZRS will incur some additional write latency, it is up to the user to determine just how much additional latency they can expect. A link to a disk benchmark document is provided to help guide you in your performance testing.

Following the guidance in the document, I used DiskSpd to measure the additional write latency you might experience. Of course, results will vary with workload, disk type, instance size, etc., but here are my results.

                        Locally Redundant Storage (LRS)    Zone Redundant Storage (ZRS)
Write IOPS              5099.82                            4994.63
Average Latency (ms)    7.830                              7.998

The DiskSpd test that I ran used the following parameters. 


diskspd -c200G -w100 -b8K -F8 -r -o5 -W30 -d10 -Sh -L testfile.dat

I wrote to a P30 disk with ZRS and a P30 with LRS attached to a Standard DS3 v2 (4 vcpus, 14 GiB memory) instance type. The shared ZRS P30 was also attached to an identical instance in a different AZ and added as shared storage to an empty cluster application.

A 2% overhead seems like a reasonable price to pay to have your data distributed synchronously across two AZs. However, I did wonder what would happen if you moved the clustered application to the remote node, effectively putting your disk in one AZ and your instance in a different AZ. 

Here are the results.

                        Locally Redundant Storage (LRS)    Zone Redundant Storage (ZRS)    ZRS when writing from the remote AZ
Write IOPS              5099.82                            4994.63                         4079.72
Average Latency (ms)    7.830                              7.998                           9.800

In that scenario I measured a 25% write latency increase. If you experience a complete failure of an AZ, both the storage and the instance will failover to the secondary AZ and you shouldn’t experience this increase in latency at all. However, other failure scenarios that aren’t AZ wide could very well have your clustered application running in one AZ with your Azure Shared Disk running in a different AZ. In those scenarios you will want to move your clustered workload back to a node that resides in the same AZ as your storage as soon as possible to avoid the additional overhead.

Microsoft documents how to initiate a storage account failover to a different region when using GRS, but there is no way to manually initiate the failover of a storage account to a different AZ when using ZRS. You should monitor your failover cluster instance to ensure you are alerted any time a cluster workload moves to a different server and plan to move it back just as soon as it is safe to do so.

You can find yourself in this situation unexpectedly, but it will also certainly happen during planned maintenance of the clustered application servers when you do a rolling update. Awareness is the key to help you minimize the amount of time your storage is performing in a degraded state.

I hope in the future Microsoft allows users to initiate a manual failover of a ZRS disk the same as they do with GRS. The reason they added the feature to GRS was to put the power in the hands of the users in case automatic failover did not happen as expected. In the case of ZRS I could see people wanting to try to tie together storage and application, ensuring they are always running in the same AZ, similar to how host based replication solutions like SIOS DataKeeper do it.


Clustering SAP #ASCS and #ERS on #AWS using Windows Server Failover Clustering

When ensuring high availability for SAP ASCS and ERS running on Windows Server, the primary cluster solution you will want to use is Windows Server Failover Clustering. However, when doing this in AWS you will quickly discover that there are a few obstacles you need to know how to overcome.

I recently wrote a step-by-step guide, published on the SAP blog, that walks you through the entire process. If you have any questions, please leave a comment.


Using Amazon FSx for SQL Server Failover Cluster Instances – What You Need to Know!

Intro

If you are considering deploying your own Microsoft SQL Server instances in AWS EC2 you have some decisions to make regarding the resiliency of the solution. Sure, AWS will offer you a 99.99% SLA on your Compute resources if you deploy two or more instances across different availability zones. But don’t be fooled, there are a lot of other factors you need to consider when calculating your true application availability. I recently blogged about how to calculate your application availability in the cloud. You probably should have a quick read of that article before you move on.

When it comes to ensuring your Microsoft SQL Server instance is highly available, it really comes down to two basic choices: Always On Availability Group (AG) or SQL Server Failover Cluster Instance (FCI). If you are reading this article I’m making an assumption you are well aware of both of these options and are seriously considering using a SQL Server FCI instead of a SQL Server Always On AG.

Benefits of a Microsoft SQL Server Failover Cluster Instance

The following list summarizes what AWS says are the benefits of a SQL Server FCI:

FCI is generally preferable over AG for SQL Server high availability deployments when the following are priority concerns for your use case:

License cost efficiency: You need the Enterprise Edition license of SQL Server to run AGs, whereas you only need the Standard Edition license to run FCIs. This is typically 50–60% less expensive than the Enterprise Edition. Although you can run a Basic version of AGs on Standard Edition starting from SQL Server 2016, it carries the limitation of supporting only one database per AG. This can become a challenge when dealing with applications that require multiple databases like SharePoint.

Instance-level protection versus database-level protection: With FCI, the entire instance is protected – if the primary node becomes unavailable, the entire instance is moved to the standby node. This takes care of the SQL Server logins, SQL Server Agent jobs, certificates, etc. that are stored in the system databases, which are physically stored in shared storage. With AG, on the other hand, only the databases in the group are protected, and system databases cannot be added to an AG – only user databases are allowed. It is the database administrator’s responsibility to replicate changes to system objects on all AG replicas. This leaves the possibility of human error causing the database to become inaccessible to the application.

DTC feature support: If you’re using SQL Server 2012 or 2014, and your application uses Distributed Transaction Coordinator (DTC), you are not able to use an AG as it is not supported. Use FCI in this situation.

https://aws.amazon.com/blogs/storage/simplify-your-microsoft-sql-server-high-availability-deployments-using-amazon-fsx-for-windows-file-server/

Challenges with FCI in the Cloud

Of course, the challenge with building an FCI that spans availability zones is the lack of a shared storage device that is normally required when building a SQL Server FCI. Because the nodes of the cluster are distributed across multiple datacenters, a traditional SAN is not a viable option for shared storage. That leaves us with two choices for cluster storage: 3rd party storage class resources like SIOS DataKeeper or the new Amazon FSx. Let's take a look at what you need to know before you make your choice.

Service Level Agreement

As I wrote in how to calculate your application availability, your overall application SLA is only as good as your weakest link. In this case, the weakest link is the FSx SLA of 99.9%.

Normally 99.99% availability represents the starting point of what is considered “highly available”. This is what AWS promises you for your compute resources when two or more are deployed in different availability zones.

In case you didn’t know the difference between three nines and four nines…

  • 99.9% availability allows for 43.83 minutes of downtime per month
  • 99.99% availability allows for only 4.38 minutes of downtime per month

If you host your cluster storage on FSx, then despite your 99.99% compute availability, your overall application availability will be 99.9%. In contrast, EBS volumes replicated across availability zones, as in a DataKeeper deployment, qualify for the 99.99% SLA at both the storage and compute layers, meaning your overall application availability is 99.99%.

Storage Location

When configuring FSx for high availability, you will want to enable multi-AZ support. By enabling multi-AZ you effectively have a “preferred” AZ and a “standby” AZ. When you deploy your SQL Server FCI nodes you will want to distribute those nodes across the same AZs.

Now in normal situations, you will want to make sure the active cluster node resides in the same AZ as the preferred FSx storage node. This is to minimize the distance and latency to your storage, but also to minimize the costs associated with data transfer across AZs. As specified in the FSx price guide, “Standard data transfer fees apply for inter-AZ or inter-region access to file systems.”

In the unfortunate circumstance where you have a SQL Server FCI failure, but not an FSx failure, there is no mechanism to tie the storage and compute together. In the event that FSx fails over, it will automatically fail back to the primary availability zone. However, best practices dictate that the SQL Server FCI remain running on the secondary node until root cause analysis is performed, with failback typically scheduled during a maintenance period. This leaves you in a situation where your storage resides in a different AZ, which will incur additional costs. Currently the cost for transferring data across AZs, both ingress and egress, is $0.01/GB.

Without keeping a close eye on the state of your FSx and SQL Server FCI, you may not even be aware that they are running in different availability zones until you see the data transfer charge at the end of the month.

In contrast, in a configuration that uses SIOS DataKeeper, the storage failover is part of the SQL Server FCI recovery, ensuring that the storage always fails over with the SQL Server instance. This ensures SQL Server is always reading and writing to the EBS volumes that are directly attached to the active node. Keep in mind, DataKeeper will incur a data transfer cost associated with write operations which are replicated between AZs or regions. This data transfer cost can be minimized with the use of compression available in DataKeeper.

Controlling Failover

In an FSx multi-subnet configuration there is a preferred availability zone and a standby availability zone. Should the FSx file server in the preferred availability zone experience a failure, the file server in the standby AZ will take over. AWS reports that this recovery takes about 30 seconds with standard shares. With the use of continuously available file shares, Microsoft reports this failover time can be closer to 15 seconds. During this failover there is a brownout where reads and writes are paused, but they will continue once recovery completes.

FSx multi-site has automatic failback enabled, meaning that for every unplanned failover of FSx, you also have an unplanned failback. In contrast, when a SQL Server FCI experiences an unplanned failover, you would typically either just leave it running on the secondary node or schedule a failback after hours or during the next maintenance period.

SQL Server Analysis Services Cluster Not Supported with FSx

If you want to cluster SSAS, you will need a clustered disk resource like SIOS DataKeeper. The How to Cluster SQL Server Analysis Server white paper clearly states that SMB cannot be used and that cluster drives with drive letters must be used. In contrast, the DataKeeper Volume resource presents itself as a clustered disk and can be used with SSAS.

Summary

While FSx certainly can make sense for typical SMB uses like Windows user files and other non-critical services where a 99.9% availability SLA suffices, if your application requires high availability (99.99%) or an HA/DR solution that also spans regions, SIOS DataKeeper is the right fit.


Calculating Application Availability in the Cloud

When deploying business critical applications in the cloud you want to make sure they are highly available. The good news is that if you plan properly, you can achieve 99.99% (4-nines) of availability or more. However, calculating your true availability may not be as straightforward as it seems.

When considering availability you must consider the key components that make access to your application possible, which I’ll call the availability chain. The components of the availability chain are:

  • Compute
  • Network 
  • Storage
  • Application
  • Dependent services

Your application is only as available as your weakest link, and your expected downtime grows with each additional link you add to the chain. Let’s examine each of the links.

Compute Availability

Each of the three major cloud service providers has some similarities. One thing in common across all three platforms is the service level agreement (SLA) they will commit to for compute.

The SLA for all three public cloud providers for VMs when you have two or more VMs configured across different availability zones is 99.99%. Keep in mind, this SLA only guarantees the remote accessibility of one of the VMs at any given time; it makes no promises as to the availability of the services or application(s) running inside the VM. If you deploy a single VM within a single datacenter, this SLA varies from “90% of each hour” (AWS) to 99.5% (Azure and GCP) or 99.9% (Azure single VM when using Premium SSD).

True high availability starts at 99.99%, so the first step in ensuring your application is available is to distribute the application across two or more VMs that span availability zones. With two VMs spread across two availability zones giving you 99.99% availability of at least one of those VMs, you could theorize that three VMs spread across three availability zones would give you even greater availability. The cloud providers’ SLAs will never guarantee beyond 99.99% availability regardless of the number of availability zones in use, but if you use pure statistics you might conclude that your availability could jump as high as 99.999999%, or 8-nines of availability, which works out to about 26.30 milliseconds of downtime per month.

1-(.0001*.0001) = .99999999

99.999999% availability with three availability zones?

Don’t go around quoting that number, but keep in mind that if two availability zones can give you 99.99% availability, it stands to reason that three availability zones will give you something significantly better than 99.99%.

Compute is just one link in the availability chain. We still have to address network, storage and other dependent services, which all represent possible points of failure.

Network Availability

In order for your application to be available, every network hop between the client and the application and all the resources that the application depends on, must be available and working within tolerable latency ranges. You need to understand the network links between database servers, application servers, web servers and clients to know precisely where the network might fail. And remember, the more links in your availability chain the lower your overall availability will be.

Although network availability between VMs in the same vNet is covered under the standard compute SLA, there are other network services that you may be utilizing. Here are just a few examples of network services you could be utilizing that would impact overall application availability.

  • Express Route – 99.95%
  • VPN Gateway – 99.9% to 99.95%
  • Load Balancer – 99.99%
  • Traffic Manager – 99.99%
  • Elastic Load Balancer – 99.99%
  • Direct Connect – 99.9% to 99.99%

Building on what we have learned so far, let’s take a look at the availability of an application that is deployed across two availability zones. 

99.99% compute availability

99.99% load balancer availability

.9999 * .9999 = .9998

99.98% availability = ~9 minutes downtime per month

Now that we have addressed compute and network availability, let’s move on to storage.

Storage Availability

Now here is where the story gets a little hairy. Have a look at the following storage SLAs

https://azure.microsoft.com/en-us/support/legal/sla/storage/v1_5/

https://cloud.google.com/storage/sla

https://aws.amazon.com/compute/sla/

It seems pretty clear that Azure and Google are giving you a 99.9% SLA on block storage solutions. AWS doesn’t mention EBS specifically here; they only talk about VMs and measure their single-instance VM availability by the hour instead of by the month as the other cloud providers do. For the sake of discussion, let’s use the 99.9% availability guarantee that both Azure and GCP have published.

Building upon our previous example, let’s add some storage to the equation.

99.99% compute availability

99.99% load balancer availability

99.9% managed disk

.9999 * .9999 * .999 = .9988

99.88% availability = ~53 minutes of downtime per month.
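If you want to run the numbers for your own availability chain, a quick PowerShell sketch like this one (using the example SLAs above) does the math:

# Multiply the SLA of each link in the chain and convert the result to downtime per month
$slas = 0.9999, 0.9999, 0.999                              # compute, load balancer, managed disk
$availability = 1.0
foreach ($sla in $slas) { $availability *= $sla }
$downtimeMinutes = (1 - $availability) * 30.44 * 24 * 60   # using an average-length month
"{0:P2} availability = ~{1:N0} minutes of downtime per month" -f $availability, $downtimeMinutes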

53 minutes of downtime is a lot more than the 9 minutes of downtime we calculated in our previous example. What can we do to minimize the impact of the 99.9% storage availability? We have to build more redundancy in the storage! 

Fortunately, we usually include storage redundancy when planning for application availability. For instance, when we stand up web servers, each web server will typically store data on the locally attached disk. When deploying domain controllers, Microsoft Active Directory takes care of replicating AD information across all the domain controllers. In the case of something like SQL Server, we leverage things like Always On Availability Groups or SIOS DataKeeper to keep the data in sync across locally attached disks.

The more copies of the data we have distributed across different availability zones, the more likely we will be able to survive a failure.

For example, an application that stores its data across two different disks in different availability zones will benefit from the redundancy and instead of 99.9% availability it is more likely to achieve 99.9999% availability of the storage.

1 – (.001 * .001) = .999999

If we throw that into the previous equation the picture starts to look a little brighter.

.9999 * .9999 * .999999 = .9998

99.98% availability = ~9 minutes of downtime

By duplicating the data across multiple AZs, and therefore multiple disks, we have effectively mitigated the downtime associated with cloud storage.

Application and Dependent Services Availability

You’ve done all you can do to ensure compute, network, and storage availability. But what about the application itself? Some applications can scale out and provide redundancy by load balancing between multiple instances of the same application. Think of your typical web server farm where you may typically load balance web requests between five servers. If you lose one server the load balancer simply removes it from its rotation until it is once again responsive.

Other applications require a little more care and monitoring. Take SQL Server for instance. Typically Always On Availability Groups or Failover Cluster Instances are used to monitor database availability and take recovery actions should a database become unresponsive due to application or system level failures. While there is no published SLA for SQL Server availability solutions, it is commonly accepted that when configured properly for high availability, a SQL Server can provide 99.99% availability.

Other cloud-based services you may rely on, like hosted Active Directory, hosted DNS, microservices, or even the availability of the cloud portal itself, should all be factored into your overall availability equation.

Summary

Application availability is the sum of all the moving parts. Skimping in just one area can exponentially impact the overall availability of your application. Take your time and investigate all the links in your availability chain for weaknesses, including compute, network, storage, application and dependent services.

In general, the numbers presented here are hopefully worst case scenarios, and your actual availability should exceed the published SLAs. Do your homework and be wary of any service that cannot guarantee 99.99% availability, the typical threshold of what is considered highly available.

Human error and security were not addressed in this article. You can make your application as highly available as possible, but if you have not taken steps to secure your application against external threats and stupid human mistakes then all bets are off when it comes to availability.


Step-by-Step: iSCSI Target Server Cluster in Azure

I recently helped someone build an iSCSI target server cluster in Azure and realized that I never wrote a step-by-step guide for that particular configuration. So to remedy that, here are the step-by-step instructions in case you need to do this yourself.

Pre-requisites

I’m going to assume you are fairly familiar with Azure and Windows Server, so I will spare you some of the details. Let’s assume you have at least done the following as prerequisites:

  • Provision two servers (SQL1, SQL2), each in a different Availability Zone (an Availability Set is also possible, but Availability Zones have a better SLA)
  • Assign static IP addresses to them through the Azure portal
  • Join the servers to an existing domain
  • Enable the Failover Clustering feature and the iSCSI Target Server feature on both nodes
  • Add three Azure Premium Disks to each node.
    NOTE: this is optional, one disk is the minimum required. For increased IOPS we are going to stripe three Premium Azure Disks together in a storage pool and create a simple (RAID 0) virtual disk
  • SIOS DataKeeper is going to be used to provide the replicated storage used in the cluster. If you need DataKeeper you can request a trial here.

Create local Storage Pool

Once again, this step is completely optional, but for increased IOPS we are going to stripe together three Azure Premium Disks into a single Storage Pool. You might be tempted to use Dynamic Disks and a spanned volume instead, but don’t do that! If you use dynamic disks you will find that there is an incompatibility that will prevent you from creating iSCSI targets later.

Don’t worry, creating a local Storage Pool is pretty straightforward if you are aware of the pitfalls you might encounter, as described below. The official documentation can be found here.

Pitfall #1 – although the documentation says the minimum size for a volume to be used in a storage pool is 4 GB, I found that the P1 Premium Disk (4GB) was NOT recognized. So in my lab I used 16GB P3 Premium Disks.

Pitfall #2 – you HAVE to have at least three disks to create a Storage Pool.

Pitfall #3 – create your Storage Pool before you create your cluster. If you try to do it after you create your cluster you are going to wind up with a big mess as Microsoft tries to create a clustered storage pool for you. We are NOT going to create a clustered storage pool, so avoid that mess by creating your Storage Pool before you create the cluster. If you have to add a Storage Pool after the cluster is created you will first have to evict the node from the cluster, then create the Storage Pool.

Based on the documentation found here, below are the screenshots that represent what you should see when you build your local Storage Pool on each of the two cluster nodes. Complete these steps on both servers BEFORE you build the cluster.

You should see the Primordial pool on both servers.
Right-click and choose New Storage Pool…
Choose Create a virtual disk when this wizard closes
Notice here you could create storage tiers if you decided to use a combination of Standard, Premium and Ultra SSD
For best performance use Simple storage layout (RAID 0). Don’t be concerned about reliability since Azure Managed Disks have triple redundancy on the backend. Simple is required for optimal performance.
For performance purposes use Fixed provisioning. You are already paying for the full Premium disk anyway, so there is no reason not to use it all.
Now you will have a 45 GB X drive on your first node. Repeat this entire process for the second node.
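If you would rather script the Storage Pool than click through the wizard, a rough PowerShell sketch of the equivalent steps follows. The pool, disk, and volume names are arbitrary; run it on each node before creating the cluster.

# Create a Simple (RAID 0), fixed-provisioned virtual disk from all poolable disks and mount it as X:
$disks = Get-PhysicalDisk -CanPool $true
New-StoragePool -FriendlyName "Pool1" -StorageSubSystemFriendlyName "Windows Storage*" -PhysicalDisks $disks
New-VirtualDisk -StoragePoolFriendlyName "Pool1" -FriendlyName "DataDisk" -ResiliencySettingName Simple -ProvisioningType Fixed -UseMaximumSize
Get-VirtualDisk -FriendlyName "DataDisk" | Get-Disk |
    Initialize-Disk -PartitionStyle GPT -PassThru |
    New-Partition -DriveLetter X -UseMaximumSize |
    Format-Volume -FileSystem NTFS -NewFileSystemLabel "Data" -Confirm:$false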

Create your Cluster

Now that each server has its own 45 GB X drive, we are going to create the basic cluster. Creating a cluster in Azure is best done via PowerShell so that we can specify a static IP address. If you do it through the GUI you will soon realize that Azure assigns your cluster IP a duplicate IP address that you will have to clean up, so don’t do that!

Here is an example of the PowerShell code to create a new cluster.

 New-Cluster -Name mycluster -NoStorage -StaticAddress 10.0.0.100 -Node sql1, sql2

The output will look something like this.

PS C:\Users\dave.DATAKEEPER> New-Cluster -Name mycluster -NoStorage -StaticAddress 10.0.0.100 -Node sql1, sql2
WARNING: There were issues while creating the clustered role that may prevent it from starting. For more information view the report file below.
WARNING: Report file location: C:\windows\cluster\Reports\Create Cluster Wizard mycluster on 2020.05.20 At 16.54.45.htm

Name     
----     
mycluster

The warning in the report will tell you that there is no witness. Because there is no shared storage in this cluster you will have to create either a Cloud Witness or a File Share Witness. I’m not going to walk you through that process as it is pretty well documented at those links.

Don’t put this off, go ahead and create the witness now before you move to the next step!

You now should have a basic 2-node cluster that looks something like this.

Configure a Load Balancer for the Cluster Core IP Address

Clusters in Azure are unique in that the Azure virtual network does not support gratuitous ARP. Don’t worry if you don’t know what that means, all you have to really know is that cluster IP addresses can’t be reached directly. Instead, you have to use an Azure Load Balancer, which redirects the client connection to the active cluster node.

There are two steps to getting a load balancer configured for a cluster in Azure. The first step is to create the load balancer. The second step is to update the cluster IP address so that it listens for the load balancer’s health probe and uses a 255.255.255.255 subnet mask which enables you to avoid IP address conflicts with the ILB.

We will first create a load balancer for the cluster core IP address. Later we will edit the load balancer to also address the iSCSI cluster resource IP address that will be created at the end of this document.

Step 1 – Create a Standard Load Balancer

Notice that the static IP address we are using is the same address that we used to create the core cluster IP resource.

Once the load balancer is created you will edit the load balancer as shown below

Add the two cluster nodes to the backend pool
Add a health probe. In this example we use 59999 as the port. Remember that port, we will need it in the next step.
Create a new rule to redirect all HA ports. Make sure Floating IP is enabled.

Step 2 – Edit the cluster core IP address to work with the load balancer

As I mentioned earlier, there are two steps to getting the load balancer configured to work properly. Now that we have a load balancer, we have to run a PowerShell script on one of the cluster nodes. The following is an example of that script.

$ClusterNetworkName = “Cluster Network 1” 
$IPResourceName = “Cluster IP Address” 
$ILBIP = “10.0.0.100” 
Import-Module FailoverClusters
Get-ClusterResource $IPResourceName | Set-ClusterParameter -Multiple @{Address=$ILBIP;ProbePort=59999;SubnetMask="255.255.255.255";Network=$ClusterNetworkName;EnableDhcp=0}

The important thing about the script above, besides getting all the variables correct for your environment, is making sure the ProbePort is set to the same port you defined in your load balancer settings for this particular IP address. You will see later that we will create a 2nd health probe for the iSCSI cluster IP resource that will use a different port. The other important thing is making sure you leave the subnet set at 255.255.255.255. It may look wrong, but that is what it needs to be set to.

After you run it the output should look like this.

 PS C:\Users\dave.DATAKEEPER> $ClusterNetworkName = “Cluster Network 1” 
$IPResourceName = “Cluster IP Address” 
$ILBIP = “10.0.0.100” 
Import-Module FailoverClusters
Get-ClusterResource $IPResourceName | Set-ClusterParameter -Multiple @{Address=$ILBIP;ProbePort=59999;SubnetMask="255.255.255.255";Network=$ClusterNetworkName;EnableDhcp=0}
WARNING: The properties were stored, but not all changes will take effect until Cluster IP Address is taken offline and then online again.

You will need to take the core cluster IP resource offline and bring it back online again before it will function properly with the load balancer.

Assuming you did everything right in creating your load balancer, your Server Manager on both servers should list your cluster as Online as shown below.

Check Server Manager on both cluster nodes. Your cluster should show as “Online” under Manageability.

Install DataKeeper

I won’t go through all the steps here, but basically at this point you are ready to install SIOS DataKeeper on both cluster nodes. It’s a pretty simple setup, just run the setup and choose all the defaults. If you run into any problems with DataKeeper it is usually one of two things. The first issue is the service account. You need to make sure the account you are using to run the DataKeeper service is in the Local Administrators Group on each node.

The second issue is in regards to firewalls. Although the DataKeeper install will update the local Windows Firewall automatically, if your network is locked down you will need to make sure the cluster nodes can communicate with each other across the required DataKeeper ports. In addition, you need to make sure the ILB health probe can reach your servers.

Once DataKeeper is installed you are ready to create your first DataKeeper job. Complete the following steps for each volume you want to replicate using the DataKeeper interface

Use the DataKeeper interface to connect to both servers
Click on create new job and give it a name
Click Yes to register the DataKeeper volume in the cluster
Once the volume is registered it will appear in Available Storage in Failover Cluster Manager

Create the iSCSI target server cluster

In this next step we will create the iSCSI target server role in our cluster. In an ideal world I would have a Powershell script that does all this for you, but for sake of time for now I’m just going to show you how to do it through the GUI. If you happen to write the Powershell code please feel free to share with the rest of us!

There is one problem with the GUI method. You will wind up with a duplicate IP address when the IP Resource is created, which will cause your cluster resource to fail until we fix it. I’ll walk through that process as well.

Go to the Properties of the failed IP Address resource and choose Static IP and select an IP address that is not in use on your network. Remember this address, we will use it in our next step when we update the load balancer.

You should now be able to bring the iSCSI cluster resource online.

Update load balancer for iSCSI target server cluster resource

Like I mentioned earlier, clients can’t connect directly to the cluster IP address (10.0.0.110) we just created for the iSCSI target server cluster. We will have to update the load balancer we created earlier as shown below.

Start by adding a new frontend IP address that uses the same IP address that the iSCSI Target cluster IP resource uses.
Add a second health probe on a different port. Remember this port number, we will use it again in the powershell script we run next
We add one more load balancing rule. Make sure to change the Frontend IP address and Health probe to use the ones we just created. Also make sure direct server return is enabled.

The final step to allow the load balancer to work is to run the following Powershell script on one of the cluster nodes. Make sure you use the new Healthprobe port, IP address and IP Resource name.

 $ClusterNetworkName = “Cluster Network 1” 
$IPResourceName = “IP Address 10.0.0.0” 
$ILBIP = “10.0.0.110” 
Import-Module FailoverClusters
Get-ClusterResource $IPResourceName | Set-ClusterParameter -Multiple @{Address=$ILBIP;ProbePort=59998;SubnetMask="255.255.255.255";Network=$ClusterNetworkName;EnableDhcp=0} 

Your output should look like this.

 PS C:\Users\dave.DATAKEEPER> $ClusterNetworkName = “Cluster Network 1” 
$IPResourceName = “IP Address 10.0.0.0” 
$ILBIP = “10.0.0.110” 
Import-Module FailoverClusters
Get-ClusterResource $IPResourceName | Set-ClusterParameter -Multiple @{Address=$ILBIP;ProbePort=59998;SubnetMask="255.255.255.255";Network=$ClusterNetworkName;EnableDhcp=0}
WARNING: The properties were stored, but not all changes will take effect until IP Address 10.0.0.0 is taken offline and then online again.

Make sure to take the resource offline and online for the settings to take effect.

Create your clustered iSCSI targets

Before you begin, it is best to check to make sure Server Manager from BOTH servers can see the two cluster nodes, plus the two cluster name resources, and they both appear “Online” under manageability as shown below.

If either server has an issue querying either of those cluster names then the next steps will fail. If there is a problem I would double check all the steps you took to create the load balancer and the Powershell scripts you ran.

We are now ready to create our first clustered iSCSI targets. From either of the cluster nodes, follow the steps illustrated below as an example of how to create iSCSI targets.

Of course, assign this to whichever server or servers will be connecting to this iSCSI target.
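If you do want to script this part, a rough PowerShell sketch of the same steps using the iSCSI Target cmdlets looks like the following. The target name, VHDX path, size, and initiator IQN are placeholders, the commands should be run on the node that currently owns the iSCSI Target Server role, and cmdlet parameter names can vary slightly between Windows Server versions, so check Get-Help New-IscsiVirtualDisk on your system.

# Create a virtual disk on the replicated X: volume, create a target, and map the two together (placeholder values)
New-IscsiVirtualDisk -Path "X:\iSCSIVirtualDisks\LUN1.vhdx" -SizeBytes 20GB
New-IscsiServerTarget -TargetName "SQLTarget" -InitiatorIds "IQN:iqn.1991-05.com.microsoft:sql3.contoso.local"
Add-IscsiVirtualDiskTargetMapping -TargetName "SQLTarget" -Path "X:\iSCSIVirtualDisks\LUN1.vhdx"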

And there you have it, you now have a functioning iSCSI target server in Azure.

If you build this, leave a comment and let me know how you plan to use it!


Achieving application consistent recovery points of SQL Server 2008 R2 with Azure Site Recovery in Azure

If you want to use ASR to replicate SQL Server 2008 R2 standalone or clustered instances, you will need to update the SQL Writer  to 2012 or later.

You can use the SQL Server Express version, as it is a free download.

https://www.microsoft.com/en-us/download/details.aspx?id=29062

Once downloaded, navigate to the download location and run the executable with /x.  This will give you an option to specify a location to extract the files to.

ENU\x64\SQLEXPRADV_x64_ENU.exe /x

Once the extraction completes, navigate to the extracted location and then to the following folder:

SQL\1033_enu_lp\x64\setup\x64

Within that folder you should find SQLWriter.msi.  Run this on the system where you want to update the SQL writer.
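If you need to push the updated writer to several nodes, an unattended install is a one-liner. This is just a sketch; add logging or other msiexec switches as you see fit.

msiexec /i SQLWriter.msi /qn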

You now will be able to use ASR to do application consistent recovery points of SQL Server 2008 R2.


Major Cloud Outage impacts Google Compute Engine – Were you prepared?

Google first reported an “Issue” on Jun 2, 2019 at 12:25 PDT. As is now common in any type of disaster, reports of this outage first appeared on social media. It seems now the most reliable place to get any type of information early in a disaster is social media.

Twitter is quickly becoming the first source of information on anything from revolutions and natural disasters to cloud outages.

Many services that rely on Google Compute Engine were impacted. With three teenage kids at home, I first knew something was up when all three kids emerged from their caves, aka, bedrooms, at the same time with a worried look on their faces. Snapchat, Youtube and Discord were all offline!

They must have thought that surely this was the first sign of the apocalypse. I reassured them that this was not the beginning of the new dark ages and that maybe they should go outside and do some yard work instead. That scared them back to reality and they quickly scurried away to find something else to occupy their time.

All kidding aside, there were many services being reported as down, or only available in certain areas. The dust is still settling on the cause and scope of the outage, but it certainly seems that the outage was pretty significant, impacting many customers and services including Gmail and other G-Suite services, Vimeo and more.

Many services were impacted by this outage, Gmail, YouTube and SnapChat just to name a few.

While we wait for the official root cause analysis on this latest Google Compute Engine outage, Google reported “high levels of network congestion in the eastern USA” caused the downtime. We will have to wait to see what they determine caused the network issues. Was it human error, cyber-attack, hardware failure, or something else?

Were you prepared?

As I wrote during the last major cloud outage, if you are running business critical workloads in the cloud, regardless of the cloud service provider, it is incumbent upon you to plan for the inevitable outage. The multi-day Azure outage of Sept 4th, 2018 was related to a failure of the secondary HVAC system to kick in during a power surge caused by an electrical storm. While the failure was just within a single datacenter, the outage exposed multiple services that had dependencies on this single datacenter, making that datacenter itself a single point of failure.

Leveraging the cloud’s infrastructure, you can minimize your risks by continuously replicating critical data between Availability Zones, Regions or even cloud service providers. In addition to data protection, having a procedure in place to rapidly recover business critical applications is an essential part of any disaster recovery plan. There are various replication and recovery options available, from services provided by the cloud vendor themselves like Azure Site Recovery, to application specific solutions like SQL Server Always On Availability Groups, to third party solutions like SIOS DataKeeper that protect a wide range of applications running on both Windows and Linux.

Having a disaster recovery strategy that is wholly dependent on a single cloud provider leaves you susceptible to a scenario that might impact multiple regions within a single cloud. Multi-datacenter or multi-region disasters are not likely. However, as we saw with this recent outage and the Azure outage last fall, even if a failure is local to a single datacenter, the impact can be wide reaching across multiple datacenters or even regions within a cloud. To minimize your risks, you may want to consider a multi-cloud or hybrid cloud scenario where your disaster recovery site resides outside of your primary cloud platform.

The cloud is just as susceptible to outages as your own datacenter. You must take steps to prepare for disasters. I suggest you start by looking at your most business critical apps first. What would you do if they were offline and the cloud portal to manage them was not even available? Could you recover? Would you meet your RTO and RPO objectives? If not, maybe it is time to re-evaluate your DR strategy.

“By failing to prepare, you are preparing to fail.”

Benjamin Franklin


New Azure “SQL Server settings” blade in the Azure Portal

I just noticed today that there is a new blade in the Azure portal when creating a new SQL Server virtual machine. I’ve been looking for an announcement regarding this new Azure portal experience, but I haven’t found one yet. This feature wasn’t available when I took the screen shots for my last post on creating a SQL Server 2008 R2 FCI in Azure on April 19th, so it must be relatively new.

New Azure “SQL Server Settings” blade on the Azure portal

Most of the settings are pretty self explanatory. Under Security and Networking you can specify the port you want SQL to listen on. It also appears as if the Azure Security Group will be updated to allow different levels of access to the SQL instance: Local, Private or Public. Authentication options are also exposed in this new SQL Server settings blade.

Security, Networking and Authentication options are part of your SQL Server deployment

The rest of the features include licensing, patching and backup options. In addition, if you are deploying the Enterprise Edition of SQL Server 2016 or later you also have the option to enable SQL Server R Services for advanced analytics.

Licensing, Patching, Backup and R Services options can be automatically configured

All of those options are welcome additions to the Azure portal experience when provisioning a new SQL Server instance. I’m sure the seasoned DBA probably has a list of a few dozen other options they would like to tweak before a SQL Server deployment, but this is certainly a step in the right direction.

Storage Configuration Options

The most interesting new feature I have found on this blade is the Storage Configuration option.

When you click on Change Configuration, you get the following blade.

As you slide the IOPS slider to the right you will see the number of data disks increase, the Storage Size increase, and the Throughput increase. You will be limited to the max number of IOPS and disks supported by that instance size. You see in the screenshot below I am able to go as high as 80,000 IOPS when provisioning storage for a Standard E64-16s_v3 instance.

The Standard E64-16s_v3 instance size supports up to 80,000 IOPS

There is also a “Storage optimization” option. I haven’t tried all the different combinations to know exactly what the Storage optimization setting does. If you know how the different options change the storage configuration, leave me a comment, or we will just wait for the official documentation to be released.

For my test, I provisioned a Standard DS13 v2 instance and maxed out the IOPS at 25600, the max IOPS for that instance size. I also optimized the storage for Transactional processing.

I found that when this instance was provisioned, six P30 premium disks were attached to the instance. This makes sense, since each P30 delivers 5000 IOPS, so it would take at least six of them to deliver the 25,600 IOPS requested. This also increased the Storage Size to 6 TB, since each P30 gives you 1 TB of storage space. Read-only host caching was also enabled on these disks.

The six disks were automatically provisioned and attached to the instance

When I logged in to the instance to see what Azure had done with those disks, I found that it had done exactly what I would have done; it created a single Storage Pool with the six P30 disks, created a Simple (aka RAID 0) Storage Space, and provisioned a single 6 TB F:\ drive.

This storage configuration wizard validates some of the cloud storage assumptions I made in my previous blog post, Storage Considerations for Running SQL Server in Azure. It seems like a single, large disk should suffice in most circumstances.

A Simple Storage Space consisting of the six P30s are presented as a single F:\ drive

I have found this storage optimization is not available in every Azure Marketplace offering. For example, if you are moving SQL Server 2008 R2 to Azure for the extended security updates you will find that this storage optimization is not available in the SQL2008R2/Windows Server 2008 R2 Azure Marketplace image. Of course, Storage Spaces was not introduced until Windows Server 2012, so that makes sense. I did verify that this option is available with the SQL Server 2012 SP4 on Windows Server 2012 R2 Azure Marketplace offering.

There is a minor inconvenience however. In addition to adding this new Storage configuration option on the SQL Server settings blade, they also removed the option to add Data Disks on the Disks blade. So if I wanted to provision additional storage without creating a Storage Space, I would have to create the instance first and then come back and add Data disks after the virtual machine is provisioned.

Final thoughts

All of the SQL Server configuration options in this new Azure blade are welcome additions. I would love to see the list of tunable settings grow. The information text should include guidance on current best practices for each tunable.

What SQL Server or Windows OS tunables would you like to see exposed as part of the provisioning process to make your life as a DBA easier? Not only would these tunables make your life easier, but they would also make the junior DBA look like a seasoned pro by guiding them through all the current SQL Server configuration best practices.

I think the new Storage configuration option is probably the most compelling new addition. Prior to the Storage configuration wizard, users had to be aware of the limits of their instance size, the limits of the storage they were adding, and have the wherewithal to stripe together multiple disks in a Simple Storage Space to get the maximum IOPS. A few years ago I put together a simple Azure Storage Calculator to help people make these decisions. My calculator is currently outdated, but this new Storage configuration option may make it obsolete anyway.

I would love to see this Storage configuration wizard included as a standard offering in the Disks blade of every Windows instance type, rather than just the SQL Server instances. I would let the user choose to use the new Storage configuration  “Wizard” experience, or the “Classic” experience where you manually add and manage storage.


Microsoft Build 2019 Announcements and Sessions on Demand

If you are like me and were unable to get away from the office to attend Microsoft Build 2019 you will be glad to know that Microsoft has published all the sessions and are available online at no charge.

Because Build is a developer-focused conference, most of the announcements were geared towards developers. You can see a complete list of searchable announcements here. https://azure.microsoft.com/en-us/updates/?updatetype=microsoft-build&Page=1

I’m more of an infrastructure guy, so some of the more interesting announcements to me were the following.

Azure VMware Solutions is now generally available – I guess if I was heavily invested in VMware and was looking to expand into Azure this announcement would certainly open up some interesting possibilities. It looks like if I use a bare-metal or dedicated instance I can basically have an ESX host running in Azure. I guess if you are doing a hybrid-cloud deployment and want to easily move workloads back and forth between on-prem and Azure, this makes sense. If this excites you, leave me a comment telling me why.

Azure Quickstart Center enables new customers to build cloud projects with confidence – I haven’t looked into this yet, but this looks VERY interesting to me if it is what I think it is. As cloud adoption continues to grow so does the required skill set of the IT professional. Infrastructure as Code (IaC) is one of those skills the wise IT professional will want to become comfortable with.

Just two years ago I would talk to customers and this skill set was non-existent, or really just something the largest IT consultants or cloud providers had any experience in. Over the past year I have seen this skill set become more common with the customers I work with, and it is often the preferred deployment method. The IaC technology has also matured over that period.

I predict that if you don’t already manage your cloud deployment with IaC you will be in the near future. I’m hopeful that this new offering from Microsoft could be a great intro to those IT professionals looking to gain some IaC experience and knowledge. After I have had a look at it I’ll post a follow-up article.
