Extending a SQL Server Failover Cluster Across Regions in Google Cloud Platform (GCP)

October 29, 2024 davebermCloud, Cluster, database, GCP, Google SQL Cluster, SQL, sql-server1 Comment

I was the principal author of this SIOS whitepaper, which describes how to build a 2-node SQL Server cluster in Google Cloud Platform (GCP) spanning multiple zones. Today, I’ll explain how to extend this cluster by adding a third node in a different GCP region.

Assuming you’ve completed all the steps in the referenced documentation, here’s how to proceed:

1. Create Subnet

Create a new subnet in the additional region.

gcloud compute networks subnets create wsfcnetsub1 \
    --network wsfcnet \
    --region us-west1 \
    --range 10.1.0.0/16

2. Configure Firewall for Internal Communication

Open up firewall rules to allow internal communication on the new subnet.

gcloud compute firewall-rules create allow-all-subnet-10-1 \
    --network wsfcnet \
    --allow all \
    --source-ranges 10.1.0.0/16

3. Verify Available Zones

Check which zones exist in the new region (e.g., us-west1).

gcloud compute zones list \
    --filter="region:(us-west1)" \
    --format="value(name)"

4. Check VM Types Supporting Local SSD

Find the VM types that support local SSD in your target zone.

gcloud compute machine-types list \
    --zones us-west1-a \
    --filter="guestCpus>0 AND name:(n2 n2-highmem n2-highcpu c2 e2-custom e2-custom-memory e2-custom-cpu)" \
    --sort-by=guestCpus

5. Create New VM Instance

Create the new VM in the chosen zone with appropriate settings.

gcloud compute instances create wsfc-3 \
 --zone us-west1-a \
 --machine-type n2-standard-2 \
 --image-project windows-cloud \
 --image-family windows-2016 \
 --scopes https://www.googleapis.com/auth/compute \
 --can-ip-forward \
 --private-network-ip 10.1.0.4 \
 --network wsfcnet \
 --subnet wsfcnetsub1 \
 --metadata enable-wsfc=true \
 --local-ssd interface=nvme \
 --create-disk=auto-delete=yes,boot=no,device-name=extradisk1,size=1,type=pd-standard

6. Set Windows Password

Set a Windows password for accessing the new VM.

gcloud compute reset-windows-password wsfc-3 \
    --zone us-west1-a \
    --user david

7. Configure Windows Networking

On wsfc-3, run these PowerShell commands to configure the network.

# Remove existing DHCP configuration
Remove-NetIPAddress -InterfaceAlias "Ethernet 2" -AddressFamily IPv4 -Confirm:$false
# Set static IP
New-NetIPAddress -InterfaceAlias "Ethernet 2" -IPAddress 10.1.0.4 -PrefixLength 16 -DefaultGateway 10.1.0.1
# Set DNS server
Set-DnsClientServerAddress -InterfaceAlias "Ethernet 2" -ServerAddresses 10.0.0.6

8. Bring Extra Storage Online

Run these PowerShell commands to initialize and format the additional storage.

# Identify the new disk
$disks = Get-Disk | Where-Object IsOffline -Eq $false
$newDisk = $disks | Where-Object PartitionStyle -Eq 'RAW'
# Initialize and format new disk
Initialize-Disk -Number $newDisk.Number -PartitionStyle MBR
New-Partition -DiskNumber $newDisk.Number -UseMaximumSize -AssignDriveLetter | Format-Volume -FileSystem NTFS -NewFileSystemLabel "ExtraDisk"

9. Ensure Local SSD Attaches on Startup

Use this script to format and mount the local SSD on startup. Call it AttachSSD.ps1 and save it here. C:\Scripts\AttachSSD.ps1

$diskNumber = 0
$driveLetter = "F"
# Check drive letter availability
if (Get-PSDrive -Name $driveLetter -ErrorAction SilentlyContinue) {
    exit
}
# Initialize and format if RAW
$disk = Get-Disk -Number $diskNumber -ErrorAction SilentlyContinue
while (-not $disk) {

    Start-Sleep -Seconds 5

    $disk = Get-Disk -Number $diskNumber -ErrorAction SilentlyContinue

}
if ($disk.PartitionStyle -eq 'RAW') {

    Initialize-Disk -Number $diskNumber -PartitionStyle GPT

}

$existingPartition = Get-Partition -DiskNumber $diskNumber -ErrorAction SilentlyContinue

if (-not $existingPartition) {

    $partition = New-Partition -DiskNumber $diskNumber -UseMaximumSize -AssignDriveLetter

    Format-Volume -DriveLetter $partition.DriveLetter -FileSystem NTFS -NewFileSystemLabel "LocalSSD"

    Set-Partition -DriveLetter $partition.DriveLetter -NewDriveLetter $driveLetter

}

Schedule Script to Run at Startup:

$taskName = "InitializeLocalSSD"
$scriptPath = "C:\Scripts\AttachSSD.ps1"
$action = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-NoProfile -ExecutionPolicy Bypass -File `"$scriptPath`""

$trigger = New-ScheduledTaskTrigger -AtStartup

$principal = New-ScheduledTaskPrincipal -UserId "SYSTEM" -LogonType ServiceAccount -RunLevel Highest

Register-ScheduledTask -TaskName $taskName -Action $action -Trigger $trigger -Principal $principal

10. Join the Domain

Join the new server to the domain.

$domain = "datakeeper.local"
$username = "administrator@datakeeper.local"

$password = "YourPassword!"

$securePassword = ConvertTo-SecureString $password -AsPlainText -Force

$credential = New-Object System.Management.Automation.PSCredential ($username, $securePassword)

Add-Computer -DomainName $domain -Credential $credential -Force

11. Add a new mirror to the existing Datakeeper job

Follow the screenshots below to extend the existing DataKeeper job to the new server.

12. Install Windows Failover Clustering and add to the cluster

Install clustering tools and add the node to the cluster.

Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools
Add-ClusterNode -Cluster fileserver.datakeeper.local -Name wsfc-3

Test-Cluster -Name fileserver.datakeeper.local

Note: Adding the node through the WSFC UI may be necessary if there’s a load balancer issue with the cluster name.

13. Install SQL Server on the new cluster node

Run SQL Server setup and choose add node to existing cluster installation option. https://learn.microsoft.com/en-us/sql/sql-server/failover-clusters/install/add-or-remove-nodes-in-a-sql-server-failover-cluster-setup?view=sql-server-ver16

14. Create New Load Balancer Rules

Configure load balancing to allow access across regions.

gcloud compute instance-groups unmanaged create wsfc-group-3 --zone=us-west1-a

gcloud compute instance-groups unmanaged add-instances wsfc-group-3 --instances wsfc-3 --zone us-west1-a

# Set environment variables
export REGION="us-west1"
export NETWORK="wsfcnet"
export SUBNET="wsfcnetsub1"
export LB_NAME="swfc-lbwest"
export HEALTH_CHECK_NAME="wsfc-hc-west"
export HEALTH_CHECK_PORT="59998"
export BACKEND_PORTS="1433"
export BACKEND_IP="10.1.0.9"
export BACKEND_RESPONSE="1"
export LB_INTERNAL_IP="10.1.0.9"

# Step 1: Create Health Check
gcloud compute health-checks create tcp $HEALTH_CHECK_NAME \
 --region=$REGION \
 --port=$HEALTH_CHECK_PORT \
 --check-interval=2s \
 --timeout=1s \
 --unhealthy-threshold=2 \
 --healthy-threshold=2

# Step 2: Create Backend Service
gcloud compute backend-services create $LB_NAME-backend \
 --load-balancing-scheme=INTERNAL \
 --protocol=TCP \
 --region=$REGION \
 --health-checks=$HEALTH_CHECK_NAME \
 --health-checks-region=$REGION

# Step 3: Add Backend Instance Group
gcloud compute backend-services add-backend $LB_NAME-backend \
 --instance-group=wsfc-group-3 \
 --instance-group-zone=us-west1-a \
 --region=$REGION

# Step 4: Create Frontend Configuration (Forwarding Rule)
gcloud compute forwarding-rules create $LB_NAME \
 --load-balancing-scheme=INTERNAL \
 --ports=$BACKEND_PORTS \
 --network=$NETWORK \
 --subnet=$SUBNET \
 --region=$REGION \
 --backend-service=$LB_NAME-backend \
 --ip-protocol=TCP \
 --address=$LB_INTERNAL_IP
 --allow-global-access

Update the original load balancing forwarding rule to allow global access

gcloud compute forwarding-rules update wsfc-lb --region=us-east1 --allow-global-access

This completes the steps to extend your SQL Server cluster across multiple GCP regions. In the future I’ll probably write a single comprehensive how-to guide, but for now, let me know if you have any questions!

How to use Azure Site Recovery (ASR) to replicate a Windows Server Failover Cluster (WSFC) that uses SIOS DataKeeper for cluster storage

October 12, 2022October 7, 2022 davebermasr, azuresiterecovery, Cluster, disasterrecovery, dr, SAPLeave a comment

Intro

So you have built a SQL Server Failover Cluster Instance (FCI), or maybe an SAP ASCS/ERS cluster in Azure. Each node of the cluster resides in a different Availability Zone (AZ), or maybe you have strict latency requirements and are using Placement Proximity Groups (PPG) and your nodes all reside in the same Availability Set. Regardless of the scenario, you now have a much higher level of availability for your business critical application than if you were just running a single instance.

Now that you have high availability (HA) covered, what are you going to do for disaster recovery? Regional disasters that take out multiple AZs are rare, but as recent history has shown us, Mother Nature can really pack a punch. You want to be prepared should an entire region go offline.

Azure Site Recovery (ASR) is Microsoft’s disaster recovery-as-a-service (DRaaS) offering that allows you to replicate entire VMs from one region to another. It can also replicate virtual machines and physical servers from on-prem into Azure, but for the purpose of this blog post we will focus on the Azure Region-to-Region DR capabilities.

Setting up ASR

We are going to assume you have already built your cluster using SIOS DataKeeper. If not, here are some pointers to help get you started.

Failover Cluster Instances with SQL Server on Azure VMs

SIOS DataKeeper Cluster Edition for the SAP ASCS/SCS cluster share disk

We are also going to assume you are familiar with Azure Site Recovery. Instead of yet another guide on setting up ASR, I suggest you read the latest documentation from Microsoft. This article will focus instead on some things you may not have considered and the specific steps required to fix your cluster after a failover to a different subnet.

Paired Regions

Before you start down the DR path, you should be aware of the concept of Azure Paired Regions. Every Region in Azure has a preferred DR Region. If you want to learn more about Paired Regions, the documentation provides a great background. There are some really good benefits of using your paired region, but it’s really up to you to decide on what region you want to use to host your DR site.

Cloud Witness Location

When you originally built your cluster you had to choose a witness type for your quorum. You may have selected a File Share Witness or a Cloud Witness. Typically either of those witness types should reside in an AZ that is separate from your cluster nodes.

However, when you consider that, in the event of a disaster, your entire cluster will be running in your DR region, there is a better option. You should use a cloud witness, and place it in your DR region. By placing your cloud witness in your DR region, you provide resiliency not only for local AZ failures, but it also protects you should the entire region fail and you have to use ASR to recover your cluster in the DR region. Through the magic of Dynamic Quorum and Dynamic Witness, you can be sure that even if your DR region goes offline temporarily, it will not impact your production cluster.

Multi-VM Consistency

When using ASR to replicate a cluster, it is important to enable Multi-VM Consistency to ensure that each cluster node’s recovery point is from the same point in time. That ensures that the DataKeeper block level replication occurring between the VMs will be able to continue after recovery without requiring a complete resync.

Crash Consistent Recovery Points

Application consistent recovery points are not supported in replicated clusters. When configuring the ASR replication options do not enable application consistent recovery points.

Keep IP Address After Failover?

When using ASR to replicate to your DR site there is a way to keep the IP address of the VMs the same. Microsoft described it in the article entitled Retain IP addresses during failover. If you can keep the IP address the same after failover it will simplify the recovery process since you won’t have to fix any cluster IP addresses or DataKeeper mirror endpoints, which are based on IP addresses.

However, in my experience, I have never seen anyone actually follow the guidance above, so recovering a cluster in a different subnet will require a few additional steps after recovery before you can bring the cluster online.

Your First Failover Attempt

Recovery Plan

Because you are using Multi-VM Consistency, you have to failover your VMs using a Recovery Plan. The documentation provides pretty straightforward guidance on how to do that. A Recovery Plan groups the VMs you want to recover together to ensure they all failover together. You can even add multiple groups of VMs to the same Recovery Plan to ensure that your entire infrastructure fails over in an orderly fashion.

A Recovery Plan can also launch post recovery scripts to help the failover complete the recovery successfully. The steps I describe below can all be scripted out as part of your Recovery Plan, thereby fully automating the complete recovery process. We will not be covering that process in this blog post, but Microsoft documents this process.

Static IP Addresses

As part of the recovery process you want to make sure the new VMs have static IP addresses. You will have to adjust the interface properties in the Azure Portal so that the VM always uses the same address. If you want to add a public IP address to the interface you should do so at this time as well.

Network Configuration

After the replicated VMs are successfully recovered in the DR site, the first thing you want to do is verify basic connectivity. Is the IP configuration correct? Are the instances using the right DNS server? Is name resolution functioning correctly? Can you ping the remote servers?

If there are any problems with network communications then the rest of the steps described below will be bound to fail. Don’t skip this step!

Load Balancer

As you probably know, clusters in Azure require you to configure a load balancer for client connectivity to work. The load balancer does not fail over as part of the Recovery Plan. You need to build a new load balancer based on the cluster that now resides in this new vNet. You can do this manually or script this as part of your Recovery Plan to happen automatically.

Network Security Groups

Running in this new subnet also means that you have to specify what Network Security Group you want to apply to these instances. You have to make sure the instances are able to communicate across the required ports. Again, you can do this manually, but it would be better to script this as part of your Recovery Plan.

Fix the IP Cluster Addresses

If you are unable to make the changes described earlier to recover your instances in the same subnet, you will have to complete the following steps to update your cluster IP addresses and the DataKeeper addresses for use in the new subnet.

Every cluster has a core cluster IP address. What you will see if you launch the WSFC UI after a failover is that the cluster won’t be able to connect. This is because the IP address used by the cluster is not valid in the new subnet.

If you open the properties of that IP Address resource you can change the IP address to something that works in the new subnet. Make sure to update the Network and Subnet Mask as well.

Once you fix that IP Address you will have to do the same thing for any other cluster address that you use in your cluster resources.

Fix the DataKeeper Mirror Addresses

SIOS DataKeeper mirrors use IP addresses as mirror endpoints. These are stored in the mirror and mirror job. If you recover a DataKeeper based cluster in a different subnet, you will see that the mirror comes up in a Resync Pending state. You will also notice that the Source IP and the Target IP reflect the original subnet, not the subnet of the DR site.

Fixing this issue involves running a command from SIOS called CHANGEMIRRORENDPOINTS. The usage for CHANGEMIRRORENDPOINTS is as follows.

emcmd <NEW source IP> CHANGEMIRRORENDPOINTS <volume letter> <ORIGINAL target IP> <NEW source IP> <NEW target IP>

In our example, the command and output looked like this.

After the command runs, the DataKeeper GUI will be updated to reflect the new IP addresses as shown below. The mirror will also go to a mirroring state.

Conclusions

You have now successfully configured and tested disaster recovery of your business critical applications using a combination of SIOS DataKeeper for high availability and Azure Site Recovery for disaster recovery. If you have questions please leave me a comment or reach out to me on Twitter @daveberm

Performance of Azure Shared Disk with Zone Redundant Storage (ZRS)

May 11, 2022 daveberm1 Comment

On September 9th, 2021, Microsoft announced the general availability of Zone-Redundant Storage (ZRS) for Azure Disk Storage, including Azure Shared Disk.

What makes this interesting is that you can now build shared storage based failover cluster instances that span Availability Zones (AZ). With cluster nodes residing in different AZs, users can now qualify for the 99.99% availability SLA. Prior to support for ZRS, Azure Shared Disks only supported Locally Redundant Storage (LRS), limiting cluster deployments to a single AZ, leaving users susceptible to outages should an AZ go offline.

There are however a few limitations to be aware of when deploying an Azure Shared Disk with ZRS.

Only supported with premium solid-state drives (SSD) and standard SSDs. Azure Ultra Disks are not supported.
Azure Shared Disks with ZRS are currently only available in West US 2, West Europe, North Europe, and France Central regions
Disk Caching, both read and write, are not supported with Premium SSD Azure Shared Disks
Disk bursting is not available for premium SSD
Azure Site Recovery support is not yet available.
Azure Backup is available through Azure Disk Backup only.
Only server-side encryption is supported, Azure Disk Encryption is not currently supported.

I also found an interesting note in the documentation.

“Except for more write latency, disks using ZRS are identical to disks using LRS, they have the same scale targets. Benchmark your disks to simulate the workload of your application and compare the latency between LRS and ZRS disks.”

While the documentation indicates that ZRS will incur some additional write latency, it is up to the user to determine just how much additional latency they can expect. A link to a disk benchmark document is provided to help guide you in your performance testing.

Following the guidance in the document, I used DiskSpd to measure the additional write latency you might experience. Of course results will vary with workload, disk type, instance size, etc.,but here are my results.

	Locally Redundant Storage (LRS)	Zone Redundant Storage (ZRS)
Write IOPS	5099.82	4994.63
Average Latency	7.830	7.998

The DiskSpd test that I ran used the following parameters.

diskspd -c200G -w100 -b8K -F8 -r -o5 -W30 -d10 -Sh -L testfile.dat

I wrote to a P30 disk with ZRS and a P30 with LRS attached to a Standard DS3 v2 (4 vcpus, 14 GiB memory) instance type. The shared ZRS P30 was also attached to an identical instance in a different AZ and added as shared storage to an empty cluster application.

A 2% overhead seems like a reasonable price to pay to have your data distributed synchronously across two AZs. However, I did wonder what would happen if you moved the clustered application to the remote node, effectively putting your disk in one AZ and your instance in a different AZ.

Here are the results.

	Locally Redundant Storage (LRS)	Zone Redundant Storage (ZRS)	ZRS when writing from the remote AZ
Write IOPS	5099.82	4994.63	4079.72
Average Latency	7.830	7.998	9.800

In that scenario I measured a 25% write latency increase. If you experience a complete failure of an AZ, both the storage and the instance will failover to the secondary AZ and you shouldn’t experience this increase in latency at all. However, other failure scenarios that aren’t AZ wide could very well have your clustered application running in one AZ with your Azure Shared Disk running in a different AZ. In those scenarios you will want to move your clustered workload back to a node that resides in the same AZ as your storage as soon as possible to avoid the additional overhead.

Microsoft documents how to initiate a storage account failover to a different region when using GRS, but there is no way to manually initiate the failover of a storage account to a different AZ when using ZRS. You should monitor your failover cluster instance to ensure you are alerted any time a cluster workload moves to a different server and plan to move it back just as soon as it is safe to do so.

You can find yourself in this situation unexpectedly, but it will also certainly happen during planned maintenance of the clustered application servers when you do a rolling update. Awareness is the key to help you minimize the amount of time your storage is performing in a degraded state.

I hope in the future Microsoft allows users to initiate a manual failover of a ZRS disk the same as they do with GRS. The reason they added the feature to GRS was to put the power in the hands of the users in case automatic failover did not happen as expected. In the case of ZRS I could see people wanting to try to tie together storage and application, ensuring they are always running in the same AZ, similar to how host based replication solutions like SIOS DataKeeper do it.

Clustering SAP #ASCS and #ERS on #AWS using Windows Server Failover Clustering

July 27, 2021July 27, 2021 davebermLeave a comment

When ensuring high availability for SAP ASCS and ERS running on WIndows Server, the primary cluster solution you will want to use is Windows Server Failover Clustering. However, when doing this in AWS you will quickly discover that there are a few obstacles you need to know how to overcome when deploying this in AWS.

I recently wrote this Step-by-Step guide that was published on the SAP blog that walks you through the entire process. If you have any questions, please leave a comment.

How to create a DataKeeper Replicated Volume that has Multiple Targets via CLI

June 1, 2021June 1, 2021 daveberm2 Comments

I often help people automate the configuration of their infrastructure so they can build 3-node clusters that span Availability Zones and Regions. The CLI for creating a DataKeeper Job and associated mirrors that contain more than one target can be a little confusing, so I’m documenting it here in case you find yourself looking for this information. The DataKeeper documentation describes this as a Mirror with Multiple Targets.

The environment in this example looks like this:

PRIMARY (10.0.2.100) – in AZ1
SECONDARY (10.0.3.100) – in AZ2
DR (10.0.1.10) – in a different Region

I want to create a synchronous mirror from PRIMARY to SECONDARY and an asynchronous mirror from PRIMARY to DR. I also have to make sure the DataKeeper Job knows how to create a mirror from SECONDARY to DR in case the SECONDARY or DR server ever become the source of the mirror. EMCMD will be used to create this multiple target mirror.

We need to first create the Job that contains all this possible endpoints and define whether the mirror will be Sync (S) or Async (A) between those endpoints.

emcmd . createjob ddrive sqldata primary.datakeeper.local D 10.0.2.100 secondary.datakeeper.local D 10.0.3.100 S primary.datakeeper.local D 10.0.2.100 dr.datakeeper.local D 10.0.1.10 A secondary.datakeeper.local D 10.0.3.100 dr.datakeeper.local D 10.0.1.10 A

That single “createjob” command creates the Job. It might be a little easier to look at that command like this:

emcmd . createjob ddrive sqldata 
primary.datakeeper.local D 10.0.2.100 secondary.datakeeper.local D 10.0.3.100 S 
primary.datakeeper.local D 10.0.2.100 dr.datakeeper.local D 10.0.1.10 A 
secondary.datakeeper.local D 10.0.3.100 dr.datakeeper.local D 10.0.1.10 A

Next we need to create the mirrors.

emcmd 10.0.2.100 createmirror D 10.0.1.10 A
emcmd 10.0.2.100 createmirror D 10.0.3.100 S

Our DataKeeper Job should now look like this in the DataKeeper UI

One-to-many DataKeeper Replicated Volume

And then finally we can register the DataKeeper Volume Resource in the cluster Available Storage with this command.

emcmd . registerclustervolume D

The DataKeeper Volume Resource will now appear in Available Storage as shown below.

You are now ready to install SQL Server, SAP, File Server or any other clustered resource you normally protect with Windows Server Failover Clustering.

How to extend your existing SQL Server Failover Cluster Instance to the Cloud for Disaster Recovery

May 22, 2021May 23, 2021 davebermLeave a comment

I get asked this question all the time, so I figured it was time to write a blog post, record a video and write some code to automate the process so that it completes in under a minute.

First, some background. Typically when someone asks me how to do this, I point them to the DataKeeper documentation.

https://docs.us.sios.com/WindowsSPS/8.2/DKLE/DKCloudTechDoc/Content/DataKeeper/User_Guide/Extending_a_Traditional_2-Node_WSFC_Cluster_to_a_Third_Node_via_DataKeeper.htm.

This first document talks about extending the cluster and adding a 3rd node to the existing cluster. That’s fine if your cluster supports three nodes, but if you are using SQL Server Standard Edition, Microsoft limits you to a 2-node cluster. In the case of a 2-node cluster you can still replicate to a 3rd node, but the recovery will be more of a manual process. This process is described here.

https://docs.us.sios.com/WindowsSPS/8.6/SPS4W/TechDoc/Content/DataKeeper/User_Guide/Extend_to_Non_Cluster_Node.htm

People typically read these instructions and get a little worried. They feel like they would be performing open heart surgery on their cluster. It really is more like changing your shirt! You are simply replacing the Cluster Disk resource with a DataKeeper Volume resource. As you’ll see in the video below the process takes just a few seconds.

The code demonstrated in the video is show below.

Stop-ClusterGroup SQLServerGroup
Remove-ClusterResource -Name "Cluster Disk 1"
Set-Disk -Number 4 -IsOffline $False
Set-Disk -Number 4 -IsReadOnly $False
Import-Module -Name Storage
Set-Partition -DiskNumber 4 -PartitionNumber 1 -NewDriveLetter X
New-DataKeeperMirror -SourceIP 10.0.2.100 -SourceVolume X -TargetIP 10.0.1.10 -TargetVolume X -SyncType Sync
New-DataKeeperJob -JobName "x drive" -JobDescription "sql data" -Node1Name primary.datakeeper.local -Node1IP 10.0.2.100 -Node1Volume x -Node2Name dr.datakeeper.local -Node2IP 10.0.1.10 -Node2Volume X -SyncType Sync
Add-ClusterResource -Name "DataKeeper Volume X" -ResourceType "DataKeeper Volume" -Group "SQLServerGroup"
Get-ClusterResource "DataKeeper Volume X" | Set-ClusterParameter VolumeLetter X
Get-ClusterResource -Name 'SQLServer' | Add-ClusterResourceDependency -Provider 'DataKeeper Volume X'
Start-ClusterGroup SQLServerGroup

After you run that code don’t forget you also need to click on Manage Shared Volumes to add the backup node to the DataKeeper job as shown in the video.

If you have SQL Server Enterprise Edition then the final step would be to install SQL Server in the DR node and choose add node to existing cluster.

If you are using SQL Server Standard Edition then your job is done. You would simply follow these instructions to access you data on the 3rd node and then mount the replicated databases.

These directions are applicable whether your DR node is in the Cloud or your own DR site.

Using Amazon FSx for SQL Server Failover Cluster Instances – What You Need to Know!

January 8, 2021January 15, 2021 daveberm2 Comments

Intro

If you are considering deploying your own Microsoft SQL Server instances in AWS EC2 you have some decisions to make regarding the resiliency of the solution. Sure, AWS will offer you a 99.99% SLA on your Compute resources if you deploy two or more instances across different availability zones. But don’t be fooled, there are a lot of other factors you need to consider when calculating your true application availability. I recently blogged about how to calculate your application availability in the cloud. You probably should have a quick read of that article before you move on.

When it comes to ensuring your Microsoft SQL Server instance is highly available, it really comes down to two basic choices: Always On Availability Group (AG) or SQL Server Failover Cluster Instance (FCI). If you are reading this article I’m making an assumption you are well aware of both of these options and are seriously considering using a SQL Server FCI instead of a SQL Server Always On AG.

Benefits of a Microsoft SQL Server Failover Cluster Instance

The following list summarize what AWS says are the benefits of a SQL Server FCI:

FCI is generally preferable over AG for SQL Server high availability deployments when the following are priority concerns for your use case:

License cost efficiency: You need the Enterprise Edition license of SQL Server to run AGs, whereas you only need the Standard Edition license to run FCIs. This is typically 50–60% less expensive than the Enterprise Edition. Although you can run a Basic version of AGs on Standard Edition starting from SQL Server 2016, it carries the limitation of supporting only one database per AG. This can become a challenge when dealing with applications that require multiple databases like SharePoint.

Instance-level protection versus database-level protection: With FCI, the entire instance is protected – if the primary node becomes unavailable, the entire instance is moved to the standby node. This takes care of the SQL Server logins, SQL Server Agent jobs, certificates, etc. that are stored in the system databases, which are physically stored in shared storage. With AG, on the other hand, only the databases in the group are protected, and system databases cannot be added to an AG – only user databases are allowed. It is the database administrator’s responsibility to replicate changes to system objects on all AG replicas. This leaves the possibility of human error causing the database to become inaccessible to the application.

DTC feature support: If you’re using SQL Server 2012 or 2014, and your application uses Distributed Transaction Coordinator (DTC), you are not able to use an AG as it is not supported. Use FCI in this situation.

https://aws.amazon.com/blogs/storage/simplify-your-microsoft-sql-server-high-availability-deployments-using-amazon-fsx-for-windows-file-server/

Challenges with FCI in the Cloud

Of course, the challenge with building an FCI that spans availability zones is the lack of a shared storage device that is normally required when building a SQL Server FCI. Because the nodes of the cluster are distributed across multiple datacenters, a traditional SAN is not a viable option for shared storage. That leaves us with a two choices for cluster storage: 3rd party storage class resources like SIOS DataKeeper or the new Amazon FSx. Let’s take a look at what you need to know before you make your choice.

Service Level Agreement

As I wrote in how to calculate your application availability, your overall application SLA is only as good as your weakest link. In this case, the FSx SLA of 99.9%.

Normally 99.99% availability represents the starting point of what is considered “highly available”. This is what AWS promises you for your compute resources when two or more are deployed in different availability zones.

In case you didn’t know the difference between three nines and four nines…

99.9% availability allows for 43.83 minutes of downtime per month
99.99% availability allows for only 4.38 minutes of downtime per month

By hosting your cluster storage on FSx despite your 99.99% compute availability, your overall application availability will be 99.9%. In contrast, EBS volumes that span availability zones, such as in a DataKeeper deployment, qualifies for the 99.99% SLA at both the storage and compute layers, meaning your overall application availability is 99.99%.

Storage Location

When configuring FSx for high availability, you will want to enable multi-AZ support. By enabling multi-AZ you have an effectively have a “preferred” AZ and a “standby” AZ. When you deploy your SQL Server FCI nodes you will want to distribute those nodes across the same AZs.

Now in normal situations, you will want to make sure the active cluster node resides in the same AZ as the preferred FSx storage node. This is to minimize the distance and latency to your storage, but also to minimize the costs associated with data transfer across AZs. As specified in the FSx price guide, “Standard data transfer fees apply for inter-AZ or inter-region access to file systems.”

In the unfortunate circumstance where you have a SQL Server FCI failure, but not a FSx failure, there is no mechanism to tie both the storage and compute together. In the event that FSx fails over, it will automatically fail back to the primary availability zone. However, best practices dictate SQL FCI remain running on the secondary node until root cause analysis is performed and fail back is typically scheduled to occur during maintenance periods. This leaves you in a situation where your storage resides in a different AZ, which will incur additional costs. Currently the cost for transferring data across AZs, both ingress and egress, is $0.01/GB.

Without keeping a close eye on the state of your FSx and SQL Server FCI, you may not even be aware that they are running in different regions until you see the data transfer charge at the end of the month.

In contrast, in a configuration that use SIOS DataKeeper, the storage failover is part of the SQL Server FCI recovery, ensuring that the storage always fails over with the SQL Server instance. This ensures SQL Server is always reading and writing to the EBS volumes that are directly attached to the active node. Keep in mind, DataKeeper will incur a data transfer cost associated with write operations which are replicated between AZs or regions. This data transfer cost can be minimized with the use of compression available in DataKeeper.

Controlling Failover

In an FSx multi-subnet configuration there is a preferred availability zone and a standby availability. Should the FSx file server in the preferred availability zone experience a failure, the file server in the standby AZ will recover. AWS reports that this recovery time takes about 30 seconds with standard shares. With the use of continuously available file shares Microsoft reports this failover time can be closer to 15 seconds. During this failover time, there is a brownout that occurs where reads and writes are paused, but will continue once recovery completes.

FSx multi-site has automatic failback enabled, meaning that for every unplanned failover of FSx, you also have an unplanned failback. In contrast, typically when a SQL Server FCI experience an unplanned failover you would either just leave it running on the secondary or schedule a failback after hours or during the next maintenance period.

SQL Server Analysis Services Cluster Not Supported with FSx

If you want to cluster SSAS, you will need a clustered disk resource like SIOS DataKeeper. The How to Cluster SQL Server Analysis Server white paper clearly states that SMB cannot be used and that cluster drives with drive letters must be used. In contrast, the DataKeeper Volume resource presents itself as a clustered disk and can be used with SSAS.

Summary

While FSx certainly can make sense for typical SMB uses like Windows user files and other non-critical services where 99.9% availability SLA suffices, FSx is an excellent option If you application requires high availability (99.99%) or HA/DR solutions that also span regions, the SIOS DataKeeper is the right fit.

Step-by-Step: iSCSI Target Server Cluster in Azure

May 20, 2020 daveberm#WindowsAzure #Azure #Cloud, iSCSI Target3 Comments

I recently helped someone build an iSCSI target server cluster in Azure and realized that I never wrote a step-by-step guide for that particular configuration. So to remedy that, here are the step-by-step instructions in case you need to do this yourself.

Pre-requisites

I’m going to assume you are fairly familiar with Azure and Windows Server, so I’m going to spare you some of the details. Let’s assume you have at least done the following as a pre-requisite

Provision two servers (SQL1, SQL2) each in a different Availability Zone (Availability Set is also possible, but Availability Zones have a better SLA)
Assign static IP addresses to them through the Azure portal
Joined the servers to an existing domain
Enabled the Failover Clustering feature and the iSCSI Target Server feature on both nodes
Add three Azure Premium Disk to each node.
NOTE: this is optional, one disk is the minimum required. For increased IOPS we are going to stripe three Premium Azure Disks together in a storage pool and create a simple (RAID 0) virtual disk
SIOS DataKeeper is going to be used to provided the replicated storage used in the cluster. If you need DataKeeper you can request a trial here.

Create local Storage Pool

Once again, this step is completely optional, but for increased IOPS we are going to stripe together three Azure Premium Disks into a single Storage Pool. You might be tempted to use Dynamic Disk and a spanned volume instead, but don’t do that! If you use dynamic disks you will find out that there is some general incompatibility that will prevent you from creating iSCSI targets later.

Don’t worry, creating a local Storage Pool is pretty straight forward if you are aware of the pitfalls you might encounter as described below. The official documentation can be found here.

Pitfall #1 – although the documentation says the minimum size for a volume to be used in a storage pool is 4 GB, I found that the P1 Premium Disk (4GB) was NOT recognized. So in my lab I used 16GB P3 Premium Disks.

Pitfall #2 – you HAVE to have at least three disks to create a Storage Pool.

Pitfall #3 – create your Storage Pool before you create your cluster. If you try to do it after you create your cluster you are going to wind up with a big mess as Microsoft tries to create a clustered storage pool for you. We are NOT going to create a clustered storage pool, so avoid that mess by creating your Storage Pool before you create the cluster. If you have to add a Storage Pool after the cluster is created you will first have to evict the node from the cluster, then create the Storage Pool.

Based on the documentation found here, below are the screenshots that represent what you should see when you build your local Storage Pool on each of the two cluster nodes. Complete these steps on both servers BEFORE you build the cluster.

You should see the Primordial pool on both servers.

Right-click and choose New Storage Pool…

Choose Create a virtual disk when this wizard closes

Notice here you could create storage tiers if you decided to use a combination of Standard, Premium and Ultra SSD

For best performance use Simple storage layout (RAID 0). Don’t be concerned about reliability since Azure Managed Disks have triple redundancy on the backend. Simple is required for optimal performance.

For performance purposes use Fixed provisioning. You are already paying for the full Premium disk anyway, so no need not to use it all.

Now you will have a 45 GB X drive on your first node. Repeat this entire process for the second node.

Create your Cluster

Now that each server each have their own 45 GB X drive, we are going to create the basic cluster. Creating a cluster in Azure is best done via Powershell so that we can specify a static IP address. If you do it through the GUI you will soon realize that Azure assigns your cluster IP a duplicate IP address that you will have to clean up, so don’t do that!

Here is an example Powershell code to create a new cluster.

 New-Cluster -Name mycluster -NoStorage -StaticAddress 10.0.0.100 -Node sql1, sql2

The output will look something like this.

PS C:\Users\dave.DATAKEEPER> New-Cluster -Name mycluster -NoStorage -StaticAddress 10.0.0.100 -Node sql1, sql2
WARNING: There were issues while creating the clustered role that may prevent it from starting. For more information view the report file below.
WARNING: Report file location: C:\windows\cluster\Reports\Create Cluster Wizard mycluster on 2020.05.20 At 16.54.45.htm

Name     
----     
mycluster

The warning in the report will tell you that there is no witness. Because there is no shared storage in this cluster you will have to create either a Cloud Witness or a File Share Witness. I’m not going to walk you through that process as it is pretty well documented at those links.

Don’t put this off, go ahead and create the witness now before you move to the next step!

You now should have a basic 2-node cluster that looks something like this.

Configure a Load Balancer for the Cluster Core IP Address

Clusters in Azure are unique in that the Azure virtual network does not support gratuitous ARP. Don’t worry if you don’t know what that means, all you have to really know is that cluster IP addresses can’t be reached directly. Instead, you have to use an Azure Load Balancer, which redirects the client connection to the active cluster node.

There are two steps to getting a load balancer configured for a cluster in Azure. The first step is to create the load balancer. The second step is to update the cluster IP address so that it listens for the load balancer’s health probe and uses a 255.255.255.255 subnet mask which enables you to avoid IP address conflicts with the ILB.

We will first create a load balancer for the cluster core IP address. Later we will edit the load balancer to also address the iSCSI cluster resource IP address that we will be created at the end of this document.

Step 1 – Create a Standard Load Balancer

Notice that the static IP address we are using is the same address that we used to create the core cluster IP resource.

Once the load balancer is created you will edit the load balancer as shown below

Add the two cluster nodes to the backend pool

Add a health probe. In this example we use 59999 as the port. Remember that port, we will need it in the next step.

Create a new rue to redirect all HA ports, Make sure Floating IP is enabled.

Step 2 – Edit to cluster core IP address to work with the load balancer

As I mentioned earlier, there are two steps to getting the load balancer configured to work properly. Now that we have a load balancer, we have to run a Powershell script on one of the cluster nodes. The following is an example script that needs to be run on one of the cluster nodes.

$ClusterNetworkName = “Cluster Network 1” 
$IPResourceName = “Cluster IP Address” 
$ILBIP = “10.0.0.100” 
Import-Module FailoverClusters
Get-ClusterResource $IPResourceName | Set-ClusterParameter -Multiple @{Address=$ILBIP;ProbePort=59998;SubnetMask="255.255.255.255";Network=$ClusterNetworkName;EnableDhcp=0}

The important thing about the script above, besides getting all the variables correct for your environment, is making sure the ProbePort is set to the same port you defined in your load balancer settings for this particular IP address. You will see later that we will create a 2nd health probe for the iSCSI cluster IP resource that will use a different port. The other important thing is making sure you leave the subnet set at 255.255.255.255. It may look wrong, but that is what it needs to be set to.

After you run it the output should look like this.

 PS C:\Users\dave.DATAKEEPER> $ClusterNetworkName = “Cluster Network 1” 
$IPResourceName = “Cluster IP Address” 
$ILBIP = “10.0.0.100” 
Import-Module FailoverClusters
Get-ClusterResource $IPResourceName | Set-ClusterParameter -Multiple @{Address=$ILBIP;ProbePort=59999;SubnetMask="255.255.255.255";Network=$ClusterNetworkName;EnableDhcp=0}
WARNING: The properties were stored, but not all changes will take effect until Cluster IP Address is taken offline and then online again.

You will need to take the core cluster IP resource offline and bring it back online again before it will function properly with the load balancer.

Assuming you did everything right in creating your load balancer, your Server Manager on both servers should list your cluster as Online as shown below.

Check Server Manager on both cluster nodes. Your cluster should show as “Online” under Manageability.

Install DataKeeper

I won’t go through all the steps here, but basically at this point you are ready to install SIOS DataKeeper on both cluster nodes. It’s a pretty simple setup, just run the setup and choose all the defaults. If you run into any problems with DataKeeper it is usually one of two things. The first issue is the service account. You need to make sure the account you are using to run the DataKeeper service is in the Local Administrators Group on each node.

The second issue is in regards to firewalls. Although the DataKeeper install will update the local Windows Firewall automatically, if your network is locked down you will need to make sure the cluster nodes can communicate with each other across the required DataKeeper ports. In addition, you need to make sure the ILB health probe can reach your servers.

Once DataKeeper is installed you are ready to create your first DataKeeper job. Complete the following steps for each volume you want to replicate using the DataKeeper interface

Use the DataKeeper interface to connect to both servers

Click on create new job and give it a name

Click Yes to register the DataKeeper volume in the cluster

Once the volume is registered it will appear in Available Storage in Failover CLuster Manager

Create the iSCSI target server cluster

In this next step we will create the iSCSI target server role in our cluster. In an ideal world I would have a Powershell script that does all this for you, but for sake of time for now I’m just going to show you how to do it through the GUI. If you happen to write the Powershell code please feel free to share with the rest of us!

There is one problem with the GUI method. ou will wind up with a duplicate IP address in when the IP Resource is created, which will cause your cluster resource to fail until we fix it. I’ll walk through that process as well.

Go to the Properties of the failed IP Address resource and choose Static IP and select an IP address that is not in use on your network. Remember this address, we will use it in our next step when we update the load balancer.

You should now be able to bring the iSCSI cluster resource online.

Update load balancer for iSCSI target server cluster resource

Like I mentioned earlier, clients can’t connect directly to the cluster IP address (10.0.0.110) we just created for the iSCSI target server cluster. We will have to update the load balancer we created earlier as shown below.

Start by adding a new frontend IP address that uses the same IP address that the iSCSI Target cluster IP resource uses.

Add a second health probe on a different port. Remember this port number, we will use it again in the powershell script we run next

We add one more load balancing rule. Make sure to change the Frontend IP address and Health probe to use the ones we just created. Also make sure direct server return is enabled.

The final step to allow the load balancer to work is to run the following Powershell script on one of the cluster nodes. Make sure you use the new Healthprobe port, IP address and IP Resource name.

 $ClusterNetworkName = “Cluster Network 1” 
$IPResourceName = “IP Address 10.0.0.0” 
$ILBIP = “10.0.0.110” 
Import-Module FailoverClusters
Get-ClusterResource $IPResourceName | Set-ClusterParameter -Multiple @{Address=$ILBIP;ProbePort=59998;SubnetMask="255.255.255.255";Network=$ClusterNetworkName;EnableDhcp=0}

Your output should look like this.

 PS C:\Users\dave.DATAKEEPER> $ClusterNetworkName = “Cluster Network 1” 
$IPResourceName = “IP Address 10.0.0.0” 
$ILBIP = “10.0.0.110” 
Import-Module FailoverClusters
Get-ClusterResource $IPResourceName | Set-ClusterParameter -Multiple @{Address=$ILBIP;ProbePort=59998;SubnetMask="255.255.255.255";Network=$ClusterNetworkName;EnableDhcp=0}
WARNING: The properties were stored, but not all changes will take effect until IP Address 10.0.0.0 is taken offline and then online again.

Make sure to take the resource offline and online for the settings to take effect.

Create your clustered iSCSI targets

Before you begin, it is best to check to make sure Server Manager from BOTH servers can see the two cluster nodes, plus the two cluster name resources, and they both appear “Online” under manageability as shown below.

If either server has an issue querying either of those cluster names then the next steps will fail. If there is a problem I would double check all the steps you took to create the load balancer and the Powershell scripts you ran.

We are now ready to create our first clustered iSCSI targets. From either of the cluster nodes, follows the steps illustrated below as an example on how to create iSCSI targets.

Of course assign this to whichever server or servers will be connecting to this iSSI target.

And there you have it, you now have a functioning iSCSI target server in Azure.

If you build this leave a comment and me know how you plan to use it!

STEP-BY-STEP: HOW TO CONFIGURE A SQL SERVER 2008 R2 FAILOVER CLUSTER INSTANCE ON WINDOWS SERVER 2008 R2 IN AZURE OR AZURE STACK

April 19, 2019April 17, 2020 daveberm6 Comments

Intro

On July 9, 2019, support for SQL Server 2008 and 2008 R2 will end. That means the end of regular security updates. However, if you move those SQL Server instances to Azure or Azure Stack (I will simply refer to both as Azure for the rest of the guide), Microsoft will give you three years of Extended Security Updates at no additional charge. If you are currently running SQL Server 2008/2008 R2 and you are unable to update to a later version of SQL Server before the July 9th deadline, you will want to take advantage of this offer rather than running the risk of facing a future security vulnerability. An unpatched instance of SQL Server could lead to data loss, downtime or a devastating data breach.

One of the challenges you will face when running SQL Server 2008/2008 R2 in Azure is ensuring high availability. On premises you may be running a SQL Server Failover Cluster (FCI) instance for high availability, or possibly you are running SQL Server in a virtual machine and are relying on VMware HA or a Hyper-V cluster for availability. When moving to Azure, none of those options are available. Downtime in Azure is a very real possibility that you must take steps to mitigate.

In order to mitigate the possibility of downtime and qualify for Azure’s 99.95% or 99.99% SLA, you have to leverage SIOS DataKeeper. DataKeeper overcomes Azure’s lack of shared storage and allows you to build a SQL Server FCI in Azure that leverages the locally attached storage on each instance. SIOS DataKeeper not only supports SQL Server 2008 R2 and Windows Server 2008 R2 as documented in this guide, it supports any version of Windows Server, from 2008 R2 through Windows Server 2019 and any version of SQL Server from from SQL Server 2008 through SQL Server 2019.

This guide will walk through the process of creating a two-node SQL Server 2008 R2 Failover Cluster Instance (FCI) in Azure, running on Windows Server 2008 R2. Although SIOS DataKeeper also supports clusters that span Availability Zones or Regions, this guide assumes each node resides in the same Azure Region, but different Fault Domains. SIOS DataKeeper will be used in place of the shared storage normally required to create a SQL Server 2008 R2 FCI.

Pre-Requisites

Active Directory
This guide assumes you have an existing Active Directory Domain. You can manage your own Domain Controllers or use Azure Active Directory Domain Services. For this tutorial we will connect to a domain called contoso.local. Of course you will connect to your own domain when following this tutorial.

Open Firewall Ports
– SQL Server:1433 for Default Instance
– Load Balancer Health Probe: 59999
– DataKeeper: these firewall rules are added to the Windows host based firewall automatically during installation. For details on which ports are opened consult the SIOS documentation.
– Keep in mind, if you have any network based security in place that blocks ports between the cluster nodes you will need to account for these ports there as well.

DataKeeper Service Account
Create a Domain account. We will specify this account when we install DataKeeper. This account will need to be added to the Local Administrators group on each node of the cluster.

Create the first SQL Server Instance in Azure

This guide will leverage the SQL Server 2008R2SP3 on Windows Server 2008R2 image that is published in the Azure Marketplace.

When you provision the first instance you will have to create a new Availability Set. During this process be sure to increase the number of Fault Domains to 3. This allows the two cluster nodes and the file share witness each to reside in their own Fault Domain.

If you don’t already have a virtual network configured, allow the creation wizard to create a new one for you.

Once the instance is created, go in to the IP configurations and make the Private IP address static. This is required for SIOS DataKeeper and is best practice for clustered instances.

Make sure that your virtual network is configured to set the DNS server to be a local Windows AD controller to ensure you will be able to join the domain in a later step.

After the virtual machines are provisioned, add at least two additional disks to each instance. Premium or Ultra SSD are recommended. Disable caching on the disks used for the SQL log files. Enable read-only caching on the disk used for the SQL data files. Refer to Performance guidelines for SQL Server in Azure Virtual Machines for additional information on storage best practices.

Create the 2nd SQL Server Instance in Azure

Follow the same steps as above, except be sure to place this instance in the same virtual network and Availability Set that you created with the 1st instance.

Create a File Share Witness (FSW) Instance

In order for the Windows Server Failover Cluster (WSFC) to work optimally you are required to create another Windows Server instance and place it in the same Availability Set as the SQL Server instances. By placing it in the same Availability Set you ensure that each cluster node and the FSW reside in different Fault Domains, ensuring your cluster stays on line should an entire Fault Domain go off line. This instances does not require SQL Server, it can be a simple Windows Server as all it needs to do is host a simple file share.

This instance will host the file share witness required by WSFC. This instance does not need to be the same size, nor does it require any additional disks to be attached. It’s only purpose is to host a simple file share. It can in fact be used for other purposes. In my lab environment my FSW is also my domain controller.

Uninstall SQL Server 2008 R2

Each of the two SQL Server instances provisioned already have SQL Server 2008 R2 installed on them. However, they are installed as standalone SQL Server instances, not clustered instances. SQL Server must be uninstalled from each of these instances before we can install the cluster instance. The easiest way to do that is to run the SQL Setup as shown below.

When you run setup.exe /Action-RunDiscovery you will see everything that is preinstalled

setup.exe /Action=RunDiscovery

Running setup.exe /Action=Uninstall /FEATURES=SQL,AS,RS,IS,Tools /INSTANCENAME=MSSQLSERVER kicks off the uninstall process

setup.exe /Action=Uninstall /FEATURES=SQL,AS,RS,IS,Tools /INSTANCENAME=MSSQLSERVER

Running setup.exe /Action-RunDiscovery confirms the uninstallation completed

setup.exe /Action-RunDiscovery

Run this uninstallation process again on the 2nd instance.

Add instances to the Domain

All three of these instances will need to be added to a Windows Domain. As mentioned in the Prerequisites section, you must have access to join an existing Windows Active Directory. In our case, we are joining a domain called contoso.local.

Add Windows Failover Clustering Feature

The Failover Clustering Feature needs to be added to the two SQL Server instances

Add-WindowsFeature Failover-Clustering

Install Convenience Rollup Update for Windows Server 2008 R2 SP1

There is a critical update ( kb2854082) that is required in order to configure a Windows Server 2008 R2 instance in Azure. That update and many more are included in the Convenience Rollup Update for Windows Server 2008 R2 SP1. Install this update on each of the two SQL Server instances.

Format the Storage

The additional disks that were attached when the two SQL Server instances were provisioned need to be formatted. Do the following for each volume on each instance.

Microsoft best practices says the following…

“NTFS allocation unit size: When formatting the data disk, it is recommended that you use a 64-KB allocation unit size for data and log files as well as TempDB.”

Run Cluster Validation

Run cluster validation to ensure everything is ready to be clustered.

Import-Module FailoverClusters
Test-Cluster -Node "SQL1", "SQL2"

Your report will contain WARNINGS about Storage and Networking. You can ignore those warnings as we know there are no shared disks and only a single network connection exists between the servers. You may also receive a warning about network binding order which can also be ignored. If you encounter any ERRORS you must address those before you continue.

Since there are no “Potential Cluster DIsks” available, the first test throws a warning and all the subsequent disks test are skipped. This is expected since we will be using just local disks replicated with SIOS DataKeeper.

The Validate Network Communication tests warn about just a single network being available between cluster nodes. You can ignore this warning since the network redundancy is handled at the virtual layer by Azure.

Error trying to run Cluster Validation?

I have encountered this error on a few occasions and I’m still trying to sort out under what conditions this occurs. Occasionally you will find that test-cluster fails to run as described in the forum post.

Test-Cluster
Unable to Validate a Cluster Configuration. The operation has failed. The action validate a configuration did not complete
There is an error in XML document (5, 73).  

Attempt by method

Microsoft.Xml.Serialzation.GeneratedAssembly.XmlSerialzationReaderClusterPrep.Config.Read4_As...Bolean) to access method

MS.Internal.ServerClusters.Validation.TestAssemblyCollection.Add(MS.Internal.ServerClusters.V....Failed

If this happens to you, I have found the following fix recommended in the forum post works for me.

Inside C:\Windows\System32\WindowsPowerShell\v1.0 make a copy of powershell_ise.exe.config file (make a copy inside C:\Windows\System32\WindowsPowerShell\v1.0)- rename it to powershell.exe.config

Open it with notepad- delete current config line and paste:
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <system.xml.serialization>
    <xmlSerializer useLegacySerializerGeneration="true"/>
  </system.xml.serialization>
</configuration>
- save and run test-cluster

While this fix will allow you to run test-cluster from Powershell, I have found that running Validate through the GUI still throws an error, even with this fix. I have a query in to Microsoft to see if they have a solution, but for now if you need to run cluster Validation you may have to use Test-Cluster in Powershell.

Create the Cluster

Best practices for creating a cluster in Azure would be to use Powershell to create a cluster, specifying a static IP address. Powershell allows us to specify a Static IP Address, whereas the GUI method does not. Unfortunately, Azure’s implementation of DHCP does not work well with WSFC, so if you use the GUI method you will wind up with a duplicate IP address as the Cluster IP Address that will need to be fixed before the cluster is usable.

However, what I have found is that the typical New-Cluster powershell command with the -StaticAddress command doesn’t work. To avoid the problem of the duplicate IP address, we have to resort to the cluster.exe utility and run the following command.

cluster /cluster:cluster1 /create /nodes:"sql1 sql2" /ipaddress:10.0.0.100/255.255.255.0

Add the File Share Witness

Next we need to add the File Share Witness. On the 3rd server we provisioned as the FSW, create a folder and share it as shown below. You will need to grant the Cluster Name Object (CNO) read/write permissions at both the Share and Security levels as shown below.

Once the share is created, run the Configure Cluster Quorum wizard on one of the cluster nodes and follow the steps illustrated below.

Install DataKeeper

Install DataKeeper on each of the two SQL Server cluster nodes as shown below.

This is where we will specify the Domain account we added to each of the local Domain Administrators group.

Configure DataKeeper

Once DataKeeper is installed on each of the two cluster nodes you are ready to configure DataKeeper.

NOTE – The most common error encountered in the following steps is security related, most often by pre-existing Azure Security groups blocking required ports. Please refer to the SIOS documentation to ensure the servers can communicate over the required ports.

First you must connect to each of the two nodes.

If everything is configured properly, you should then see the following in the Server Overview report.

Next, create a New Job and follow the steps illustrated below

Choose Yes here to register the DataKeeper Volume resource in Available Storage

Complete the above steps for each of the volumes. Once you are finished, you should see the following in the WSFC UI.

You are now ready to install SQL Server into the cluster.

NOTE – At this point the replicated volume is only accessible on the node that is currently hosting Available Storage. That is expected, so don’t worry!

Install SQL Server on the first node

If you want to script the installation, I have included the example below of a scripted cluster installation of SQL Server 2008 R2 into the first node of cluster. The script to add a node to existing cluster is found further down in the guide.

Of course adjust for your environment.

c:\SQLServerFull\setup.exe /q /ACTION=InstallFailoverCluster /FEATURES=SQL /INSTANCENAME="MSSQLSERVER" /INSTANCEDIR="C:\Program Files\Microsoft SQL Server" /INSTALLSHAREDDIR="C:\Program Files\Microsoft SQL Server" /SQLSVCACCOUNT="contoso\admin" /SQLSVCPASSWORD="xxxxxxxxx" /AGTSVCACCOUNT="contoso\admin" /AGTSVCPASSWORD="xxxxxxxxx" /SQLDOMAINGROUP="contoso\SQLAdmins" /AGTDOMAINGROUP="contoso\SQLAdmins" /SQLCOLLATION="SQL_Latin1_General_CP1_CI_AS" /FAILOVERCLUSTERGROUP="SQL Server 2008 R2 Group" /FAILOVERCLUSTERDISKS="DataKeeper Volume E" "DataKeeper Volume F" /FAILOVERCLUSTERIPADDRESSES="IPv4;10.0.0.101;Cluster Network 1;255.255.255.0" /FAILOVERCLUSTERNETWORKNAME="SQL2008Cluster" /SQLSYSADMINACCOUNTS="contoso\admin" /SQLUSERDBLOGDIR="E:\MSSQL10.MSSQLSERVER\MSSQL\Log" /SQLTEMPDBLOGDIR="F:\MSSQL10.MSSQLSERVER\MSSQL\Log" /INSTALLSQLDATADIR="F:\MSSQL10.MSSQLSERVER\MSSQLSERVER" /IAcceptSQLServerLicenseTerms

If you prefer to use the GUI, just follow along with the screenshots below.

On the first node, run the SQL Server setup.

Choose New SQL Server Failover Cluster Installation and follow the steps as illustrated.

Choose only the options you need.

Please note, this document assumes you are using the Default instance of SQL Server. If you use a Named Instance you need to make sure you lock down the port that it listens on, and use that port later on when you configure the load balancer. You also will need to create a load balancer rule for the SQL Server Browser Service (UDP 1434) in order to connect to a Named Instance. Neither of those two requirements are covered in this guide, but if you require a Named Instance it will work if you do those two additional steps.

Here you will need to specify an unused IP address

Go to the Data Directories tab and relocate data and log files. At the end of this guide we talk about relocating tempdb to a non-mirrored DataKeeper Volume for optimal performance. For now, just keep it on one of the clustered disks.

Install SQL Server on the second node

Below is an example of the command you can run to add an additional SQL Server 2008 R2 node into an existing cluster.

c:\SQLServerFull\setup.exe /q /ACTION=AddNode /INSTANCENAME="MSSQLSERVER" /SQLSVCACCOUNT="contoso\admin" /SQLSVCPASSWORD="xxxxxxxxx" /AGTSVCACCOUNT="contoso\admin" /AGTSVCPASSWORD="xxxxxxxx" /IAcceptSQLServerLicenseTerms

If you prefer using the GUI, follow along with the following screenshots.

Run the SQL Server setup again on the second node and choose Add node to a SQL Server Failover Cluster.

Congratulations, you are almost done! However, due to Azure’s lack of support for gratuitous ARP, we will need to configure an Internal Load Balancer (ILB) to assist with client redirection as shown in the following steps.

Update the SQL Cluster IP Address

In order for the ILB to function properly, you must run run the following command from one of the cluster nodes. It SQL Cluster IP enables the SQL Cluster IP address to respond to the ILB health probe while also setting the subnet mask to 255.255.255.255 in order to avoid IP address conflicts with the health probe.

cluster res <IPResourceName> /priv enabledhcp=0 address=<ILBIP> probeport=59999  subnetmask=255.255.255.255

NOTE – I don’t know if it is a fluke, but on occasion I have run this command and it looks like it runs, but it doesn’t complete the job and I have to run it again. The way I can tell if it worked is by looking at the Subnet Mask of the SQL Server IP Resource, if it is not 255.255.255.255 then you know it didn’t run successfully. It may simple be a GUI refresh issue, so you can also try restarting the cluster GUI to verify the subnet mask was updated.

After it runs successfully, take the resource offline and bring it back online for the changes to take effect.

Create the Load Balancer

The final step is to create the load balancer. In this case we are assuming you are running the Default Instance of SQL Server, listening on port 1433.

The Private IP Address you define when you Create the load balancer will be the exact same address your SQL Server FCI uses.

Add just the two SQL Server instances to the backend pool. Do NOT add the FSW to the backend pool.

In this load balancing rule you must enable Floating IP

Validate the Cluster

Before you continue, run cluster validation one more time. The Cluster Validation report should return just the same network and storage warnings that it did the first time you ran it. Assuming there are no new errors or warnings, your cluster is configured correctly.

Edit sqlserv.exe Config File

In directory C:\Program Files (x86)\Microsoft SQL Server\100\Tools\Binn we created a sqlps.exe.config file and sqlservr.exe.config with the following lines in the config file:

<configuration>
  <startup>
    <supportedRuntime version="v2.0.50727"/>
  </startup>
</configuration>

These files, by default, will not exist and may be created. If this file(s) already exists for your installation, the <supportedRuntime version=”v2.0.50727″/> line simply needs to be placed with the <startup>…</startup> sub-section of the <configuration>…</configuration> section. This should be done on both servers.

Test the Cluster

The most simple test is to open SQL Server Management Studio on the passive node and connect to the cluster. If you are able to connect, congratulations, you did everything correct! If you can’t connect don’t fear, you wouldn’t be the first person to make a mistake. I wrote a blog article to help troubleshoot the issue. Managing the cluster is exactly the same as managing a traditional shared storage cluster. Everything is controlled through Failover Cluster Manager.

Optional – Relocate Tempdb

For optimal performance it would be advisable to move tempdb to the local, non replicated, SSD. However, SQL Server 2008 R2 requires tempdb to be on a clustered disk. SIOS has a solution called a Non-Mirrored Volume Resource which addresses this issue. It would be advisable to create a non-mirrored volume resource of the local SSD drive and move tempdb there. However, the local SSD drive is non-persistent, so you must take care to ensure the folder holding tempdb and the permissions on that folder are recreated each time the server reboots.

After you create the Non-Mirrored Volume Resource of the local SSD, follow the steps in this article to relocate tempdb. The startup script described in that article must be added to each cluster node.

For More Information

As always, if you have questions or comments you can leave them in the comment section below or reach me on Twitter @daveberm

Achieving SQL Server HA/DR with a mix of Always On Availability Groups and SANless SQL Server Failover Cluster Instances

March 13, 2019March 15, 2019 daveberm6 Comments

Introduction

The topic of mixing SQL Server Failover Cluster Instances (FCI) with Always On Availability Groups (AG) is pretty well documented. However, most of the available documentation documents configurations that assume the SQL Server FCI portion of the solution utilizes shared storage. What if I want to build a SANless SQL Server FCI using Storage Spaces Direct (S2D), can I still add a SQL Server AG to the mix? Unfortunately, the answer to this question is no. As of today, this combination of S2D based SQL Server FCI and Always On AG is not supported. I previously blogged about this S2D limitation here.

However, the good news is you CAN build a SANless SQL Server FCI with SIOS DataKeeper and still leverage Always On AG for things like readable secondaries. You still have to abide by the same rules that apply when mixing traditional SAN based SQL Server FCI and Always On AGs, but other than that it is exactly the same….mostly.

DataKeeper Synchronous replication is commonly used between nodes in the same data center or cloud region, but you may want to replicate asynchronously to an additional node in a different region for disaster recovery. In this case, if you ever do have to bring the DR node online after an unexpected failure, you will have to scrap the Always On AG configurations and reconfigure them. This requirement is very similar to to what Microsoft published here in regards to restoring asynchronous snapshots of SQL Server Always On AGs running inside VMs.

Availability Groups

Essentially, a SANLess SQL Server FCI w/DataKeeper looks like a single instance of SQL Server as far as the Always On Availability Group Wizard is concerned. The configuration of the Always On AG is exactly the same as if you were creating just an Always On AG between two Standalone (non-clustered) SQL Server instances.

The real confusion arise in the fact that in this configuration all the servers reside in the same failover cluster, but the SQL Server FCI is only configured to run only on the cluster nodes where SQL Server was installed as a Clustered SQL Server Instance. The other nodes are in the same cluster, but SQL is installed on those nodes as a Standalone SQL Server Instance, not a Clustered Instance. It’s a bit confusing, but what is happening is that Always On AG’s leverage the WSFC quorum model and listeners, so all the AG Replicas need to reside in the same WSFC, even though they typically do not run clustered instances of SQL Server. If you are completely confused that is okay, most people are confused when they first try to wrap their head around this hybrid configuration.

The real benefit in a configuration like this is that a SQL Server FCI can be a better and more cost effective (more on this later*) HA solution than Always On AG in many circumstances, but it lacks the ability to offer a readable secondary replica. Adding an Always On AG readable secondary replica becomes a viable option to address this need. And using SIOS DataKeeper eliminates the need for a SAN for the SQL Server FCI, which opens up the possibility of configuring SQL Server FCIs where nodes reside in different data center, which also means support for SQL Server FCI’s that span Availability Zones in both Azure and AWS.

Please note that pictured below is just one possible configuration. Multiple FCI cluster nodes, multiple AGs and multiple Replicas are all supported. You are only limited by the limits imposed by your version of SQL Server.

This article seems to document the setup steps pretty well. Of course, instead of shared storage for the SQL FCI, you will use SIOS DataKeeper to build the FCI as I document here.

Image result for SQL Server FCI with Availability Groups

Basic Availability Groups

As of SQL Server 2016 a scaled down “Basic Availability Groups” became available in SQL Server Standard Edition, making this configuration possible even in SQL Server Standard Edition. Basic AGs are limited to a single database per Availability Group, a Single Replica (2-nodes). However, they do not support a readable secondary replica so their use cases in this hybrid configuration are very limited.

Distributed Availability Groups

Distributed AGs were introduced in SQL Server 2016 are also supported in this hybrid configuration. Distributed AGs are very similar to regular AGs, but the Replicas do not need to reside in the same cluster, or even in the same Windows Domain. Microsoft documents the the main use cases of Distributed Availability Groups as follows:

Disaster recovery and easier multi-site configurations
Migration to new hardware or configurations, which might include using new hardware or changing the underlying operating systems
Increasing the number of readable replicas beyond eight in a single availability group by spanning multiple availability groups

Image result for distributed availability groups

Summary

If you like the idea of SQL Server FCIs for high availability, but want the flexibility of read-only secondary replicas, this hybrid solution might just be the thing you are looking for. Traditional SAN baseds SQL Server FCIs, and even Storage Spaces Direct (S2D) based FCIs, limit you to a single data center. SIOS DataKeeper frees you from the limits of your SAN and enables configurations such as SQL Server FCI that span Availability Zones or Cloud Regions. It also eliminates the reliance on the SAN, allowing you to leverage locally attached high speed storage devices without giving up your SQL Server FCI.

* How to Save Money

Earlier I promised I would tell you how to save money by doing this all with SQL Server Standard Edition. If you can live with readable replicas that are point in time based snapshots, you can skip Always On AGs completely and just use the SIOS DataKeeper target side snapshot feature to periodically take an application consistent snapshot of the volumes on the target server without impacting ongoing replication or availability. Here’s how…

http://discover.us.sios.com/rs/siostechnology/images/10-Ways-Save-AlwaysOn-vs-Failover-Clustering.pdf

Create a 2-node SQL Server FCI with SQL Server Standard Edition and save a boatload of money on SQL licenses, but yet still replicate the data to a 3rd node outside the cluster for reporting or DR purposes. If you take a snapshot of the volumes on this third server these snapshots are read-right accessible, so you can mount those databases from a standalone instance of SQL Server to run month end reports, copy to archives, or you might even want to use those snapshot to quickly and easily update your QA and Test/Dev environments with the latest SQL data.

I hope you found this helpful and informative. As always, if you have questions, add them here or reach me on Twitter @daveberm