Azure Outage Post-Mortem – Part 1

September 11, 2018September 11, 2018 davebermLeave a comment

The first official Post-Mortems are starting to come out of Microsoft in regards to the Azure Outage that happened last week. While this first post-mortem addresses the Azure DevOps outage specifically (previously known as Visual Studio Team Service, or VSTS), it gives us some additional insight into the breadth and depth of the outage, confirms the cause of the outage, and gives us some insight into the challenges Microsoft faced in getting things back online quickly. It also hints at some some features/functionality Microsoft may consider pursuing to handle this situation better in the future.

As I mentioned in my previous article, features such as the new Availability Zones being rolled out in Azure, might have minimized the impact of this outage. In the post-mortem, Microsoft confirms what I previously said.

The primary solution we are pursuing to improve handling datacenter failures is Availability Zones, and we are exploring the feasibility of asynchronous replication.

Until Availability Zones are rolled out across more regions the only disaster recovery options you have are cross-region, hybrid-cloud or even cross-cloud asynchronous replication. Software based #SANless clustering solutions available today will enable such configurations, providing a very robust RTO and RPO, even when replicating great distances.

When you use SaaS/PaaS solutions you are really depending on the Cloud Service Provider (CSPs) to have an iron clad HA/DR solution in place. In this case, it seems as if a pretty significant deficiency was exposed and we can only hope that it leads all CSPs to take a hard look at their SaaS/PaaS offerings and address any HA/DR gaps that might exist. Until then, it is incumbent upon the consumer to understand the risks and do what they can to mitigate the risks of extended outages, or just choose not to use PaaS/SaaS until the risks are addressed.

The post-mortem really gets to the root of the issue…what do you value more, RTO or RPO?

I fundamentally do not want to decide for customers whether or not to accept data loss. I’ve had customers tell me they would take data loss to get a large team productive again quickly, and other customers have told me they do not want any data loss and would wait on recovery for however long that took.

It will be impossible for a CSP to make that decision for a customer. I can’t see a CSP ever deciding to lose customer data, unless the original data is just completely lost and unrecoverable. In that case, a near real-time async replica is about as good as you are going to get in terms of RPO in an unexpected failure.

However, was this outage really unexpected and without warning? Modern satellite imagery and improvements in weather forecasting probably gave fair warning that there was going to be significant weather related events in the area.

With hurricane Florence bearing down on the Southeast US as I write this post, I certainly hope if your data center is in the path of the hurricane you are taking proactive measures to gracefully move your workloads out of the impacted region. The benefit of a proactive disaster recovery vs a reactive disaster recovery are numerous, including no data loss, ample time to address unexpected issues, and managing human resources such that employees can worry about taking care of their families, rather than spending the night at a keyboard trying to put the pieces back together again.

Again, enacting a proactive disaster recovery would be a hard decision for a CSP to make on behalf of all their customers, as planned migrations across regions will incur some amount of downtime. This decision will have to be put in the hands of the customer.

Slide 2.png — Hurricane Florence Satellite Image taken from the new GOES-16 Satellite, courtesy of Tropical Tidbits

So what can you do to protect your business critical applications and data? As I discussed in my previous article, cross-region, cross-cloud or hybrid-cloud models with software based #SANless cluster solutions are going to go a long way to address your HA/DR concerns, with an excellent RTO and RPO for cloud based IaaS deployments. Instead of application specific solutions, software based, block level volume replication solutions such SIOS DataKeeper and SIOS Protection Suite replicate all data, providing a data protection solution for both Linux and Windows platforms.

My oldest son just started his undergrad degree in Meteorology at Rutgers University. Can you imagine a day when artificial intelligence (AI) and machine learning (ML) will be used to consume weather related data from NOAA to trigger a planned disaster recovery migration, two days before the storm strikes? I think I just found a perfect topic for his Master’s thesis. Or better yet, have him and his smart friends at the WeatherWatcher LLC get funding for a tech startup that applies AI and ML to weather related data to control proactive disaster recovery events.

I think we are just at the cusp of IT analytics solutions that apply advanced machine-learning technology to cut the time and effort you need to ensure delivery of your critical application services. SIOS iQ is one of the solutions leading the way in that field.

Batten down the hatches and get ready, Hurricane season is just starting and we are already in for a wild ride. If you would like to discuss your HA/DR strategy reach out to me on Twitter @daveberm.

Lightning Never Strikes Twice: Surviving the #Azure Cloud Outage

September 5, 2018 daveberm5 Comments

Yesterday morning I opened my Twitter feed to find that many people were impacted by an Azure outage. When I tried to access the resource page that described the outage and the current resources impacted even that page was unavailable. @AzureSupport was providing updates via Twitter.

The original update from @AzureSupport came in at 7:12 AM EDT

Azure Outage 2

Looking back on the Twitter feed it seems as if the problem initially began an hour or two before that.

Azure Support 10

It quickly became apparent that the outages had a wider spread impact than just the SOUTH CENTRAL US region as originally reported. It seems as if services that relied on Azure Active Directory could have been impacted as well and customers trying to provision new subscriptions were having issues.

Azure 11

And 24 hours later the problem has not been completely resolved and it according to the last update this morning…

Azure Outage 1

Untitled design (6)

So what could you have done to minimize the impact of this outage? No one can blame Microsoft for a natural disaster such as a lightning strike. But at the end of the day if your only disaster recovery plan is to call, tweet and email Microsoft until the issue is resolved, you just received a rude awakening. IT IS UP TO YOU to ensure you have covered all the bases when it comes to your disaster recovery plan.

While the dust is still settling on exactly what was impacted and what customers could have done to minimize the downtime, here are some of my initial thoughts.

Availability Sets (Fault Domains/Update Domains) – In this scenario, even if you built Failover Clusters, or leveraged Azure Load Balancers and Availability Sets, it seems the entire region went offline so you still would have been out of luck. While it is still recommended to leverage Availability Sets, especially for planned downtime, in this case you still would have been offline.

Availability Zones – While not available in the SOUTH CENTRAL US region yet, it seems that the concept of Availability Zones being rolled out in Azure could have minimized the impact of the outage. Assuming the lightning strike only impacted one datacenter, the other datacenter in the other Availability Zone should have remained operational. However, the outages of the other non-regional services such as Azure Active Directory (AAD) seems to have impacted multiple regions, so I don’t think Availability Zones would have isolated you completely.

Global Load Balancers, Cross Region Failover Clusters, etc. – Whether you are building SANLess clusters that cross regions, or using global load balancers to spread the load across multiple regions, you may have minimized the impact of the outage in SOUTH CENTRAL US, but you may have still been susceptible to the AAD outage.

Hybrid-Cloud, Cross Cloud – About the only way you could guarantee resiliency in a cloud wide failure scenario such as the one Azure just experienced is to have a DR plan that includes having realtime replication of data to a target outside of your primary cloud provider and a plan in place to bring applications online quickly in this other location. These two locations should be entirely independent and should not rely on services from your primary location to be available, such as AAD. The DR location could be another cloud provider, in this case AWS or Google Cloud Platform seem like logical alternatives, or it could be your own datacenter, but that kind of defeats the purpose of running in the cloud in the first place.

Software as a Service – While Software as service such as Azure Active Directory (ADD), Azure SQL Database (Database-as-Service) or one of the many SaaS offerings from any of the cloud providers can seem enticing, you really need to plan for the worst case scenario. Because you are trusting a business critical application to a single vendor you may have very little control in terms of DR options that includes recovery OUTSIDE of the current cloud service provider. I don’t have any words of wisdom here other than investigate your DR options before implementing any SaaS service, and if recovery outside of the cloud is not an option than think long and hard before you sign-up for that service. Minimally make the business stake owners aware that if the cloud service provider has a really bad day and that service is offline there may be nothing you can do about it other than call and complain.

I think in the very near future you will start to hear more and more about cross cloud availability and people leveraging solutions like SIOS DataKeeper to build robust HA and DR strategies that cross cloud providers. Truly cross cloud or hybrid cloud models are the only way to truly insulate yourself from most conceivable cloud outages.

If you were impacted from this latest outage I’d love to hear from you. Tell me what went down, how long you were down, and what you did to recover. What are you planning to do so that in the future your experience is better?

“Incomplete Communication with Cluster” with local Storage Space for SQL Server cluster

August 21, 2018August 21, 2018 davebermLeave a comment

When building a SANless SQL Server cluster with SIOS DataKeeper, or when configuring Always On Availability Groups for SQL Server, you may consider striping together multiple disk in a Simple Storage Space (RAID 0) for performance. This is very commonly done in the cloud where each instance typically his backed by hardware resiliency, so RAID 0 is not really all that risky.

For instance, I had a recent customer in AWS that wanted to max out his IOPS to 80,000, the maximum IOPS currently available to a single instance. Now keep in mind, only the largest EBS optimized instance sizes supports 80,000 IOPS, so you want to make sure you know what maximum IOPS your particular instance size supports.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html

In this case we had ac5.18xlarge instance which does support 80,000 IOPS. However, any individual EBS Provisioned IOPS volume only supports up to 32,000 IOPS. The only way to achieve 80,000 IOPS when writing to any single volume is to strip three of these volumes together in a Simple Storage Space.

Herein lies the rub, if you try to do that in an existing cluster things are going to go haywire pretty fast. Fellow MVP Joey D’Antoni recently blogged about the issue and it appears to still be an issue in the Windows Server 2019 preview.

Just as Joey suggests, I always advise my customers to build out the nodes and any Storage Spaces BEFORE they start the clustering process. This makes the process go much smoother. It also allows the customer to have some time to benchmark the server’s performance before they add any replication, to ensure everything is working as expected.

Help! I can’t connect to my SQL Server multi-subnet failover cluster

July 27, 2018 davebermLeave a comment

I get that kind of call or email from customers all the time. I have a generic response as follows…

This has everything you need to know.

https://docs.microsoft.com/en-us/sql/sql-server/failover-clusters/windows/sql-server-multi-subnet-clustering-sql-server?view=sql-server-2017

They don’t go into great detail about what to do if your connection does not support multisubnetfailover=true. If your connection does NOT support that parameter, then set registerallprovidersip to false and cleanup DNS. That procedure is described best here.

https://blogs.msdn.microsoft.com/sambetts/2014/02/04/multi-subnet-clustered-sql-registerallprovidersip-sharepoint-2013/

I figure I get this question often enough I probably should just flesh out my response a bit, hence the reason for this post.

In general people just aren’t aware of how multi-subnet failover clusters work. Multi-subnet failover clustering support was added in Windows Server 2012 with the addition of the “OR” technology when defining cluster resource dependencies. This allowed people to allow a Cluster Name resource to be dependent upon IP Address x.x.x.x OR IP Address y.y.y.y.

x.x.x.x would be an a cluster IP resource valid in Subnet A and y.y.y.y would be a cluster IP address valid in Subnet B. Only one address will be online at any given time, whichever address was valid for the subnet the resource was currently running on.

Microsoft SQL Server started supporting this concept starting with SQL Server 2012 with both failover cluster instances (FCI) using 3-party SANless clustering solutions like SIOS DataKeeper and SQL Server Always On Availability Groups.

By default if you create a SQL Server multi-subnet failover cluster the cluster should be automatically configured optimally, including setting up the two IP addresses, adding two A records to DNS and setting the registerallprovidersIP to true. However, on the client end you need to tell it that you are connecting to a multi-subnet failover cluster, otherwise the connection won’t be made.

Configuring the client

Configuring the client is done by adding multisubnetfailover=true to the connection string. This Microsoft documentation is a great resource, but if you just search for multisubnetfailover=true you will find a lot of information about that setting.

However, not every application will support adding that to the connection string. If you find yourself in that situation you should ask your application vendor to add support for that or show you how to do it.

However, all is not lost if you find yourself in that situation. You will want to change the behavior of the cluster so that upon failover DNS is update so that the single A record associated with the cluster client access point is updated with the new IP address. This is in lieu of having two A records in DNS, one with each cluster IP address, which is the default behavior in an multi-subnet cluster.

This article reference SharePoint, you can ignore that, the rest of the article is pretty well written to describe the process you should follow.

https://blogs.msdn.microsoft.com/sambetts/2014/02/04/multi-subnet-clustered-sql-registerallprovidersip-sharepoint-2013/

The highlights of that article are as follows…

Get-ClusterResource “[Network Name]” | Set-ClusterParameter RegisterAllProvidersIP 0

After restarting the cluster-name-object (basically restarting the role) & cleaning up all “A” records manually (clean-up isn’t done automatically) we can see our old A-records are still in DNS so we’ll need to delete those manually.

In addition to those steps I’d advise you to reduce the TTL on the HostRecordTTL as described in this article.

https://docs.microsoft.com/en-us/powershell/module/failoverclusters/set-clusterparameter?view=win10-ps

The highlight of that article is as follows.

PS C:\> Get-ClusterResource -Name cluster1FS | Set-ClusterParameter -Name HostRecordTTL -Value 300

With a Value of 300 you could potentially be waiting up to 5 minutes for your clients to reconnect after a failover, or even longer if if have a large Active Directory infrastructure and AD replication takes some time to update all the DNS servers across your infrastructure.

You are going to want to figure out what the optimal TTL is to facilitate quick client reconnections without over burdening your DNS servers with a bunch of DNS Lookup requests.

This type of configuration is common in disaster recovery configurations where your DR site is in a different subnet. It is also very common in HA deployments in AWS because different Availability Zones are in different subnets.

Let me know if you have any questions. You can always reach me on Twitter

SQL Server 2017 on Linux Availability Group Split Brain Problem

July 23, 2018July 25, 2018 davebermLeave a comment

On July 18th, 2018 Microsoft published this support article with some guidance to help avoid Split Brain when using Availability Groups with SQL Server on Linux.

https://support.microsoft.com/en-us/help/4341219/split-brain-occurs-after-failover-when-using-alwayson-ags-with-externa

Running SQL Server on Linux can have some advantages, including cost savings on the OS if running in Azure. Run the numbers yourself, as the number of cores go up your cost savings year over year can be substantial, considering you are licensing at least two servers for every cluster pair.

https://azure.microsoft.com/en-us/pricing/calculator/

However, why bother saving money if the technology is not rock solid? One of the biggest issues I see with running SQL Server on Linux is the lack of a cohesive HA/DR story. On Windows, Microsoft owns the whole HA stack and SQL Server relies heavily on Windows Server Failover Clustering to support both Availability Groups and Failover Cluster Instances. This has been running well for many years and has a long track record of success stories.

When moving to Linux, Microsoft no longer owns the HA stack at the OS level and depending upon your distro of Linux, you are left trying to piece together open source solutions like Pacemaker, trying to get things to cooperate with SQL Server Availability Groups.

While you may eventually get it to work, I would much rather look to a 3rd party high availability solution like the SIOS Protection Suite for Linux (SPS-L), giving you a tried and true HA solution for your business critical applications running on Linux.

SPS-L has been protecting business critical applications running on Linux since 1999. It is a full HA/DR solution that monitors and recovers the entire application stack as well as the physical servers and network to ensure your business critical applications are highly available while also maintaining a 3rd copy for disaster recover in a remote datacenter or different geographic region of the cloud.

The other benefit of SPS-L is that it doesn’t require the Enterprise Edition of SQL Server, so there can be a significant cost savings advantage on SQL Server licenses as well. If you consider SQL Server Standard Edition costs $1859 per core vs $7128 per core for SQL Server Enterprise Edition, the cost savings advantage can be significant, depending upon how many cores you need to license.

Below is a video demonstration of SPS-L protecting SQL Server running on Linux in the Azure Cloud. The demonstration shows a SQL Server Standard Edition Cluster being manually failed over between nodes in different Azure Fault Domains as well as SPS-L responding to an unexpected failure.

Cluster Quorum File Share Witness on a USB stick?

July 15, 2018 davebermLeave a comment

I’m very excited to hear that coming in Windows Server 2019 there will be a few new features in regards to the File Share Witness for the Failover Cluster Quorum. The feature that many of my customers have been asking for about for many years is finally arriving…File Share Witness on a USB stick!

Okay, they didn’t really ask for that specifically, but many of my customers wanted to deploy a simple 2-node cluster in each store location, branch office, etc., and they didn’t want the added expense of a SAN to leverage a Disk Witness and weren’t to keen, or just didn’t have the connectivity, to rely on a Cloud Witness in Azure. Many of these customers just decided to forgo clustering, or they used an alternative clustering solution like the SIOS Protection Suite.

Now they have a viable alternative coming in Windows Server 2019. By leveraging a supported router, a USB disk inserted into the router can be configured with a file share that can be used as the witness. This eliminates the need for a 3rd server or internet connectivity.

https://blogs.msdn.microsoft.com/clustering/2018/04/16/new-file-share-witness-feature-in-windows-server-2019/

There are a few scenarios I can imagine, from HCI for Hyper-V, to a simple file server cluster using DataKeeper. Regardless of the scenario, keep in mind unless you plan on building a workgroup cluster, you probably will want to run a VM on each server to act as a redundant Domain Controllers, unless you have a reliable WAN connection back to a Domain Controller hosted in your main datacenter.

Can I put my File Share Witness on a DFS share?

July 15, 2018 daveberm6 Comments

I get asked this question all the time. People are concerned about losing their file share witness, so like many of their other shares, they want to leverage DFS for some additional availability. This is a very bad idea and is not supported.

Microsoft recently publish a great blog article that describes exactly why this is not supported.

https://blogs.msdn.microsoft.com/clustering/2018/04/13/failover-cluster-file-share-witness-and-dfs/

Much of this article would also apply to people who ask if they can use a DataKeeper replicated volume resource as a Disk Share. It makes sense, you can use a DataKeeper volume resource in place of a Physical Disk resource for any other workload, so why not a Disk Witness?

This issue is the same as the DFS issue, in the event of a loss of communication between the two servers there is nothing to guarantee that the volume wouldn’t come online on both servers, causing a potential split-brain condition. The Physical Disk resource overcomes this issue by using SCSI reservations, ensuring the disk is only accessible by one cluster node at a time.

The good news is that Microsoft already blocks you from trying to us a replicated DataKeeper Volume resource and coming in Windows Server 2019 it looks like they will also block you from using a DFS share as a File Share Witness.

Taken from the Failover Clustering and Network Load Balancing Team Blog Post “Failover Cluster File Share Witness and DFS

8th MVP Award

July 1, 2018July 1, 2018 davebermLeave a comment

Really glad to hear today that I’ve been re-awarded the Microsoft Cloud and Datacenter Management MVP award for 2018. It’s a great honor to be counted among some of the smartest people I know. Looking forward to the launch of Windows 2019 and whatever else Microsoft have up their sleeves for Azure in 2019.

High Availability Options for Microsoft SQL Server in the Google Cloud

April 12, 2018 daveberm3 Comments

I was recently interviewed by VMblog about high availability options for SQL Server. You can check out the interview here http://vmblog.com/

For the step by step guide I previously published, check it out here https://clusteringformeremortals.com/2018/01/10/how-to-build-a-sanless-sql-server-failover-cluster-instance-in-google-cloud-platform/

STORAGE SPACES DIRECT (S2D) FOR SQL SERVER FAILOVER CLUSTER INSTANCES (FCI)?

March 29, 2018March 17, 2021 davebermLeave a comment

With the introduction of Windows Server 2016 Datacenter Edition a new feature called Storage Spaces Direct (S2D) was introduced. At a very high level, this solution allows you to pool together locally attached storage and present it to the cluster as a CSV for use in a Scale Out File Server, which can then be accessed over SMB 3 and used to hold cluster data such as Hyper-V VMDK files. This can also be configured in a hyper-converged (HCI) fashion such that the application and data can all run on the same set of servers. This is a grossly over-simplified description, but for details, you will want to look here.

Image taken from https://docs.microsoft.com/en-us/windows-server/storage/storage-spaces/storage-spaces-direct-overview

The main use case targeted is hyper-converged infrastructure for Hyper-V deployments. However, there are other use cases, including leveraging this SMB storage to store SQL Server Data to be used in a SQL Server Failover Cluster Instance

Why would anyone want to do that? Well, for starters you can now build a highly available 2-node SQL Server Failover Cluster Instance (FCI) with SQL Server Standard Edition, without the need for shared storage. Previously, if you wanted HA without a SAN you pretty much were driven to buy SQL Server Enterprise Edition and make use of Always On Availability Groups or purchase SIOS DataKeeper and leverage the 3rd party solution which lets you build SANless clusters with any version of Windows or SQL Server. SQL Server Enterprise Edition can really drive up the cost of your project, especially if you were only buying it for the Availability Groups feature.

In addition to the cost associated with Availability Groups, there are a number of other technical reasons why you might prefer a Failover Cluster over an AG. Application compatibility, instance vs. database level protection, large number of databases, DTC support, trained staff, etc., are just some of the technical reasons why you may want to stick with a Failover Cluster Instance.

Microsoft lists both the SIOS DataKeeper solution and the S2D solution as two of the supported solutions for SQL Server FCI in their documentation here.

https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sql/virtual-machines-windows-sql-high-availability-dr

When comparing the two solutions, you have to take into account that SIOS has been allowing you to build SANless Clusters since 1999, while the S2D solution is still in its infancy. Having said that, there are bound to be some areas where S2D has some catching up to do, or simply features that they will never support simply due to the limitations with the technology.

Have a look at the following table for an overview of some of the things you should consider before you choose your SANless cluster solution.

If we go through this chart, we see that SIOS DataKeeper clearly has some significant advantages. For one, DataKeeper supports a much wider range of platforms, going all the way back to Windows Server 2008 R2 and SQL Server 2008 R2. The S2D solution only supports the latest releases of Windows and SQL Server 2016/2017. S2D also requires the Datacenter Edition of Windows, which can add significantly to the cost of your deployment. In addition, SIOS delivers the ONLY HA/DR solution for SQL Server on Linux that works both on-prem and in the cloud.

I’ve been talking to a lot of customers recently who are reporting some performance issues with S2D. When I tested S2D vs. DataKeeper about a year ago I didn’t see any significant differences in performance, but I did see S2D used about 2x the amount of CPU resources under the same load. This probably has to do with the high hardware requirements associated with S2D such as RDMA enabled networking and available Flash Storage, typically only available in the most expensive cloud based images.

“We recommend the I3 instance size because it satisfies the S2D hardware requirements and includes the largest and fastest instance store devices available.”

But beyond the cost and platform limitations, I think the most glaring gap comes when we start to consider that S2D does not support Availability Zones or disaster recovery configurations such as multi-site clusters or Azure Site Recovery (ASR). Allan Hirt, SQL Server Cluster guru and fellow Microsoft Cloud and Datacenter Management MVP, recently posted about this S2D limitation. In his article Revisiting Storage Spaces Direct and SQL Server FCIs Allan points out that due to the lack of support for stretching S2D clusters across sites or including an S2D based cluster as a leg in an Always On Availability Group, the best option for DR in the S2D scenario is log shipping! This even includes replicating across Availability Zones in either Azure or AWS.

Microsoft does not make it clear in their documentation, but Microsoft’s own PM for High Availability and Storage makes it perfectly clear in the Microsoft forums.

AWS also documents S2D’s lack of Availability Zone support…

“Each cluster node must be deployed in a different subnet. This architecture will be deployed into a single availability zone because Microsoft does not currently support stretch cluster with Storage Spaces Direct. ” – AWS Documentation on S2D

Deploying S2D cluster nodes within the same Availability Zone defeats the purpose of failover clustering and the deployment does not qualify for the AWS 99.99% SLA. Even if you wanted to deploy S2D in a single Availability Zone the deployment becomes even more complicated because it is recommended that you deploy at least three cluster nodes and each node must reside in its own subnet due to some AWS networking restrictions that requires each cluster node reside in a different subnet. S2D was never designed to run in different subnets, which further complicates the solution in terms of client redirection.

You can also find this statement in Disaster Recovery Scenarios for Hyper-Converged Infrastructure | Microsoft Docs

“One item to note is that if you are familiar with Failover Clusters in the past, stretch clusters have been a very popular option over the years. There was a bit of a design change with the hyper-converged solution and it is based on resiliency. If you lose two nodes in a hyper-converged cluster, the entire cluster will go down. With this being the case, in a hyper-converged environment, the stretch scenario is not supported.”
https://docs.microsoft.com/en-us/windows-server/storage/storage-spaces/storage-spaces-direct-disaster-recovery

In contrast, the SIOS DataKeeper solution fully supports Always On Availability Groups, and better yet – it can allow you to stretch your FCI across sites to give you the best HA/DR solution you could hope to achieve in terms of RTO/RPO. DataKeeper supports Availability Zones and DR configurations that cross cloud regions. In an Azure environment, DataKeeper also support Azure Site Recovery (ASR), giving you even more options for disaster recovery.

Further complicating any S2D deployment in AWS is the reliance on “local instance store” storage, AKA, non-persistent ephemeral disks.

“The best performance for storage can be achieved using I3 instances because they provide local instance store with NVMe and high network performance”

Reliance on ephemeral storage puts your data at risk any time a disk rebuilds, which can happen at any time, but always happens when an instance is stopped. If a disk is lost and a second disk is lost before the first disk rebuilds you are looking at complete data loss and a restore from backup. If someone accidentally stops all the nodes in your cluster your data will be lost! Even if you take care to only stop one node at a time if you are not paying attention and waiting for a disk to complete a rebuild after you stop the second node you will also experience complete data loss!

The rest of this chart is pretty self explanatory. It basically consist of a list hardware, storage and networking requirements that must be met before you can deploy an S2D cluster. An exhaustive list of S2D requirements is maintained here. https://docs.microsoft.com/en-us/windows-server/storage/storage-spaces/storage-spaces-direct-hardware-requirements

The SIOS DataKeeper solution is much more lenient. It supports any locally attached storage and as long as the hardware passes cluster validation, it is a supported cluster configuration. The block level replication solution has been working great ever since 1 Gbps was considered a fast LAN and a T1 WAN connection was considered a luxury.

SANless clustering is particularly interesting for cloud deployments. The cloud does not offer traditional shared storage options for clusters. So for users in the middle of a “lift and shift” to the cloud that want to take their clusters with them they must look at alternate storage solutions. For cloud deployments, SIOS is certified for Azure, AWS and Google and available in the relevant cloud marketplace. While there doesn’t appear to be anything blocking deployment of S2D based clusters in AWS or Google, there is a conspicuous lack of documentation or supportability statements from Microsoft for those platforms.

SIOS DataKeeper has been doing this since 1999. SIOS has heard all the feature requests, uncovered all the bugs, and has a rock solid solution for SANless clusters that is time tested and proven. While Microsoft S2D is a promising technology, as a 1st generation product I would wait until the dust settles and some of the feature gap closes before I would consider it for my business critical applications.

Clustering For Mere Mortals

Microsoft Cloud and Datacenter MVP David Bermingham's thoughts and advice on Windows clustering and other related technologies

High Availability