How To Linux: A Windows Administrator’s Guide to Linux for the Newbie

Well, it’s long overdue that I left the comfort of my Windows GUI and ventured into the world of Linux. Mind you, I have dabbled a little over the years: I watched some training videos about 18 years ago and even installed Ubuntu on an old laptop at that time. I never ventured far past the GUI that was available, as I recall. I think I muddled through an install of SQL Server on Linux once, relying heavily on Google and help from co-workers.

This time it’s for real. I’ve signed up for some college classes and will be earning a Certificate in Linux/Unix Administration. I’ll be completing this journey with my oldest son, who is considering joining me in the field of information technology.

I’m going to try to document everything I learn along the way, so that it might help someone else in their journey, but mostly so I can remember what I did the next time I have to do it. Now keep in mind, I have NO IDEA if what I am doing is the right way, best way, or most secure way of doing things. So anything you read should be taken with a grain of salt, and if you are actually administering a production workload you probably should get advice from a more experienced Linux expert.  And if you ARE a Linux expert, please feel free to add some comments and tell me what I am doing wrong or how I could do things better!

This will probably be the first in a series of articles if everything goes to plan. 

Linux Day 1

I haven’t started class yet, but I bought the recommended book, The Linux Command Line, 2nd Edition: A Complete Introduction. I quickly learned that it makes some assumptions that aren’t covered in the text. For instance, it assumes you know how to install and connect to some version of Linux. For me, the easiest thing to do was to use some of my Azure credits and spin up a Linux VM in the cloud. I won’t go through all the details of what I did in Azure, but basically I spun up a Red Hat Enterprise Linux 8.2 VM and opened up SSH port 22 so I could connect remotely. I also used an SSH public key for connectivity.

So great, my VM is running. Now how do I connect?

Connecting from a Mac

My main PC is a MacBook Pro. After a little searching around, I decided I would use the Terminal program for my first attempt at connecting to my instance. I discovered that you can create a “New Remote Connection”. If I recall correctly, when I used Windows I used a program called PuTTY.

Through some trial and error and some Google searches, I finally found the magic combination which allowed me to connect.

chmod

chmod is one of the things I do recall from my very limited experience with Linux. It is basically the way to change file permissions. I don’t know all the ins and outs of chmod yet, but I found out that before I could connect to my instance, I had to lock down the permissions on the private key that I downloaded when I created the Linux VM. This was the error message I received.

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

@         WARNING: UNPROTECTED PRIVATE KEY FILE!          @

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Permissions 0644 for '/Users/davidbermingham/Downloads/Linux1_key.pem' are too open.

It is required that your private key files are NOT accessible by others.

This private key will be ignored.

Load key "/Users/davidbermingham/Downloads/Linux1_key.pem": bad permissions

root@xx.xx.xx.xxx: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

[Process completed]

I discovered that I needed to run the following command to lock down the permissions on the private key so that only the owner of the file has full read/write access.

Last login: Sun Aug 28 17:59:26 on ttys005

MacBook-Pro-2:~ davidbermingham$ chmod 0600 /Users/davidbermingham/Downloads/Linux1_key.pem

MacBook-Pro-2:~ davidbermingham$ 
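If you want to double check the change, ls -l on the key file will show the new permissions:

ls -l /Users/davidbermingham/Downloads/Linux1_key.pem
# should now show -rw-------, which is 0600 in octal: read/write for the owner, nothing for group or others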

Connect with SSH

After I changed the permissions I was able to connect with SSH as shown below.

Last login: Sun Aug 28 18:00:09 on ttys006

MacBook-Pro-2:~ davidbermingham$ ssh -i /Users/davidbermingham/Downloads/Linux1_key.pem azureuser@xx.xx.xxx.xxx

Activate the web console with: systemctl enable --now cockpit.socket

This system is not registered to Red Hat Insights. See https://cloud.redhat.com/

To register this system, run: insights-client --register

Last login: Sun Aug 28 21:19:38 2022 from 98.110.69.71

[azureuser@Linux1 ~]$ 

Now What?

Now that I have what appears to be a working terminal into my Linux VM, I can move on to Chapter 1 in my book. But before I do that, I’m already thinking about how I could use my iPad Pro to open a terminal over SSH to my cloud instance. I’d much rather drag that to class than my whole laptop. A quick search tells me that there are apps that make that entirely possible, so I’ll be looking into that as well.

Finishing Chapter 1, I learned what the following commands do: date, cal, df, free, exit.
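For quick reference, here is roughly what each of those does:

date    # prints the current date and time
cal     # displays a calendar of the current month
df      # reports the free space on your mounted file systems
free    # reports free and used memory
exit    # ends the terminal session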

Try them out on your own. I’m moving on to Chapter 2.

Chapter 2: Navigation

In the first few paragraphs of Chapter 2 I learned something that clears up years of confusion on my part. Much like Windows, Linux has a hierarchical directory structure. However, there is only ever one root directory and a single file system tree. If you attach other disks, they are mounted somewhere within that directory structure, wherever the system administrator decides to mount them.

Here are some random commands introduced in Chapter 2. They are pretty self-explanatory.

pwd – Print Working Directory

pwd will show you the current directory you are working in.

[azureuser@Linux1 ~]$ pwd

/home/azureuser

[azureuser@Linux1 ~]$ 

ls – List Contents of Directory

Fun fact: filenames that begin with a period are hidden. In order to see hidden files you need to use ls -a.
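For example:

ls       # lists the visible files in the current directory
ls -a    # lists everything, including hidden files like .bashrc and .bash_profile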

cd – Change Directory

Absolute Path Names – specify a directory starting from the root directory (the path begins with /)

Relative Path Names – specify a directory relative to the current working directory, using . (dot) for the current directory and .. (dot, dot) for the parent directory

"cd" changes to your home directory

"cd -" changes to the previous working directory

"cd ~username" changes to that user's home directory
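Putting pwd and cd together, a quick navigation session on my Azure VM looks something like this:

cd /usr/bin      # absolute path name, starting from the root directory
pwd              # prints /usr/bin
cd ..            # relative path name, moves up one level to /usr
cd               # back to my home directory, /home/azureuser
cd -             # back to the previous working directory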

Some Fun Facts
  • Filenames and commands are case-sensitive in Linux.
  • Do not use spaces in filenames. You can use a period, dash, or underscore; an underscore is the best way to represent a space in a filename.
  • The Linux operating system has no concept of file extensions, but some applications do.

That’s what I learned on day one. I got through the first two chapters and I feel like I’ll be going into class a little ahead of the game. Now I have to get my son to crack his book and show him what I learned.

I’ll pick up the series again later this week.


Performance of Azure Shared Disk with Zone Redundant Storage (ZRS)

On September 9th, 2021, Microsoft announced the general availability of Zone-Redundant Storage (ZRS) for Azure Disk Storage, including Azure Shared Disk.

What makes this interesting is that you can now build shared-storage-based failover cluster instances that span Availability Zones (AZs). With cluster nodes residing in different AZs, users can now qualify for the 99.99% availability SLA. Prior to ZRS support, Azure Shared Disks only supported Locally Redundant Storage (LRS), limiting cluster deployments to a single AZ and leaving users susceptible to outages should an AZ go offline.

There are however a few limitations to be aware of when deploying an Azure Shared Disk with ZRS.

  • Only supported with Premium SSDs and Standard SSDs; Azure Ultra Disks are not supported.
  • Azure Shared Disks with ZRS are currently only available in the West US 2, West Europe, North Europe, and France Central regions.
  • Disk caching, both read and write, is not supported with Premium SSD Azure Shared Disks.
  • Disk bursting is not available for Premium SSD.
  • Azure Site Recovery support is not yet available.
  • Azure Backup is available through Azure Disk Backup only.
  • Only server-side encryption is supported; Azure Disk Encryption is not currently supported.

I also found an interesting note in the documentation.

“Except for more write latency, disks using ZRS are identical to disks using LRS, they have the same scale targets. Benchmark your disks to simulate the workload of your application and compare the latency between LRS and ZRS disks.”

While the documentation indicates that ZRS will incur some additional write latency, it is up to the user to determine just how much additional latency they can expect. A link to a disk benchmark document is provided to help guide you in your performance testing.

Following the guidance in the document, I used DiskSpd to measure the additional write latency you might experience. Of course, results will vary with workload, disk type, instance size, etc., but here are my results.

                       LRS        ZRS
Write IOPS             5099.82    4994.63
Average Latency (ms)   7.830      7.998

The DiskSpd test that I ran used the following parameters: a 200 GB test file (-c200G), 100% writes (-w100), an 8 KB block size (-b8K), 8 threads (-F8), random I/O (-r), 5 outstanding I/Os per thread (-o5), a 30-second warmup (-W30), a 10-second test duration (-d10), software and hardware caching disabled (-Sh), and latency capture enabled (-L).


diskspd -c200G -w100 -b8K -F8 -r -o5 -W30 -d10 -Sh -L testfile.dat

I wrote to a P30 disk with ZRS and a P30 with LRS attached to a Standard DS3 v2 (4 vcpus, 14 GiB memory) instance type. The shared ZRS P30 was also attached to an identical instance in a different AZ and added as shared storage to an empty cluster application.
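Comparing the two latency numbers, the difference works out to about 2%:

(7.998 - 7.830) / 7.830 = .0215, or roughly a 2% increase in average write latency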

A 2% overhead seems like a reasonable price to pay to have your data distributed synchronously across two AZs. However, I did wonder what would happen if you moved the clustered application to the remote node, effectively putting your disk in one AZ and your instance in a different AZ. 

Here are the results.

                       LRS        ZRS        ZRS when writing from the remote AZ
Write IOPS             5099.82    4994.63    4079.72
Average Latency (ms)   7.830      7.998      9.800
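Doing the same math for writes coming from the remote AZ:

(9.800 - 7.830) / 7.830 = .2516, or roughly a 25% increase in average write latency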

In that scenario I measured a 25% write latency increase. If you experience a complete failure of an AZ, both the storage and the instance will fail over to the secondary AZ, and you shouldn’t experience this increase in latency at all. However, other failure scenarios that aren’t AZ-wide could very well have your clustered application running in one AZ with your Azure Shared Disk running in a different AZ. In those scenarios you will want to move your clustered workload back to a node that resides in the same AZ as your storage as soon as possible to avoid the additional overhead.

Microsoft documents how to initiate a storage account failover to a different region when using GRS, but there is no way to manually initiate the failover of a storage account to a different AZ when using ZRS. You should monitor your failover cluster instance to ensure you are alerted any time a cluster workload moves to a different server and plan to move it back just as soon as it is safe to do so.

You can find yourself in this situation unexpectedly, but it will also certainly happen during planned maintenance of the clustered application servers when you do a rolling update. Awareness is the key to help you minimize the amount of time your storage is performing in a degraded state.

I hope in the future Microsoft allows users to initiate a manual failover of a ZRS disk the same way they do with GRS. The reason they added the feature to GRS was to put the power in the hands of the users in case automatic failover did not happen as expected. In the case of ZRS, I could see people wanting to tie the storage and the application together, ensuring they are always running in the same AZ, similar to how host-based replication solutions like SIOS DataKeeper do it.


Facebook, Instagram and WhatsApp just had a really bad Monday

It’s the end of the work day here on the east coast, and I see that Facebook is still unavailable. Facebook acknowledged the problem in a pair of Tweets.

I can pinpoint the time that Facebook went offline for me. I was trying to comment on a post and my comment was not posting. I was a little annoyed and almost thought the poster had blocked me or was deleting my comments. That was at 11:45 am EDT. More than 5 hours later, Facebook is still down for me.

While we don’t know the exact cause of the downtime, and whether it was user error, some nefarious assault, or just an unexpected calamity of errors, we can learn a few things about this outage at this point.

Downtime is expensive

While we may never know the exact cost of the downtime experienced today, there are a few costs that can already be measured. As of this writing, Facebook stock went down 4.89% today. That’s on top of an already brutal September for Facebook and other tech stocks.

The correction may have been inevitable, but the outage today certainly didn’t help matters.

But what was the real cost to the company? With many brands leveraging social media as an important part of their marketing outreach, how will this outage impact future advertising spend? At a minimum, I anticipate advertisers will investigate other social media platforms if they have not done so already. Only time will tell, but even before this outage we had seen more competition for marketing spend from other platforms such as TikTok.

Plan for the worst-case scenario

Things happen; we know that and plan for that. Business Continuity Plans (BCP) should be written to address any possible disaster. Again, we don’t know the exact cause of this particular disaster, but I would have to imagine that an RTO of 5+ hours is not written into any BCP that sits on the shelf at Facebook, Instagram or WhatsApp.

What’s in your BCP? Have you imagined every possible disaster? Have you measured the impact of downtime and defined an adequate recovery time objective (RTO) and recovery point objective (RPO) for each component of your business? I would venture to say that it’s impossible to plan for every possible thing that can go wrong. However, I would advise everyone to revisit their BCP on a regular basis and update it to include disasters that maybe weren’t on the radar the last time it was reviewed. Did you have a global pandemic in your BCP? If not, you may have been left scrambling to accommodate a work-from-home workforce. The point is, plan for the worst and hope for the best.

Communications in a disaster

Communications in the event of a disaster should be its own chapter in your BCP.

One Facebook employee told Reuters that all internal tools were down. Facebook’s response was made much more difficult because employees lost access to some of their own tools in the shutdown, people tracking the matter said.

Multiple employees said they had not been told what had gone wrong.

https://www.reuters.com/technology/facebook-instagram-down-thousands-users-downdetectorcom-2021-10-04/

A truly robust BCP must include multiple fallback means of communication. This becomes much more important as your business spreads out across multiple buildings, regions or countries. Just think about how your team communicates today. Phone, text, email and Slack might be your top four. But what if they are all unavailable? How would you reach your team? If you don’t know, you may want to start investigating other options. You may not need a shortwave radio and a flock of carrier pigeons, but I’m sure there is a government agency that keeps both of those on hand for a “break glass in case of emergency” situation.

Summary

You have a responsibility to yourself, your customers and your investors to make sure you take every precaution concerning the availability of your business. Make sure you invest adequate resources in creating your BCP and that the teams responsible for business continuity have the tools they need to ensure they can do their part in meeting the RTO and RPO defined in your BCP.


AWS BYOC – Bring your own SQL Server cluster with AWS EC2 and SIOS

Nice blog post from SQL Server expert Steve Rezhener.

DataSteve


Introduction

There are multiple options to implement a HADR solution for SQL Server in the AWS public cloud. The easiest way to do that is to use the AWS Database as a Service (DBaaS) product known as AWS RDS and enable the Multi-AZ option using your AWS Control Panel (Fig #1).

Fig #1

That is all it takes to roll out or add it to an existing instance. AWS will take care of the rest under the hood (depending on your SQL Server version, it might use Mirroring or Always On Availability Groups). It will provision all the necessary pieces for an automatic failover (witness, network, storage, etc…), so when the primary node goes down, it will be replaced by the secondary node. No ifs and no buts. This blog post is going to discuss how to build your own HADR solution using multiple AWS EC2s (BYOC) instead of a managed AWS…



Clustering SAP #ASCS and #ERS on #AWS using Windows Server Failover Clustering

When ensuring high availability for SAP ASCS and ERS running on Windows Server, the primary cluster solution you will want to use is Windows Server Failover Clustering. However, when deploying this in AWS you will quickly discover that there are a few obstacles you need to know how to overcome.

I recently wrote a step-by-step guide, published on the SAP blog, that walks you through the entire process. If you have any questions, please leave a comment.


How to create a DataKeeper Replicated Volume that has Multiple Targets via CLI

I often help people automate the configuration of their infrastructure so they can build 3-node clusters that span Availability Zones and Regions. The CLI for creating a DataKeeper Job and associated mirrors that contain more than one target can be a little confusing, so I’m documenting it here in case you find yourself looking for this information. The DataKeeper documentation describes this as a Mirror with Multiple Targets.

The environment in this example looks like this:

PRIMARY (10.0.2.100) – in AZ1
SECONDARY (10.0.3.100) – in AZ2
DR (10.0.1.10) – in a different Region

I want to create a synchronous mirror from PRIMARY to SECONDARY and an asynchronous mirror from PRIMARY to DR. I also have to make sure the DataKeeper Job knows how to create a mirror from SECONDARY to DR in case the SECONDARY or DR server ever becomes the source of the mirror. EMCMD will be used to create this multiple-target mirror.

We first need to create the Job that contains all the possible endpoints and defines whether the mirror will be Sync (S) or Async (A) between each pair of endpoints.

emcmd . createjob ddrive sqldata primary.datakeeper.local D 10.0.2.100 secondary.datakeeper.local D 10.0.3.100 S primary.datakeeper.local D 10.0.2.100 dr.datakeeper.local D 10.0.1.10 A secondary.datakeeper.local D 10.0.3.100 dr.datakeeper.local D 10.0.1.10 A

That single “createjob” command creates the Job. It might be a little easier to read if you break it up: the first two arguments are the Job name (ddrive) and description (sqldata), followed by one line per mirror endpoint pair, listing the system name, volume letter and IP address for each end of the mirror and ending with S or A for synchronous or asynchronous:

emcmd . createjob ddrive sqldata 
primary.datakeeper.local D 10.0.2.100 secondary.datakeeper.local D 10.0.3.100 S 
primary.datakeeper.local D 10.0.2.100 dr.datakeeper.local D 10.0.1.10 A 
secondary.datakeeper.local D 10.0.3.100 dr.datakeeper.local D 10.0.1.10 A

Next we need to create the mirrors.

emcmd 10.0.2.100 createmirror D 10.0.1.10 A
emcmd 10.0.2.100 createmirror D 10.0.3.100 S
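The first command creates the asynchronous mirror of the D drive from PRIMARY (10.0.2.100) to DR (10.0.1.10), and the second creates the synchronous mirror from PRIMARY to SECONDARY (10.0.3.100). The SECONDARY to DR mirror defined in the Job does not get created now; it only comes into play if SECONDARY or DR ever becomes the source of the mirror.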

Our DataKeeper Job should now look like this in the DataKeeper UI.

One-to-many DataKeeper Replicated Volume

And then finally we can register the DataKeeper Volume Resource in the cluster Available Storage with this command.

emcmd . registerclustervolume D

The DataKeeper Volume Resource will now appear in Available Storage as shown below.

DataKeeper Volume in Available Storage

You are now ready to install SQL Server, SAP, File Server or any other clustered resource you normally protect with Windows Server Failover Clustering.


How to extend your existing SQL Server Failover Cluster Instance to the Cloud for Disaster Recovery

I get asked this question all the time, so I figured it was time to write a blog post, record a video and write some code to automate the process so that it completes in under a minute.

First, some background. Typically when someone asks me how to do this, I point them to the DataKeeper documentation.

https://docs.us.sios.com/WindowsSPS/8.2/DKLE/DKCloudTechDoc/Content/DataKeeper/User_Guide/Extending_a_Traditional_2-Node_WSFC_Cluster_to_a_Third_Node_via_DataKeeper.htm

This first document talks about extending the cluster and adding a 3rd node to the existing cluster. That’s fine if your cluster supports three nodes, but if you are using SQL Server Standard Edition, Microsoft limits you to a 2-node cluster. In the case of a 2-node cluster you can still replicate to a 3rd node, but the recovery will be more of a manual process. This process is described here.

https://docs.us.sios.com/WindowsSPS/8.6/SPS4W/TechDoc/Content/DataKeeper/User_Guide/Extend_to_Non_Cluster_Node.htm

People typically read these instructions and get a little worried. They feel like they would be performing open heart surgery on their cluster. It really is more like changing your shirt! You are simply replacing the Cluster Disk resource with a DataKeeper Volume resource. As you’ll see in the video below, the process takes just a few seconds.

The code demonstrated in the video is shown below.

# Take the SQL Server group offline and remove the old Cluster Disk resource
Stop-ClusterGroup SQLServerGroup
Remove-ClusterResource -Name "Cluster Disk 1"
# Bring the disk online and make it writable
Set-Disk -Number 4 -IsOffline $False
Set-Disk -Number 4 -IsReadOnly $False
# Give the partition the drive letter the cluster expects
Import-Module -Name Storage
Set-Partition -DiskNumber 4 -PartitionNumber 1 -NewDriveLetter X
# Create the DataKeeper mirror and Job from the primary node to the DR node
New-DataKeeperMirror -SourceIP 10.0.2.100 -SourceVolume X -TargetIP 10.0.1.10 -TargetVolume X -SyncType Sync
New-DataKeeperJob -JobName "x drive" -JobDescription "sql data" -Node1Name primary.datakeeper.local -Node1IP 10.0.2.100 -Node1Volume x -Node2Name dr.datakeeper.local -Node2IP 10.0.1.10 -Node2Volume X -SyncType Sync
# Register the DataKeeper Volume resource in the cluster and make SQL Server depend on it
Add-ClusterResource -Name "DataKeeper Volume X" -ResourceType "DataKeeper Volume" -Group "SQLServerGroup"
Get-ClusterResource "DataKeeper Volume X" | Set-ClusterParameter VolumeLetter X
Get-ClusterResource -Name 'SQLServer' | Add-ClusterResourceDependency -Provider 'DataKeeper Volume X'
# Bring the SQL Server group back online
Start-ClusterGroup SQLServerGroup

After you run that code don’t forget you also need to click on Manage Shared Volumes to add the backup node to the DataKeeper job as shown in the video.

If you have SQL Server Enterprise Edition, the final step would be to install SQL Server on the DR node and choose to add a node to the existing cluster.

If you are using SQL Server Standard Edition then your job is done. You would simply follow these instructions to access your data on the 3rd node and then mount the replicated databases.

These directions are applicable whether your DR node is in the Cloud or your own DR site.


Changing a Drive Letter with PowerShell

Here is a short but sweet post on how to change the drive letter of a partition. Despite using my best Google skills, I couldn’t find an example that did exactly what I needed, so I rolled up my sleeves and figured it out on my own. I hope this helps someone out there.

Set-Partition -DiskNumber 4 -PartitionNumber 1 -NewDriveLetter X

As long as you know the DiskNumber and PartitionNumber this will immediately change the drive letter of the partition you specify.

You may also need to import the Storage module into PowerShell before you can do this.

Import-Module -Name Storage
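If you aren’t sure which DiskNumber and PartitionNumber you need, something like this should list them for you once the Storage module is loaded:

Get-Partition | Select-Object DiskNumber, PartitionNumber, DriveLetter, Size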

Let me know if this helped you!


Using Amazon FSx for SQL Server Failover Cluster Instances – What You Need to Know!

Intro

If you are considering deploying your own Microsoft SQL Server instances in AWS EC2 you have some decisions to make regarding the resiliency of the solution. Sure, AWS will offer you a 99.99% SLA on your Compute resources if you deploy two or more instances across different availability zones. But don’t be fooled, there are a lot of other factors you need to consider when calculating your true application availability. I recently blogged about how to calculate your application availability in the cloud. You probably should have a quick read of that article before you move on.

When it comes to ensuring your Microsoft SQL Server instance is highly available, it really comes down to two basic choices: Always On Availability Group (AG) or SQL Server Failover Cluster Instance (FCI). If you are reading this article I’m making an assumption you are well aware of both of these options and are seriously considering using a SQL Server FCI instead of a SQL Server Always On AG.

Benefits of a Microsoft SQL Server Failover Cluster Instance

The following list summarizes what AWS says are the benefits of a SQL Server FCI:

FCI is generally preferable over AG for SQL Server high availability deployments when the following are priority concerns for your use case:

License cost efficiency: You need the Enterprise Edition license of SQL Server to run AGs, whereas you only need the Standard Edition license to run FCIs. This is typically 50–60% less expensive than the Enterprise Edition. Although you can run a Basic version of AGs on Standard Edition starting from SQL Server 2016, it carries the limitation of supporting only one database per AG. This can become a challenge when dealing with applications that require multiple databases like SharePoint.

Instance-level protection versus database-level protection: With FCI, the entire instance is protected – if the primary node becomes unavailable, the entire instance is moved to the standby node. This takes care of the SQL Server logins, SQL Server Agent jobs, certificates, etc. that are stored in the system databases, which are physically stored in shared storage. With AG, on the other hand, only the databases in the group are protected, and system databases cannot be added to an AG – only user databases are allowed. It is the database administrator’s responsibility to replicate changes to system objects on all AG replicas. This leaves the possibility of human error causing the database to become inaccessible to the application.

DTC feature support: If you’re using SQL Server 2012 or 2014, and your application uses Distributed Transaction Coordinator (DTC), you are not able to use an AG as it is not supported. Use FCI in this situation.

https://aws.amazon.com/blogs/storage/simplify-your-microsoft-sql-server-high-availability-deployments-using-amazon-fsx-for-windows-file-server/

Challenges with FCI in the Cloud

Of course, the challenge with building an FCI that spans availability zones is the lack of the shared storage device that is normally required when building a SQL Server FCI. Because the nodes of the cluster are distributed across multiple datacenters, a traditional SAN is not a viable option for shared storage. That leaves us with two choices for cluster storage: 3rd-party storage class resources like SIOS DataKeeper or the new Amazon FSx. Let’s take a look at what you need to know before you make your choice.

Service Level Agreement

As I wrote in how to calculate your application availability, your overall application SLA is only as good as your weakest link. In this case, the weakest link is the FSx SLA of 99.9%.

Normally 99.99% availability represents the starting point of what is considered “highly available”. This is what AWS promises you for your compute resources when two or more are deployed in different availability zones.

In case you didn’t know the difference between three nines and four nines…

  • 99.9% availability allows for 43.83 minutes of downtime per month
  • 99.99% availability allows for only 4.38 minutes of downtime per month

By hosting your cluster storage on FSx, despite your 99.99% compute availability, your overall application availability drops to roughly 99.9%. In contrast, EBS volumes that span availability zones, such as in a DataKeeper deployment, qualify for the 99.99% SLA at both the storage and compute layers, meaning your overall application availability is 99.99%.
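Doing the math the same way I did in my application availability article:

.9999 (compute) * .999 (FSx) = .9989, or roughly 99.9% at best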

Storage Location

When configuring FSx for high availability, you will want to enable multi-AZ support. By enabling multi-AZ, you effectively have a “preferred” AZ and a “standby” AZ. When you deploy your SQL Server FCI nodes, you will want to distribute those nodes across the same AZs.

Now in normal situations, you will want to make sure the active cluster node resides in the same AZ as the preferred FSx storage node. This is to minimize the distance and latency to your storage, but also to minimize the costs associated with data transfer across AZs. As specified in the FSx price guide, “Standard data transfer fees apply for inter-AZ or inter-region access to file systems.”

In the unfortunate circumstance where you have a SQL Server FCI failure, but not an FSx failure, there is no mechanism to tie the storage and compute together. In the event that FSx fails over, it will automatically fail back to the primary availability zone. However, best practices dictate that the SQL Server FCI remain running on the secondary node until root cause analysis is performed, and failback is typically scheduled to occur during a maintenance period. This leaves you in a situation where your storage resides in a different AZ than your instance, which will incur additional costs. Currently the cost for transferring data across AZs, both ingress and egress, is $0.01/GB.

Without keeping a close eye on the state of your FSx file system and your SQL Server FCI, you may not even be aware that they are running in different AZs until you see the data transfer charge at the end of the month.

In contrast, in a configuration that uses SIOS DataKeeper, the storage failover is part of the SQL Server FCI recovery, ensuring that the storage always fails over with the SQL Server instance. This ensures SQL Server is always reading and writing to the EBS volumes that are directly attached to the active node. Keep in mind, DataKeeper will incur a data transfer cost associated with the write operations that are replicated between AZs or regions. This data transfer cost can be minimized with the use of the compression available in DataKeeper.

Controlling Failover

In an FSx multi-subnet configuration there is a preferred availability zone and a standby availability zone. Should the FSx file server in the preferred availability zone experience a failure, service recovers on the file server in the standby AZ. AWS reports that this recovery takes about 30 seconds with standard shares; with the use of continuously available file shares, Microsoft reports this failover time can be closer to 15 seconds. During this failover there is a brownout where reads and writes are paused, but they will continue once recovery completes.

FSx multi-AZ has automatic failback enabled, meaning that for every unplanned failover of FSx, you also get an unplanned failback. In contrast, when a SQL Server FCI experiences an unplanned failover, you would typically either leave it running on the secondary node or schedule a failback after hours or during the next maintenance period.

SQL Server Analysis Services Cluster Not Supported with FSx

If you want to cluster SSAS, you will need a clustered disk resource like SIOS DataKeeper. The How to Cluster SQL Server Analysis Server white paper clearly states that SMB cannot be used and that cluster drives with drive letters must be used. In contrast, the DataKeeper Volume resource presents itself as a clustered disk and can be used with SSAS.

Summary

FSx certainly can make sense for typical SMB uses like Windows user files and other non-critical services where a 99.9% availability SLA suffices. However, if your application requires high availability (99.99%) or an HA/DR solution that also spans regions, SIOS DataKeeper is the right fit.


Calculating Application Availability in the Cloud

When deploying business critical applications in the cloud you want to make sure they are highly available. The good news is that if you plan properly, you can achieve 99.99% (4-nines) of availability or more. However, calculating your true availability may not be as straightforward as it seems.

When considering availability you must consider the key components that make access to your application possible, which I’ll call the availability chain. The components of the availability chain are:

  • Compute
  • Network 
  • Storage
  • Application
  • Dependent services

Your application is only as available as your weakest link, and your potential downtime compounds with each additional link you add to the chain. Let’s examine each of the links.

Compute Availability

The three major cloud service providers have some similarities. One thing in common across all three platforms is the service level agreement (SLA) they will commit to for compute.

The SLA for all three public cloud providers for VMs when you have two or more VMs configured across different availability zones is 99.99%. Keep in mind, this SLA only guarantees the remote accessibility of at least one of the VMs at any given time; it makes no promises as to the availability of the services or application(s) running inside the VM. If you deploy a single VM within a single datacenter, this SLA varies from “90% of each hour” (AWS) to 99.5% (Azure and GCP) or 99.9% (Azure single VM when using Premium SSD).
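To put those single-VM numbers in perspective, 99.5% availability works out to several hours of potential downtime every month:

roughly 43,800 minutes in a month * .005 = about 219 minutes, or 3.65 hours of downtime per month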

True high availability starts at 99.99%, so the first step in making your application available is to make sure the application is distributed across two or more VMs that span availability zones. With two VMs spread across two availability zones, giving you 99.99% availability of at least one of those VMs, you could theorize that three VMs spread across three availability zones would give you even greater than 99.99% availability. Although the cloud providers’ SLAs will never guarantee beyond 99.99% availability regardless of the number of availability zones in use, if you use pure statistics you might conclude that your availability could jump as high as 99.999999%, or 8-nines of availability, which is 26.30 milliseconds of downtime per month.

1-(.0001*.0001) = .99999999

99.999999% availability with three availability zones?

Don’t go around quoting that number, but keep in mind that if two availability zones can give you 99.99% availability, it stands to reason that three availability zones will give you something significantly better than 99.99%.

Compute is just one link in the availability chain. We still have to address network, storage and other dependent services, which all represent possible points of failure.

Network Availability

In order for your application to be available, every network hop between the client and the application, and all the resources that the application depends on, must be available and working within tolerable latency ranges. You need to understand the network links between database servers, application servers, web servers and clients to know precisely where the network might fail. And remember, the more links in your availability chain, the lower your overall availability will be.

Although network availability between VMs in the same vNet is covered under the standard compute SLA, there are other network services that you may be utilizing. Here are just a few examples of network services you could be utilizing which would impact overall application availability:

  • ExpressRoute – 99.95%
  • VPN Gateway – 99.9% to 99.95%
  • Load Balancer – 99.99%
  • Traffic Manager – 99.99%
  • Elastic Load Balancer – 99.99%
  • Direct Connect – 99.9% to 99.99%

Building on what we have learned so far, let’s take a look at the availability of an application that is deployed across two availability zones. 

99.99% compute availability

99.99% load balancer availability

.9999 * .9999 = .9998

99.98% availability = ~9 minutes downtime per month

Now that we have addressed compute and network availability, let’s move on to storage.

Storage Availability

Now here is where the story gets a little hairy. Have a look at the following storage SLAs:

https://azure.microsoft.com/en-us/support/legal/sla/storage/v1_5/

https://cloud.google.com/storage/sla

https://aws.amazon.com/compute/sla/

It seems pretty clear that Azure and Google are giving you a 99.9% SLA on block storage solutions. AWS doesn’t mention EBS specifically here; they only talk about VMs and measure their single-instance VM availability by the hour instead of by the month as the other cloud providers do. For the sake of discussion, let’s use the 99.9% availability guarantee that both Azure and GCP have published.

Building upon our previous example, let’s add some storage to the equation.

99.99% compute availability

99.99% load balancer availability

99.9% managed disk

.9999 * .9999 * .999 = .9988

99.88% availability = ~53 minutes of downtime per month.

53 minutes of downtime is a lot more than the 9 minutes of downtime we calculated in our previous example. What can we do to minimize the impact of the 99.9% storage availability? We have to build more redundancy into the storage!

Fortunately, we usually include storage redundancy when planning for application availability. For instance, when we stand up web servers, each web server will typically store data on its own locally attached disk. When deploying domain controllers, Microsoft Active Directory takes care of replicating AD information across all the domain controllers. In the case of something like SQL Server, we leverage things like Always On Availability Groups or SIOS DataKeeper to keep the data in sync across locally attached disks.

The more copies of the data we have distributed across different availability zones, the more likely we will be able to survive a failure.

For example, an application that stores its data across two different disks in different availability zones will benefit from that redundancy; instead of 99.9% availability, the storage is more likely to achieve 99.9999% availability.

1 – (.001 * .001) = .999999

If we throw that into the previous equation the picture starts to look a little brighter.

.9999 * .9999 * .999999 = .9998

99.98% availability = ~9 minutes of downtime

By duplicating the data across multiple AZs, and therefore multiple disks, we have effectively mitigated the downtime associated with cloud storage.

Application and Dependent Services Availability

You’ve done all you can do to ensure compute, network, and storage availability. But what about the application itself? Some applications can scale out and provide redundancy by load balancing between multiple instances of the same application. Think of your typical web server farm, where you may load balance web requests between five servers. If you lose one server, the load balancer simply removes it from its rotation until it is once again responsive.

Other applications require a little more care and monitoring. Take SQL Server for instance. Typically Always On Availability Groups or Failover Cluster Instances are used to monitor database availability and take recovery actions should a database become unresponsive due to application or system level failures. While there is no published SLA for SQL Server availability solutions, it is commonly accepted that when configured properly for high availability, a SQL Server can provide 99.99% availability.

Other cloud-based services you may rely on, like hosted Active Directory, hosted DNS, microservices, or even the availability of the cloud portal itself, should all be factored into your overall availability equation.

Summary

Application availability is the product of all the moving parts. Skimping in just one area can dramatically impact the overall availability of your application. Take your time and investigate all the links in your availability chain for weaknesses, including compute, network, storage, application and dependent services.

In general, the numbers presented here are hopefully worst-case scenarios, and your actual availability should exceed the published SLAs. Do your homework and be wary of any service that cannot guarantee 99.99% availability, the typical threshold of what is considered highly available.

Human error and security were not addressed in this article. You can make your application as highly available as possible, but if you have not taken steps to secure your application against external threats and stupid human mistakes then all bets are off when it comes to availability.
