Cross-site scripting (XSS) attacks

Today, we’re diving into the world of cross-site scripting (XSS) attacks, breaking them down into three categories: Reflected XSS, Stored XSS, and DOM XSS. Let’s explore these digital threats and learn how they can impact everyday users like you and me.

Reflected XSS – The Click-Trap:
Imagine you receive a seemingly innocent link through email, chat, or social media. You click on it, unaware that it contains a hidden script. This script bounces from the website to your browser, where it runs and wreaks havoc. It could steal your sensitive information or carry out actions as if it were you. The key to avoiding this trap? Be cautious and think twice before clicking on any unfamiliar links!

Stored XSS – The Web Page Booby Trap:
In a stored XSS attack, a devious attacker plants a script into a website’s database or storage. The script blends in with the site’s regular content and lies in wait. When you visit the affected page, the script springs into action, running in your browser and potentially putting your information at risk. The attacker may even perform actions on your behalf. The scariest part? Stored XSS can target multiple users over time, without anyone needing to click a specific link.

DOM XSS – The Sneaky Browser Attack:
Let’s talk about DOM XSS, a crafty attack that targets the user’s browser itself. When a web application’s client-side code (such as JavaScript) processes user input and updates the page content without proper sanitization, the attacker spies an opportunity. They inject malicious scripts that execute when the page is updated. While DOM XSS may share similarities with reflected and stored XSS attacks, the difference lies in the manipulation of client-side code rather than server-side code.

Stay Safe, Mere Mortals:
To protect yourself and your web applications from these XSS threats, remember the golden rule: use proper input validation and output encoding. By doing so, you’ll ensure that user-generated content can’t be weaponized as a vehicle for executing malicious scripts. Surf safely, fellow mortals!
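
As a tiny illustration of output encoding (shown here with PowerShell and the .NET WebUtility class purely for demonstration; the exact API depends on your web framework), encoding turns a would-be script into inert text:

# HTML-encode untrusted input before rendering it, so the browser treats it as text
$userInput = '<script>alert(1)</script>'
[System.Net.WebUtility]::HtmlEncode($userInput)
# Output: &lt;script&gt;alert(1)&lt;/script&gt;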


How to cluster SAP ASCS/SCS with SIOS DataKeeper on VMware ESXi Servers

This article describes the steps you take to prepare the VMware infrastructure for installing and configuring a high-availability SAP ASCS/SCS instance on a Windows failover cluster by using SIOS DataKeeper as the replicated cluster storage.

Create the ASCS VMs

For the SAP ASCS/SCS cluster, deploy two VMs on different ESXi servers.

Based on your deployment type, the host names and IP addresses for this scenario are as follows:

SAP deployment

Host name role                             Host name      Static IP address
1st cluster node ASCS/SCS cluster          pr1-ascs-10    10.0.0.4
2nd cluster node ASCS/SCS cluster          pr1-ascs-11    10.0.0.5
Cluster network name                       pr1clust       10.0.0.42
ASCS cluster network name                  pr1-ascscl     10.0.0.43
ERS cluster network name (only for ERS2)   pr1-erscl      10.0.0.44

On each VM, add an additional virtual disk. We will later mirror these disks with DataKeeper and use them as part of our cluster.
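
If you prefer to prepare the new disk from PowerShell, here is a minimal sketch. The disk number (1) and drive letter (S) are assumptions; adjust them to match your VM's layout.

# Bring the new disk online, partition it, and format it as volume S
Initialize-Disk -Number 1 -PartitionStyle GPT
New-Partition -DiskNumber 1 -DriveLetter S -UseMaximumSize |
    Format-Volume -FileSystem NTFS -NewFileSystemLabel "DataKeeper" -Confirm:$false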

Add the Windows VMs to the domain

After you assign static IP addresses to the virtual machines, add the virtual machines to the domain.
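
A minimal sketch of the domain join from PowerShell; the domain name contoso.local is a placeholder for your environment.

# Join the VM to the domain and restart; run on each cluster node
Add-Computer -DomainName "contoso.local" -Credential (Get-Credential) -Restart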

Install and configure Windows failover cluster

Install the Windows failover cluster feature

Run this command on one of the cluster nodes:

PowerShell

# Hostnames of the Win cluster for SAP ASCS/SCS
$SAPSID = "PR1"
$ClusterNodes = ("pr1-ascs-10","pr1-ascs-11")
$ClusterName = $SAPSID.ToLower() + "clust"

# Install the Windows features.
# After the features install, manually reboot both nodes.
Invoke-Command $ClusterNodes {Install-WindowsFeature Failover-Clustering, FS-FileServer -IncludeAllSubFeature -IncludeManagementTools }

Once the feature installation has completed, reboot both cluster nodes.

Test and configure Windows failover cluster

PowerShell

# Hostnames of the Win cluster for SAP ASCS/SCS
$SAPSID = "PR1"
$ClusterNodes = ("pr1-ascs-10","pr1-ascs-11")
$ClusterName = $SAPSID.ToLower() + "clust"

# IP address for cluster network name
$ClusterStaticIPAddress = "10.0.0.42"

# Test the cluster, then create it
Test-Cluster -Node $ClusterNodes -Verbose
New-Cluster -Name $ClusterName -Node $ClusterNodes -StaticAddress $ClusterStaticIPAddress -Verbose

Configure cluster cloud quorum

Because you're using Windows Server 2016 or 2019, we recommend configuring Azure Cloud Witness as the cluster quorum witness.

Run this command on one of the cluster nodes:

PowerShell

$AzureStorageAccountName = "cloudquorumwitness"
Set-ClusterQuorum -CloudWitness -AccountName $AzureStorageAccountName -AccessKey <YourAzureStorageAccessKey> -Verbose

Alternatively, you can use a File Share Witness on a third server in your environment. For redundancy, this server should run on a third ESXi host.
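
If you go the File Share Witness route instead, the configuration is a one-liner; the share path below is a placeholder.

# Configure a file share witness hosted on a third server
Set-ClusterQuorum -FileShareWitness "\\witness-srv\pr1-witness" -Verbose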

SIOS DataKeeper Cluster Edition for the SAP ASCS/SCS cluster share disk

Now, you have a working Windows Server failover clustering configuration. To install an SAP ASCS/SCS instance, you need a shared disk resource. One of the options is to use SIOS DataKeeper Cluster Edition.

Installing SIOS DataKeeper Cluster Edition for the SAP ASCS/SCS cluster share disk involves these tasks:

  • Install SIOS DataKeeper
  • Configure SIOS DataKeeper

Install SIOS DataKeeper

Install SIOS DataKeeper Cluster Edition on each node in the cluster. To create virtual shared storage with SIOS DataKeeper, create a synced mirror and then simulate cluster shared storage.

Before you install the SIOS software, create the DataKeeperSvc domain user.

Add the DataKeeperSvc domain user to the Local Administrator group on both cluster nodes.
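
A hedged sketch of those two steps in PowerShell; the domain name and password handling are placeholders, and the first command assumes the ActiveDirectory RSAT module is available.

# Create the DataKeeperSvc domain user (run on a machine with the ActiveDirectory module)
$Password = Read-Host -AsSecureString -Prompt "Password for DataKeeperSvc"
New-ADUser -Name "DataKeeperSvc" -SamAccountName "DataKeeperSvc" `
    -AccountPassword $Password -Enabled $true -PasswordNeverExpires $true

# Add the account to the local Administrators group on both cluster nodes
Invoke-Command -ComputerName "pr1-ascs-10","pr1-ascs-11" {
    Add-LocalGroupMember -Group "Administrators" -Member "CONTOSO\DataKeeperSvc"
}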

  1. Install the SIOS software on both cluster nodes.
    Figure 31: First page of the SIOS DataKeeper installation
  2. In the dialog box, select Yes.
    Figure 32: DataKeeper informs you that a service will be disabled
  3. In the dialog box, we recommend that you select Domain or Server account.
    Figure 33: User selection for SIOS DataKeeper
  4. Enter the domain account username and password that you created for SIOS DataKeeper.
    Figure 34: Enter the domain user name and password for the SIOS DataKeeper installation
  5. Install the license key for your SIOS DataKeeper instance.
    Figure 35: Enter your SIOS DataKeeper license key
  6. When prompted, restart the virtual machine.

Configure SIOS DataKeeper

After you install SIOS DataKeeper on both nodes, start the configuration. The goal of the configuration is to have synchronous data replication between the additional disks that are attached to each of the virtual machines.

  1. Start the DataKeeper Management and Configuration tool, and then select Connect Server.
    Figure 36: SIOS DataKeeper Management and Configuration tool
  2. Enter the name or TCP/IP address of the first node that the Management and Configuration tool should connect to, and then, in a second step, the second node.
    Figure 37: Insert the name or TCP/IP address of the first node the Management and Configuration tool should connect to, and in a second step, the second node
  3. Create the replication job between the two nodes.
    Figure 38: Create a replication job
    A wizard guides you through the process of creating a replication job.
  4. Define the name of the replication job.
    Figure 39: Define the name of the replication job
    Define the base data for the node, which should be the current source node.
  5. Define the name, TCP/IP address, and disk volume of the target node.
    Figure 41: Define the name, TCP/IP address, and disk volume of the current target node
  6. Define the compression algorithms. In our example, we recommend that you compress the replication stream. Especially in resynchronization situations, compression of the replication stream dramatically reduces resynchronization time. Compression uses the CPU and RAM resources of a virtual machine. As the compression rate increases, so does the volume of CPU resources that are used. You can adjust this setting later.
  7. Another setting you need to check is whether the replication occurs asynchronously or synchronously. When you protect SAP ASCS/SCS configurations, you must use synchronous replication.
    Figure 42: Define replication details
  8. Define whether the volume that is replicated by the replication job should be represented to a Windows Server failover cluster configuration as a shared disk. For the SAP ASCS/SCS configuration, select Yes so that the Windows cluster sees the replicated volume as a shared disk that it can use as a cluster volume.
    Figure 43: Select Yes to set the replicated volume as a cluster volume
    After the volume is created, the DataKeeper Management and Configuration tool shows that the replication job is active.
    Figure 44: DataKeeper synchronous mirroring for the SAP ASCS/SCS share disk is active
    Failover Cluster Manager now shows the disk as a DataKeeper disk, as shown in Figure 45:
    Figure 45: Failover Cluster Manager shows the disk that DataKeeper replicated

We don't describe the DBMS setup in this article because setups vary depending on the DBMS you use. We assume that high-availability concerns with the DBMS are addressed with the functionality that the different DBMS vendors support.

The installation procedures of SAP NetWeaver ABAP systems, Java systems, and ABAP+Java systems are almost identical. The most significant difference is that an SAP ABAP system has one ASCS instance. The SAP Java system has one SCS instance. The SAP ABAP+Java system has one ASCS instance and one SCS instance running in the same Microsoft failover cluster group. Any installation differences for each SAP NetWeaver installation stack are explicitly mentioned. You can assume that the rest of the steps are the same.

Install SAP with a high-availability ASCS/SCS instance

Important

If you use SIOS to present a shared disk, don’t place your page file on the SIOS DataKeeper mirrored volumes. 

Installing SAP with a high-availability ASCS/SCS instance involves these tasks:

  • Create a virtual host name for the clustered SAP ASCS/SCS instance.
  • Install SAP on the first cluster node.
  • Modify the SAP profile of the ASCS/SCS instance.

Create a virtual host name for the clustered SAP ASCS/SCS instance

  1. In the Windows DNS manager, create a DNS entry for the virtual host name of the ASCS/SCS instance. (A scripted alternative is sketched after this list.)
    Figure 1: Define the DNS entry for the SAP ASCS/SCS cluster virtual name and TCP/IP address
  2. If you are using the new SAP Enqueue Replication Server 2, which is also a clustered instance, you also need to reserve a virtual host name for ERS2 in DNS.
    Figure 1A: Define the DNS entry for the SAP ERS2 cluster virtual name and TCP/IP address
  3. To define the IP address that's assigned to the virtual host name, select DNS Manager > Domain.
    Figure 2: New virtual name and TCP/IP address for SAP ASCS/SCS cluster configuration
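
For reference, here is a hedged sketch of the same DNS entries created with the DnsServer PowerShell module on the DNS server; the zone name contoso.local is a placeholder, and the host names and IP addresses mirror the table earlier in this article.

# Create A records for the ASCS and ERS2 virtual host names
Add-DnsServerResourceRecordA -ZoneName "contoso.local" -Name "pr1-ascscl" -IPv4Address "10.0.0.43"
Add-DnsServerResourceRecordA -ZoneName "contoso.local" -Name "pr1-erscl" -IPv4Address "10.0.0.44"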

Install SAP on the first cluster node

  1. Execute the first cluster node option on cluster node A. Select:
    • ABAP system: ASCS instance number 00
    • Java system: SCS instance number 01
    • ABAP+Java system: ASCS instance number 00 and SCS instance number 01
  2. Follow the installation procedure described by SAP. In the "First Cluster Node" installation option, make sure to select "Cluster Shared Disk" as the configuration option.

The SAP installation documentation describes how to install the first ASCS/SCS cluster node.

Modify the SAP profile of the ASCS/SCS instance

If you have Enqueue Replication Server 1, add SAP profile parameter enque/encni/set_so_keepalive as described below. The profile parameter prevents connections between SAP work processes and the enqueue server from closing when they are idle for too long. The SAP parameter is not required for ERS2.

  1. If you are using ERS1, add this profile parameter to the SAP ASCS/SCS instance profile:

     enque/encni/set_so_keepalive = true

  2. For both ERS1 and ERS2, make sure that the keepalive OS parameters are set as described in SAP Note 1410736.
  3. To apply the SAP profile parameter changes, restart the SAP ASCS/SCS instance.

Install the database instance

To install the database instance, follow the process that’s described in the SAP installation documentation.

Install the second cluster node

To install the second cluster node, follow the steps that are described in the SAP installation guide.

Install the SAP Primary Application Server

Install the Primary Application Server (PAS) instance <SID>-di-0 on the virtual machine that you’ve designated to host the PAS.

Install the SAP Additional Application Server

Install an SAP Additional Application Server (AAS) on all the virtual machines that you’ve designated to host an SAP Application Server instance.

Test the SAP ASCS/SCS instance failover

For the outlined failover tests, we assume that SAP ASCS is active on node A.

  1. Verify that the SAP system can successfully fail over from node A to node B. Choose one of these options to initiate a failover of the SAP cluster group from cluster node A to cluster node B:
    • Failover Cluster Manager
    • Failover Cluster PowerShell

    PowerShell

    $SAPSID = "PR1"     # SAP <SID>
    $SAPClusterGroup = "SAP $SAPSID"
    Move-ClusterGroup -Name $SAPClusterGroup

  2. Restart cluster node A within the Windows guest operating system. This initiates an automatic failover of the SAP <SID> cluster group from node A to node B.
  3. Restart cluster node A from vCenter. This initiates an automatic failover of the SAP <SID> cluster group from node A to node B.
  4. After each failover, verify that SIOS DataKeeper is replicating data from source volume drive S on cluster node B to target volume drive S on cluster node A.
    Figure 9: SIOS DataKeeper replicates the local volume from cluster node B to cluster node A


How to use Azure Site Recovery (ASR) to replicate a Windows Server Failover Cluster (WSFC) that uses SIOS DataKeeper for cluster storage

Intro

So you have built a SQL Server Failover Cluster Instance (FCI), or maybe an SAP ASCS/ERS cluster in Azure. Each node of the cluster resides in a different Availability Zone (AZ), or maybe you have strict latency requirements and are using Proximity Placement Groups (PPG) and your nodes all reside in the same Availability Set. Regardless of the scenario, you now have a much higher level of availability for your business critical application than if you were just running a single instance.

Now that you have high availability (HA) covered, what are you going to do for disaster recovery? Regional disasters that take out multiple AZs are rare, but as recent history has shown us, Mother Nature can really pack a punch. You want to be prepared should an entire region go offline. 

Azure Site Recovery (ASR) is Microsoft’s disaster recovery-as-a-service (DRaaS) offering that allows you to replicate entire VMs from one region to another. It can also replicate virtual machines and physical servers from on-prem into Azure, but for the purpose of this blog post we will focus on the Azure Region-to-Region DR capabilities.

Setting up ASR

We are going to assume you have already built your cluster using SIOS DataKeeper. If not, here are some pointers to help get you started.

Failover Cluster Instances with SQL Server on Azure VMs

SIOS DataKeeper Cluster Edition for the SAP ASCS/SCS cluster share disk

We are also going to assume you are familiar with Azure Site Recovery. Instead of yet another guide on setting up ASR, I suggest you read the latest documentation from Microsoft. This article will focus instead on some things you may not have considered and the specific steps required to fix your cluster after a failover to a different subnet.

Paired Regions

Before you start down the DR path, you should be aware of the concept of Azure Paired Regions. Every Region in Azure has a preferred DR Region. If you want to learn more about Paired Regions, the documentation provides a great background. There are some really good benefits of using your paired region, but it’s really up to you to decide on what region you want to use to host your DR site.

Cloud Witness Location

When you originally built your cluster you had to choose a witness type for your quorum. You may have selected a File Share Witness or a Cloud Witness. Typically either of those witness types should reside in an AZ that is separate from your cluster nodes. 

However, when you consider that, in the event of a disaster, your entire cluster will be running in your DR region, there is a better option. You should use a cloud witness and place it in your DR region. By placing your cloud witness in your DR region, you provide resiliency not only for local AZ failures, but also protection should the entire region fail and you have to use ASR to recover your cluster in the DR region. Through the magic of Dynamic Quorum and Dynamic Witness, you can be sure that even if your DR region goes offline temporarily, it will not impact your production cluster.

Multi-VM Consistency

When using ASR to replicate a cluster, it is important to enable Multi-VM Consistency to ensure that each cluster node’s recovery point is from the same point in time. That ensures that the DataKeeper block level replication occurring between the VMs will be able to continue after recovery without requiring a complete resync.

Crash Consistent Recovery Points

Application consistent recovery points are not supported in replicated clusters. When configuring the ASR replication options do not enable application consistent recovery points.

Keep IP Address After Failover?

When using ASR to replicate to your DR site there is a way to keep the IP address of the VMs the same. Microsoft described it in the article entitled Retain IP addresses during failover. If you can keep the IP address the same after failover it will simplify the recovery process since you won’t have to fix any cluster IP addresses or DataKeeper mirror endpoints, which are based on IP addresses.

However, in my experience, I have never seen anyone actually follow the guidance above, so recovering a cluster in a different subnet will require a few additional steps after recovery before you can bring the cluster online.

Your First Failover Attempt

Recovery Plan

Because you are using Multi-VM Consistency, you have to failover your VMs using a Recovery Plan. The documentation provides pretty straightforward guidance on how to do that. A Recovery Plan groups the VMs you want to recover together to ensure they all failover together. You can even add multiple groups of VMs to the same Recovery Plan to ensure that your entire infrastructure fails over in an orderly fashion.

A Recovery Plan can also launch post recovery scripts to help the failover complete the recovery successfully. The steps I describe below can all be scripted out as part of your Recovery Plan, thereby fully automating the complete recovery process. We will not be covering that process in this blog post, but Microsoft documents this process.

Static IP Addresses

As part of the recovery process you want to make sure the new VMs have static IP addresses. You will have to adjust the interface properties in the Azure Portal so that the VM always uses the same address. If you want to add a public IP address to the interface you should do so at this time as well.
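
As a sketch of what that looks like with the Az PowerShell module (the NIC and resource group names are placeholders for your environment):

# Pin the recovered NIC's private IP allocation to Static
$nic = Get-AzNetworkInterface -Name "sqlnode1-nic" -ResourceGroupName "dr-rg"
$nic.IpConfigurations[0].PrivateIpAllocationMethod = "Static"
Set-AzNetworkInterface -NetworkInterface $nic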

Network Configuration

After the replicated VMs are successfully recovered in the DR site, the first thing you want to do is verify basic connectivity. Is the IP configuration correct? Are the instances using the right DNS server? Is name resolution functioning correctly? Can you ping the remote servers?

If there are any problems with network communications then the rest of the steps described below will be bound to fail. Don’t skip this step!

Load Balancer

As you probably know, clusters in Azure require you to configure a load balancer for client connectivity to work. The load balancer does not fail over as part of the Recovery Plan. You need to build a new load balancer based on the cluster that now resides in this new vNet. You can do this manually or script this as part of your Recovery Plan to happen automatically.

Network Security Groups 

Running in this new subnet also means that you have to specify what Network Security Group you want to apply to these instances. You have to make sure the instances are able to communicate across the required ports. Again, you can do this manually, but it would be better to script this as part of your Recovery Plan.

Fix the IP Cluster Addresses

If you are unable to make the changes described earlier to recover your instances in the same subnet, you will have to complete the following steps to update your cluster IP addresses and the DataKeeper addresses for use in the new subnet.

Every cluster has a core cluster IP address. What you will see if you launch the WSFC UI after a failover is that the cluster won’t be able to connect. This is because the IP address used by the cluster is not valid in the new subnet.

If you open the properties of that IP Address resource you can change the IP address to something that works in the new subnet. Make sure to update the Network and Subnet Mask as well.

Once you fix that IP Address you will have to do the same thing for any other cluster address that you use in your cluster resources.
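
The same change can be scripted. The sketch below is illustrative only; the resource name, address, subnet mask, and cluster network name are placeholders for your DR subnet.

# Point the core cluster IP address resource at an address valid in the new subnet
$ipResource = Get-ClusterResource -Name "Cluster IP Address"
$ipResource | Set-ClusterParameter -Multiple @{
    "Address"    = "10.1.0.42"
    "SubnetMask" = "255.255.255.0"
    "Network"    = "Cluster Network 1"
}
# Cycle the resource so the new parameters take effect
$ipResource | Stop-ClusterResource
$ipResource | Start-ClusterResource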

Fix the DataKeeper Mirror Addresses

SIOS DataKeeper mirrors use IP addresses as mirror endpoints. These are stored in the mirror and mirror job. If you recover a DataKeeper based cluster in a different subnet, you will see that the mirror comes up in a Resync Pending state. You will also notice that the Source IP and the Target IP reflect the original subnet, not the subnet of the DR site.

Fixing this issue involves running a command from SIOS called CHANGEMIRRORENDPOINTS. The usage for CHANGEMIRRORENDPOINTS is as follows.

emcmd <NEW source IP> CHANGEMIRRORENDPOINTS <volume letter> <ORIGINAL target IP> <NEW source IP> <NEW target IP>

In our example, the command and output looked like this.
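
For illustration only (these are not the author's actual values), assume volume S was mirrored from 10.0.0.4 to 10.0.0.5 in the original subnet and the recovered nodes now use 10.1.0.4 and 10.1.0.5. Following the usage above, the command would look something like this:

emcmd 10.1.0.4 CHANGEMIRRORENDPOINTS S 10.0.0.5 10.1.0.4 10.1.0.5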

After the command runs, the DataKeeper GUI will be updated to reflect the new IP addresses as shown below. The mirror will also go to a mirroring state.

Conclusions

You have now successfully configured and tested disaster recovery of your business critical applications using a combination of SIOS DataKeeper for high availability and Azure Site Recovery for disaster recovery. If you have questions, please leave me a comment or reach out to me on Twitter @daveberm.


Performance of Azure Shared Disk with Zone Redundant Storage (ZRS)

On September 9th, 2021, Microsoft announced the general availability of Zone-Redundant Storage (ZRS) for Azure Disk Storage, including Azure Shared Disk.

What makes this interesting is that you can now build shared storage based failover cluster instances that span Availability Zones (AZ).  With cluster nodes residing in different AZs, users can now qualify for the 99.99% availability SLA. Prior to support for ZRS, Azure Shared Disks only supported Locally Redundant Storage (LRS), limiting cluster deployments to a single AZ, leaving users susceptible to outages should an AZ go offline.

There are however a few limitations to be aware of when deploying an Azure Shared Disk with ZRS.

  • Only supported with premium solid-state drives (SSD) and standard SSDs. Azure Ultra Disks are not supported.
  • Azure Shared Disks with ZRS are currently only available in the West US 2, West Europe, North Europe, and France Central regions.
  • Disk caching, both read and write, is not supported with Premium SSD Azure Shared Disks.
  • Disk bursting is not available for premium SSDs.
  • Azure Site Recovery support is not yet available.
  • Azure Backup is available through Azure Disk Backup only.
  • Only server-side encryption is supported; Azure Disk Encryption is not currently supported.

I also found an interesting note in the documentation.

“Except for more write latency, disks using ZRS are identical to disks using LRS, they have the same scale targets. Benchmark your disks to simulate the workload of your application and compare the latency between LRS and ZRS disks.”

While the documentation indicates that ZRS will incur some additional write latency, it is up to the user to determine just how much additional latency they can expect. A link to a disk benchmark document is provided to help guide you in your performance testing.

Following the guidance in the document, I used DiskSpd to measure the additional write latency you might experience. Of course, results will vary with workload, disk type, instance size, etc., but here are my results.

                       Locally Redundant Storage (LRS)   Zone Redundant Storage (ZRS)
Write IOPS             5099.82                           4994.63
Average Latency (ms)   7.830                             7.998

The DiskSpd test that I ran used the following parameters. 


diskspd -c200G -w100 -b8K -F8 -r -o5 -W30 -d10 -Sh -L testfile.dat

I wrote to a P30 disk with ZRS and a P30 with LRS attached to a Standard DS3 v2 (4 vcpus, 14 GiB memory) instance type. The shared ZRS P30 was also attached to an identical instance in a different AZ and added as shared storage to an empty cluster application.

A 2% overhead seems like a reasonable price to pay to have your data distributed synchronously across two AZs. However, I did wonder what would happen if you moved the clustered application to the remote node, effectively putting your disk in one AZ and your instance in a different AZ. 

Here are the results.

                       Locally Redundant Storage (LRS)   Zone Redundant Storage (ZRS)   ZRS when writing from the remote AZ
Write IOPS             5099.82                           4994.63                        4079.72
Average Latency (ms)   7.830                             7.998                          9.800

In that scenario I measured a 25% write latency increase. If you experience a complete failure of an AZ, both the storage and the instance will failover to the secondary AZ and you shouldn’t experience this increase in latency at all. However, other failure scenarios that aren’t AZ wide could very well have your clustered application running in one AZ with your Azure Shared Disk running in a different AZ. In those scenarios you will want to move your clustered workload back to a node that resides in the same AZ as your storage as soon as possible to avoid the additional overhead.

Microsoft documents how to initiate a storage account failover to a different region when using GRS, but there is no way to manually initiate the failover of a storage account to a different AZ when using ZRS. You should monitor your failover cluster instance to ensure you are alerted any time a cluster workload moves to a different server and plan to move it back just as soon as it is safe to do so.
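
A simple spot check from PowerShell on one of the cluster nodes will show you where each workload currently lives:

# List each cluster group with its current owner node and state
Get-ClusterGroup | Select-Object Name, OwnerNode, State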

You can find yourself in this situation unexpectedly, but it will also certainly happen during planned maintenance of the clustered application servers when you do a rolling update. Awareness is the key to help you minimize the amount of time your storage is performing in a degraded state.

I hope in the future Microsoft allows users to initiate a manual failover of a ZRS disk the same as they do with GRS. The reason they added the feature to GRS was to put the power in the hands of the users in case automatic failover did not happen as expected. In the case of ZRS I could see people wanting to try to tie together storage and application, ensuring they are always running in the same AZ, similar to how host based replication solutions like SIOS DataKeeper do it.


Facebook, Instagram and WhatsApp just had a really bad Monday

It’s the end of the work day here on the east coast and I see that Facebook is still unavailable. Facebook acknowledged the problem in the following two tweets.

I can pinpoint the time that Facebook went offline for me. I was trying to comment on a post and my comment was not posting. I was a little annoyed and almost thought the poster had blocked me or was deleting my comment. This was at 11:45 am EDT. More than five hours later, Facebook is still down for me.

While we don’t know the exact cause of the downtime, and whether it was user error, some nefarious assault, or just an unexpected calamity of errors, we can learn a few things about this outage at this point.

Downtime is expensive

While we may never know the exact cost of the downtime experienced today, there are a few costs that can already be measured. As of this writing, Facebook stock went down 4.89% today. That’s on top of an already brutal September for Facebook and other tech stocks.

The correction may have been inevitable, but the outage today certainly didn’t help matters.

But what was the real cost to the company? With many brands leveraging social media as an important part of their marketing outreach, how will this outage impact future advertising spend? Minimally, I anticipate advertisers will investigate other social media platforms if they have not done so already. Only time will tell, but even before this outage we had seen more competition for marketing spend from other platforms such as TikTok.

Plan for the worst-case scenario

Things happen; we know that and plan for that. Business Continuity Plans (BCP) should be written to address any possible disaster. Again, we don’t know the exact cause of this particular disaster, but I would have to imagine that an RTO of 5+ hours is not written into any BCP that sits on the shelf at Facebook, Instagram or WhatsApp.

What’s in your BCP? Have you imagined every possible disaster? Have you measured the impact of downtime and defined adequate recovery time objectives (RTO) and recovery point objectives (RPO) for each component of your business? I would venture to say that it’s impossible to plan for every possible thing that can go wrong. However, I would advise everyone to revisit their BCP on a regular basis and update it to include disasters that maybe weren’t on the radar the last time it was reviewed. Did you have a global pandemic in your BCP? If not, you may have been left scrambling to accommodate a “work from home” workforce. The point is, plan for the worst and hope for the best.

Communications in a disaster

Communications in the event of a disaster should be its own chapter in your BCP.

One Facebook employee told Reuters that all internal tools were down. Facebook’s response was made much more difficult because employees lost access to some of their own tools in the shutdown, people tracking the matter said.

Multiple employees said they had not been told what had gone wrong.

https://www.reuters.com/technology/facebook-instagram-down-thousands-users-downdetectorcom-2021-10-04/

A truly robust BCP must include multiple fallback means of communication. This becomes much more important as your business spreads out across multiple buildings, regions or countries. Just think about how your team communicates today. Phone, text, email, Slack might be your top four. But what if they are all unavailable, how would you reach your team? If you don’t know, you may want to start investigating other options. You may not need a shortwave radio and a flock of carrier pigeons, but I’m sure there is a government agency that keeps both of those on hand for a “break glass in case of emergency” situation.

Summary

You have a responsibility to yourself, your customers and your investors to make sure you take every precaution concerning the availability of your business. Make sure you invest adequate resources in creating your BCP and that the teams responsible for business continuity have the tools they need to ensure they can do their part in meeting the RTO and RPO defined in your BCP.


AWS BYOC – Bring your own SQL Server cluster with AWS EC2 and SIOS

Nice blog post from SQL Server expert Steve Rezhener.

DataSteve


Introduction

There are multiple options to implement an HADR solution for SQL Server in the AWS public cloud. The easiest way to do that is to use the AWS Database as a Service (DBaaS) product known as AWS RDS and enable the Multi-AZ option using your AWS Control Panel (Fig #1).

Fig #1

That is all it takes to roll out or add it to an existing instance. AWS will take care of the rest under the hood (depending on your SQL Server version, it might use Mirroring or Always On Availability Groups). It will provision all the necessary things for an automatic failover (witness, network, storage, etc…), so when the primary node goes down, it will be replaced by the secondary node. No ifs and no buts. This blog post is going to discuss how to build your own HADR solution using multiple AWS EC2s (BYOC) instead of a managed AWS…

View original post 1,104 more words


Clustering SAP #ASCS and #ERS on #AWS using Windows Server Failover Clustering

When ensuring high availability for SAP ASCS and ERS running on Windows Server, the primary cluster solution you will want to use is Windows Server Failover Clustering. However, when deploying this in AWS you will quickly discover that there are a few obstacles you need to know how to overcome.

I recently wrote this Step-by-Step guide that was published on the SAP blog that walks you through the entire process. If you have any questions, please leave a comment.


Changing a Drive Letter with PowerShell

Here is a short but sweet post on how to change the drive letter of a partition. Despite using my best Google skills, I couldn’t find an example that worked for me. I rolled up my sleeves and figured it out on my own. I hope this helps someone out there.

Set-Partition -DiskNumber 4 -PartitionNumber 1 -NewDriveLetter X

As long as you know the DiskNumber and PartitionNumber this will immediately change the drive letter of the partition you specify.
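
If you're not sure which numbers to use, a quick way to list them:

# Show disk number, partition number, current drive letter, and size for each partition
Get-Partition | Select-Object DiskNumber, PartitionNumber, DriveLetter, Size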

You may also need to import the Storage module into Powershell before you can do this.

Import-Module -Name Storage

Let me know if this helped you!


Using Amazon FSx for SQL Server Failover Cluster Instances – What You Need to Know!

Intro

If you are considering deploying your own Microsoft SQL Server instances in AWS EC2 you have some decisions to make regarding the resiliency of the solution. Sure, AWS will offer you a 99.99% SLA on your Compute resources if you deploy two or more instances across different availability zones. But don’t be fooled, there are a lot of other factors you need to consider when calculating your true application availability. I recently blogged about how to calculate your application availability in the cloud. You probably should have a quick read of that article before you move on.

When it comes to ensuring your Microsoft SQL Server instance is highly available, it really comes down to two basic choices: Always On Availability Group (AG) or SQL Server Failover Cluster Instance (FCI). If you are reading this article I’m making an assumption you are well aware of both of these options and are seriously considering using a SQL Server FCI instead of a SQL Server Always On AG.

Benefits of a Microsoft SQL Server Failover Cluster Instance

The following list summarizes what AWS says are the benefits of a SQL Server FCI:

FCI is generally preferable over AG for SQL Server high availability deployments when the following are priority concerns for your use case:

License cost efficiency: You need the Enterprise Edition license of SQL Server to run AGs, whereas you only need the Standard Edition license to run FCIs. This is typically 50–60% less expensive than the Enterprise Edition. Although you can run a Basic version of AGs on Standard Edition starting from SQL Server 2016, it carries the limitation of supporting only one database per AG. This can become a challenge when dealing with applications that require multiple databases like SharePoint.

Instance-level protection versus database-level protection: With FCI, the entire instance is protected – if the primary node becomes unavailable, the entire instance is moved to the standby node. This takes care of the SQL Server logins, SQL Server Agent jobs, certificates, etc. that are stored in the system databases, which are physically stored in shared storage. With AG, on the other hand, only the databases in the group are protected, and system databases cannot be added to an AG – only user databases are allowed. It is the database administrator’s responsibility to replicate changes to system objects on all AG replicas. This leaves the possibility of human error causing the database to become inaccessible to the application.

DTC feature support: If you’re using SQL Server 2012 or 2014, and your application uses Distributed Transaction Coordinator (DTC), you are not able to use an AG as it is not supported. Use FCI in this situation.

https://aws.amazon.com/blogs/storage/simplify-your-microsoft-sql-server-high-availability-deployments-using-amazon-fsx-for-windows-file-server/

Challenges with FCI in the Cloud

Of course, the challenge with building an FCI that spans availability zones is the lack of a shared storage device that is normally required when building a SQL Server FCI. Because the nodes of the cluster are distributed across multiple datacenters, a traditional SAN is not a viable option for shared storage. That leaves us with two choices for cluster storage: third-party storage class resources like SIOS DataKeeper or the new Amazon FSx. Let’s take a look at what you need to know before you make your choice.

Service Level Agreement

As I wrote in how to calculate your application availability, your overall application SLA is only as good as your weakest link. In this case, the weakest link is the FSx SLA of 99.9%.

Normally 99.99% availability represents the starting point of what is considered “highly available”. This is what AWS promises you for your compute resources when two or more are deployed in different availability zones.

In case you didn’t know the difference between three nines and four nines…

  • 99.9% availability allows for 43.83 minutes of downtime per month
  • 99.99% availability allows for only 4.38 minutes of downtime per month

By hosting your cluster storage on FSx, despite your 99.99% compute availability, your overall application availability will be 99.9%. In contrast, EBS volumes that span availability zones, such as in a DataKeeper deployment, qualify for the 99.99% SLA at both the storage and compute layers, meaning your overall application availability is 99.99%.

Storage Location

When configuring FSx for high availability, you will want to enable multi-AZ support. By enabling multi-AZ, you effectively have a “preferred” AZ and a “standby” AZ. When you deploy your SQL Server FCI nodes, you will want to distribute those nodes across the same AZs.

Now in normal situations, you will want to make sure the active cluster node resides in the same AZ as the preferred FSx storage node. This is to minimize the distance and latency to your storage, but also to minimize the costs associated with data transfer across AZs. As specified in the FSx price guide, “Standard data transfer fees apply for inter-AZ or inter-region access to file systems.”

In the unfortunate circumstance where you have a SQL Server FCI failure, but not an FSx failure, there is no mechanism to tie the storage and compute together. In the event that FSx fails over, it will automatically fail back to the primary availability zone. However, best practices dictate that the SQL FCI remain running on the secondary node until root cause analysis is performed, and failback is typically scheduled to occur during maintenance periods. This leaves you in a situation where your storage resides in a different AZ, which will incur additional costs. Currently the cost for transferring data across AZs, both ingress and egress, is $0.01/GB.

Without keeping a close eye on the state of your FSx and SQL Server FCI, you may not even be aware that they are running in different availability zones until you see the data transfer charge at the end of the month.

In contrast, in a configuration that uses SIOS DataKeeper, the storage failover is part of the SQL Server FCI recovery, ensuring that the storage always fails over with the SQL Server instance. This ensures SQL Server is always reading and writing to the EBS volumes that are directly attached to the active node. Keep in mind, DataKeeper will incur a data transfer cost associated with write operations that are replicated between AZs or regions. This data transfer cost can be minimized with the use of compression available in DataKeeper.

Controlling Failover

In an FSx multi-subnet configuration there is a preferred availability zone and a standby availability zone. Should the FSx file server in the preferred availability zone experience a failure, the file server in the standby AZ will take over. AWS reports that this recovery takes about 30 seconds with standard shares. With the use of continuously available file shares, Microsoft reports this failover time can be closer to 15 seconds. During this failover, there is a brownout where reads and writes are paused, but they will continue once recovery completes.

FSx multi-site has automatic failback enabled, meaning that for every unplanned failover of FSx, you also have an unplanned failback. In contrast, when a SQL Server FCI experiences an unplanned failover, you would typically either just leave it running on the secondary node or schedule a failback after hours or during the next maintenance period.

SQL Server Analysis Services Cluster Not Supported with FSx

If you want to cluster SSAS, you will need a clustered disk resource like SIOS DataKeeper. The How to Cluster SQL Server Analysis Server white paper clearly states that SMB cannot be used and that cluster drives with drive letters must be used. In contrast, the DataKeeper Volume resource presents itself as a clustered disk and can be used with SSAS.

Summary

While FSx can certainly make sense for typical SMB uses like Windows user files and other non-critical services where a 99.9% availability SLA suffices, if your application requires high availability (99.99%) or an HA/DR solution that also spans regions, SIOS DataKeeper is the right fit.


Calculating Application Availability in the Cloud

When deploying business critical applications in the cloud you want to make sure they are highly available. The good news is that if you plan properly, you can achieve 99.99% (4-nines) of availability or more. However, calculating your true availability may not be as straightforward as it seems.

When considering availability you must consider the key components that make access to your application possible, which I’ll call the availability chain. The components of the availability chain are:

  • Compute
  • Network 
  • Storage
  • Application
  • Dependent services

Your application is only as available as your weakest link, and your downtime increases with each additional link you add to the chain. Let’s examine each of the links.

Compute Availability

Each of the three major cloud service providers have some similarities. One thing in common across all three platforms is the service level agreements (SLA) they will commit to for compute.

The SLA for all three public cloud providers for VMs when you have two or more VMs configured across different availability zones is 99.99%. Keep in mind, this SLA only guarantees the remote accessibility of one of the VMs at any given time, it makes no promises as to the availability of the services or application(s) running inside the VM. If you deploy a single VM within a single datacenter, this SLA varies from “90% of each hour” (AWS) to 99.5% (Azure and GCP) or 99.9% (Azure single VM when using Premium SSD).

True high availability starts at 99.99%, so the first step is to ensure your application is available is to make sure the application is distributed across two or more VMs that span availability zones. With two VMs spread across two availability zones, giving you 99.99% availability of at least one of those VMs, you could theorize that if you had three VMs spread across three availability zones your availability would be even greater than 99.99%. Although the cloud providers’ SLA will never guarantee beyond 99.99% availability regardless of the number of availability zones in use, if you use pure statistics you might come to the conclusion that your availability could jump to as high as 99.999999% or 8-nines of availability, 26.30 milliseconds downtime per month.

1-(.0001*.0001) = .99999999

99.999999% availability with three availability zones?

Don’t go around quoting that number, but just keep in mind that it makes sense that if two availability zones can give you 99.99% availability, it stands to reason that three availability zones is going to give you something significantly more than 99.99% availability.

Compute is just one link in the availability chain. We still have to address network, storage and other dependent services, which all represent possible points of failure.

Network Availability

In order for your application to be available, every network hop between the client and the application and all the resources that the application depends on, must be available and working within tolerable latency ranges. You need to understand the network links between database servers, application servers, web servers and clients to know precisely where the network might fail. And remember, the more links in your availability chain the lower your overall availability will be.

Although network availability between VMs in the same vNet is covered under the standard compute SLA, there are other network services that you may be utilizing. Here are just a few examples of network services that would impact overall application availability.

Express Route – 99.95%
VPN Gateway – 99.9% through 99.95%
Load Balancer – 99.99%
Traffic Manager – 99.99%
Elastic Load Balancer – 99.99%
Direct Connect – 99.9% – 99.99%

Building on what we have learned so far, let’s take a look at the availability of an application that is deployed across two availability zones. 

99.99% compute availability

99.99% load balancer availability

.9999 * .9999 = .9998

99.98% availability = ~9 minutes downtime per month

Now that we have addressed compute and network availability, let’s move on to storage.

Storage Availability

Now here is where the story gets a little hairy. Have a look at the following storage SLAs:

https://azure.microsoft.com/en-us/support/legal/sla/storage/v1_5/

https://cloud.google.com/storage/sla

https://aws.amazon.com/compute/sla/

It seems pretty clear that Azure and Google give you a 99.9% SLA on block storage solutions. AWS doesn’t mention EBS specifically here; they only talk about VMs and measure their single-instance VM availability by the hour instead of by the month as the other cloud providers do. For the sake of discussion, let’s use the 99.9% availability guarantee that both Azure and GCP have published.

Building upon our previous example, let’s add some storage to the equation.

99.99% compute availability

99.99% load balancer availability

99.9% managed disk

.9999 * .9999 * .999 = .9988

99.88% availability = ~53 minutes of downtime per month.

53 minutes of downtime is a lot more than the 9 minutes of downtime we calculated in our previous example. What can we do to minimize the impact of the 99.9% storage availability? We have to build more redundancy in the storage! 

Fortunately, we usually include storage redundancy when planning for application availability. For instance, when we stand up web servers, each web server typically stores data on its locally attached disk. When deploying domain controllers, Microsoft Active Directory takes care of replicating AD information across all the domain controllers. In the case of something like SQL Server, we leverage things like Always On Availability Groups or SIOS DataKeeper to keep the data in sync across locally attached disks.

The more copies of the data we have distributed across different availability zones, the more likely we will be able to survive a failure.

For example, an application that stores its data across two different disks in different availability zones will benefit from the redundancy and instead of 99.9% availability it is more likely to achieve 99.9999% availability of the storage.

1 – (.001 * .001) = .999999

If we throw that into the previous equation the picture starts to look a little brighter.

.9999 * .9999 * .999999 = .9998

99.98% availability = ~9 minutes of downtime

By duplicating the data across multiple AZs, and therefore multiple disks, we have effectively mitigated the downtime associated with cloud storage.
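
If you want to play with the math yourself, here is a small PowerShell sketch of the chained-availability calculation above, assuming a 30-day month:

# Availability of independent links multiplies; downtime follows from the remainder
$compute         = 0.9999
$loadBalancer    = 0.9999
$mirroredStorage = 1 - (0.001 * 0.001)   # two independent 99.9% disks
$overall = $compute * $loadBalancer * $mirroredStorage
$downtimeMinutes = (1 - $overall) * 30 * 24 * 60
"{0:P4} availability, roughly {1:N0} minutes of downtime per month" -f $overall, $downtimeMinutes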

Application and Dependent Services Availability

You’ve done all you can do to ensure compute, network, and storage availability. But what about the application itself? Some applications can scale out and provide redundancy by load balancing between multiple instances of the same application. Think of your typical web server farm where you may load balance web requests between five servers. If you lose one server, the load balancer simply removes it from its rotation until it is once again responsive.

Other applications require a little more care and monitoring. Take SQL Server for instance. Typically Always On Availability Groups or Failover Cluster Instances are used to monitor database availability and take recovery actions should a database become unresponsive due to application or system level failures. While there is no published SLA for SQL Server availability solutions, it is commonly accepted that when configured properly for high availability, a SQL Server can provide 99.99% availability.

Other cloud-based services that you rely on, like hosted Active Directory, hosted DNS, microservices, or even the availability of the cloud portal itself, should all be factored into your overall availability equation.

Summary

Application availability is the sum of all the moving parts. Skimping in just one area can exponentially impact the overall availability of your application. Take your time and investigate all the links in your availability chain for weakness including compute, network, storage, application and dependent services.

In general, the numbers presented here are hopefully worst-case scenarios and your actual availability should exceed the published SLAs. Do your homework and be wary of any service that cannot guarantee 99.99% availability, the typical threshold of what is considered highly available.

Human error and security were not addressed in this article. You can make your application as highly available as possible, but if you have not taken steps to secure your application against external threats and stupid human mistakes then all bets are off when it comes to availability.
