How can asynchronous replication be used in a multi-site cluster? Isn’t the data out of sync?

I have been asked this question more than a few times, so I thought I would answer it in my first blog post.  The basic answer is yes, you can lose data in an unexpected failure when using asynchronous replication in a multi-site cluster.  In an ideal world, every company would have a dark fiber connection to their DR site and use synchronous replication with their multi-site cluster, eliminating the possibility of data loss.  However, the reality is that in many cases, the WAN connectivity to the DR site has too much latency to support synchronous replication.  In such cases, asynchronous replication is an excellent alternative.

There are more than a few options when choosing an asynchronous replication solution to use with your WSFC multi-site cluster, including array-based solutions from companies like EMC, IBM, HP, etc., and host-based solutions, like the one that is near and dear to me, “SteelEye DataKeeper Cluster Edition”.  Since I know DataKeeper best, I will explain how this all works from DataKeeper’s perspective.

When using SteelEye DataKeeper and asynchronous replication, we allow a certain number of writes to be stored in the async queue.  The number of writes that can be queued is determined by the “high water mark”, an adjustable value used by DataKeeper to determine how much data can sit in the queue before the mirror state changes from “mirroring” to “paused”.  A “paused” state is also entered any time there is a communication failure between the primary and secondary servers.  While in a paused state, automatic failover in a multi-site cluster is disabled, limiting the amount of data that can be lost in an unexpected failure.  If the original data set is deemed “lost forever”, then the remaining data on the target server can be manually unlocked and the cluster node can then be brought into service.

While in the “paused” state, DataKeeper allows the async queue to drain until it reaches the “low water mark”, at which point the mirror enters a “resync” state until all of the data is once again in sync.  At that point, the mirror returns to the “mirroring” state and automatic failover is once again enabled.
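
To make the behavior described in the last two paragraphs concrete, here is a minimal sketch of that state machine in Python.  This is not DataKeeper code; the class, the method names, and the water mark values are all illustrative placeholders.

```python
from collections import deque
from enum import Enum


class MirrorState(Enum):
    MIRRORING = "mirroring"  # writes flow through the async queue normally
    PAUSED = "paused"        # queue passed the high water mark, or the link failed
    RESYNC = "resync"        # backlog drained to the low water mark; catching up


class AsyncMirrorModel:
    """Toy model of the mirror states described above (not DataKeeper code)."""

    def __init__(self, high_water_mark=2000, low_water_mark=150):
        # Placeholder values, not DataKeeper defaults.
        self.high_water_mark = high_water_mark
        self.low_water_mark = low_water_mark
        self.queue = deque()
        self.state = MirrorState.MIRRORING

    @property
    def automatic_failover_enabled(self):
        # Automatic failover is only permitted while the mirror is "mirroring".
        return self.state == MirrorState.MIRRORING

    def write(self, block):
        """A write on the source lands in the async queue before being shipped."""
        self.queue.append(block)
        if self.state == MirrorState.MIRRORING and len(self.queue) > self.high_water_mark:
            # Too much unsent data: pause the mirror, which disables automatic failover.
            self.state = MirrorState.PAUSED

    def communication_failure(self):
        # A broken link between primary and secondary also pauses the mirror.
        self.state = MirrorState.PAUSED

    def ship_one_write(self):
        """Simulate one queued write reaching the secondary."""
        if self.queue:
            self.queue.popleft()
        if self.state == MirrorState.PAUSED and len(self.queue) <= self.low_water_mark:
            # Enough backlog has drained: resynchronize the remaining delta.
            self.state = MirrorState.RESYNC
        if self.state == MirrorState.RESYNC and not self.queue:
            # Fully caught up: back to "mirroring", automatic failover re-enabled.
            self.state = MirrorState.MIRRORING
```

The key takeaway from the sketch is that automatic failover is tied to the “mirroring” state; whenever the queue backs up past the high water mark or the link drops, the cluster stops trusting the secondary copy until it has caught back up.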

As long as your WAN link is not saturated or broken, you should never see more than a few writes in this async queue at any given time.  In an unexpected failure (think pulled power cord), you will lose any writes that are in the async queue.  This is the trade-off you make when you want the awesome recovery point objective (RPO) and recovery time objective (RTO) that you achieve with a multi-site cluster, but your WAN link has too much latency to effectively support synchronous replication.
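
If you want to put a rough number on that exposure, the math is simple: multiply the queue depth at the moment of failure by your average write size.  The figures below are made up purely for illustration.

```python
# Back-of-the-envelope estimate of the data at risk in an unexpected failure.
# Both inputs are hypothetical; substitute the values you actually observe.
queued_writes = 25        # writes sitting in the async queue when the plug is pulled
avg_write_size_kb = 64    # average size of a queued write, in KB

data_at_risk_kb = queued_writes * avg_write_size_kb
print(f"Worst-case data loss: ~{data_at_risk_kb} KB ({data_at_risk_kb / 1024:.2f} MB)")
```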

If you take the time to monitor the DataKeeper Async Queue via the Windows Performance Logs and Alerts, I think you will be pleasantly surprised to find that most of the time the async queue is empty, thanks to the efficiency of the DataKeeper replication engine.  Even in times of heavy writes, the async queue seldom grows very large and always drains almost immediately, so the amount of data that is at risk at any given time is minimal.  When you consider that the alternative in a disaster may be to restore from last night’s backup, the number of writes you could lose in an unexpected failure using asynchronous replication is minimal!
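
If you would rather collect those numbers from a script than from the Performance Logs and Alerts GUI, the built-in typeperf tool can sample the same counters.  The counter path below is a placeholder; run `typeperf -q` on the DataKeeper server to find the exact object and counter names your installation exposes for the async queue.

```python
import subprocess

# Placeholder path: replace the object, instance, and counter names with the ones
# `typeperf -q` reports for DataKeeper's async queue on your server.
QUEUE_COUNTER = r"\<DataKeeper object>(<mirrored volume>)\<queue length counter>"

# Take 60 one-second samples of the queue length and write them to a CSV file
# that can be charted or reviewed later.
subprocess.run(
    ["typeperf", QUEUE_COUNTER, "-si", "1", "-sc", "60", "-f", "CSV", "-o", "dk_queue.csv"],
    check=True,
)
```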

Of course, there are some instances where losing even a single write is not tolerable.  In those cases, it is recommended to use SteelEye DataKeeper’s synchronous replication option across a high-speed, low-latency LAN or WAN connection.
