You are the weakest link – Goodbye!

When building HA clusters, your application availability is only as good as its weakest link. What this means is that if you bought great servers with redundant everything (CPU, fans, power, RAID, RAM, etc.), added a super deluxe SAN with multi-path connectivity and multiple SAN switches, and clustered your application with your favorite clustering software, you probably have a very reliable application – right? Well, not necessarily. Are the servers plugged into the same UPS? Are they on the same network switch? Are they cooled by the same AC unit? Are they in the same building? Is your SAN truly reliable? Any one of these issues, among others, is a single point of failure in your cluster configuration and needs to be considered.

Of course, you have to know when “good enough” is “good enough”. Your budget and your SLAs will help decide what exactly is good enough. However, one area where I am concerned people may be skimping is storage. With the advent of cheap or free iSCSI target software solutions, I am seeing some people recommend that you just throw some iSCSI target software on a spare server and voilà – instant shared storage.

Mind you, I’m not talking about OEM iSCSI solutions that have built-in failover technology and/or other availability features, or even storage virtualization solutions such as FalconStor. I’m talking about the guy who has a server running Windows Server 2008 that he has loaded up with storage and wants to turn into an iSCSI target. While this is great in a lab, if you are serious about HA, you should think again. Even Microsoft only provides their iSCSI target software to qualified OEM builders experienced in delivering enterprise-class storage arrays.

First of all, this is Windows, not some hardened OS built solely to serve storage. It will require maintenance, security updates, hardware fixes, etc. It basically has the same reliability as the application server you are trying to protect – does it make sense to cluster your application servers yet use the same class of server and OS to host your storage? You have basically moved your single point of failure from your application server to your storage server – not a smart move as far as I am concerned.

Some of the enterprise-class iSCSI target software includes synchronous and/or asynchronous replication and snapshot capabilities. While this functionality certainly helps in terms of your recovery point objective (RPO), it won’t help your recovery time objective (RTO) unless the failover is automatic and seamless to your clustering software. If the primary iSCSI storage array fails in the middle of the night, who is going to be there to activate the replicated copy? You may be down for quite some time before you even realize there is a problem. Again, this may be “good enough”; you just need to be aware of what you are signing up for.
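
To put some rough numbers on the RPO/RTO distinction, here is a back-of-the-envelope sketch in Python. Every figure in it – the replication lag, the detection time, how long it takes to page an admin – is an assumption for illustration, not a measurement:

```python
# Back-of-the-envelope RPO/RTO comparison. Every number here is an
# assumption for illustration, not a measurement.

replication_lag_s = 5          # assumed async replication lag -> bounds the RPO
detect_failure_s = 60          # assumed time for monitoring to notice the outage

# Manual recovery: someone has to be paged, wake up, log in, and
# activate the replicated copy by hand.
page_admin_s = 30 * 60         # assumed wait for a human at 3 a.m.
activate_replica_s = 15 * 60   # assumed manual switchover steps

# Automatic recovery: the clustering software promotes the standby itself.
auto_failover_s = 2 * 60       # assumed scripted switchover time

rpo_s = replication_lag_s      # same replication either way, so same RPO
rto_manual_s = detect_failure_s + page_admin_s + activate_replica_s
rto_auto_s = detect_failure_s + auto_failover_s

print(f"RPO (either case):    ~{rpo_s} s of potential data loss")
print(f"RTO, manual failover: ~{rto_manual_s / 60:.0f} min")
print(f"RTO, automatic:       ~{rto_auto_s / 60:.0f} min")
```

The replication is identical in both cases, which is exactly the point: replication alone buys you RPO, while only automated failover buys you RTO.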

One thing you can do to improve the reliability of your iSCSI target server is to use a replication product such as SteelEye DataKeeper Cluster Edition to eliminate the single point of failure. Let me illustrate.

Figure 1 – Typical Shared Storage Configuration. In the event that the iSCSI target becomes unavailable, all the nodes go offline.

If we take the same configuration shown above and add a hot standby iSCSI target using SteelEye DataKeeper Cluster Edition to do replication AND automatic failover, you have just given your iSCSI target solution a whole new level of availability. That solution would look very much like this.

Figure 2 – In this scenario, DataKeeper Cluster Edition is replicating the iSCSI attached volume on the active node to the iSCSI attached volume on the passive node, which is connected to an entirely different iSCSI target server.

The key difference between the solution which utilizes SteelEye DataKeeper Cluster Edition and the replication solutions provided by some iSCSI target vendors is the integration with WSFC (Windows Server Failover Clustering). The question to ask of your iSCSI solution vendor is this…

What happens if I pull the power cord on the active iSCSI target server?

If the recovery process is a manual procedure, it is not a true HA solution. If it is automatic and completely integrated with WSFC, then you have a much higher level of availability and have eliminated the iSCSI array as a single point of failure.
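
To make that concrete, here is a minimal sketch of what “automatic” means: no human in the loop. This is not how WSFC or DataKeeper is actually implemented – the probe and the promotion step below (`target_is_alive`, `promote_standby`) are hypothetical placeholders – but it illustrates the loop that replaces the 3 a.m. phone call:

```python
import socket
import time

ISCSI_PORT = 3260  # well-known iSCSI target port

def target_is_alive(host: str, timeout_s: float = 2.0) -> bool:
    """Crude health probe: can we still open a TCP connection to the
    iSCSI port? A real cluster uses far richer health checks."""
    try:
        with socket.create_connection((host, ISCSI_PORT), timeout=timeout_s):
            return True
    except OSError:
        return False

def promote_standby(host: str) -> None:
    """Placeholder: a real solution would bring the replicated volume
    online and redirect the initiators; here we only log the event."""
    print(f"FAILOVER: promoting standby target {host}")

def failover_watchdog(active: str, standby: str, interval_s: float = 5.0) -> None:
    """The 'pull the power cord' test as a loop: if the active target
    dies, the standby is promoted with no human in the loop."""
    while True:
        if not target_is_alive(active):
            promote_standby(standby)
            active, standby = standby, active  # roles swap after failover
        time.sleep(interval_s)
```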


Hyper-V Multi-Site Cluster with Windows Server 2008 R2 and SteelEye DataKeeper Cluster Edition

I downloaded the Windows Server 2008 R2 bits this afternoon and spent the rest of the afternoon installing it and testing the long-awaited Live Migration. Well, as I have seen in the numerous demos, it works great! And to top it off, I did a Live Migration in a multi-site cluster with DataKeeper Cluster Edition and it all worked great, as expected. Keep your eyes peeled for a new video that will show you how to set up a Hyper-V disaster recovery configuration using Windows Server 2008 R2 in a multi-site cluster with SteelEye DataKeeper. I’ll post the link just as soon as it is ready.


Making sense of virtualization availability options

With the recent release of Microsoft Windows Server 2008 R2 and vSphere 4.0, I thought it was a good time to review some of the options available when considering the availability of your virtual servers and the applications running on them. I will also take this opportunity to describe some of the features that enable virtual machine availability. I have grouped these features into their functional roles to help highlight their purpose.

Planned Downtime

Live Migration and VMware’s VMotion are both solutions that allow an administrator to move a virtual machine from one physical server to another with no perceivable downtime. The key thing to remember about this technology is that in order to move a virtual machine from one server to another without any downtime, the move must be a planned event. The reason that it must be a planned event is that the virtual machine’s memory must be synchronized between the servers before the actual switchover can occur. This is true of both Microsoft’s and VMware’s solutions. Also keep in mind that both of these technologies require the use of shared storage to hold the virtual hard disks (VMDK and VHD files), which limits Live Migration and VMotion to local area networks. This also means that any downtime planned for the storage array must be handled in a different way if you want to limit the impact to your virtual machines.
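
Both vendors’ planned-migration technologies rely on some variant of iterative “pre-copy”: memory pages are copied while the VM keeps running, pages the guest dirties get re-sent, and the VM is paused only for the small final remainder. Here is a toy model of that loop – every constant in it is invented for illustration:

```python
# Toy model of iterative "pre-copy" memory migration. All constants are
# invented; real hypervisors track dirty pages in hardware and adapt rates.

TOTAL_PAGES = 1_000_000    # assumed guest memory size, in pages
COPY_RATE = 200_000        # assumed pages transferable per round
DIRTY_FRACTION = 0.05      # assumed share of sent pages re-dirtied per round
STOP_THRESHOLD = 5_000     # pause the VM once this few dirty pages remain

dirty = TOTAL_PAGES
round_no = 0
while dirty > STOP_THRESHOLD:
    round_no += 1
    sent = min(dirty, COPY_RATE)
    redirtied = int(sent * DIRTY_FRACTION)  # guest kept running and writing
    dirty = dirty - sent + redirtied
    print(f"round {round_no}: sent {sent} pages, {dirty} still dirty")

# Only now is the VM briefly paused for the final copy -- this is why the
# move must be planned, and why both hosts need access to the same storage.
print(f"pause VM, copy final {dirty} pages, resume on destination host")
```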

Unplanned Downtime

Microsoft’s Windows Server Failover Clustering and VMware’s High Availability (HA) are the solutions available to protect virtual machines in the event of unplanned downtime. Both solutions are similar in that they monitor virtual machines for availability, and in the case of a failure the VMs are restarted on a standby node. This recovery process requires that the machines be rebooted, since there was no time to sync the memory before failover.
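
Unlike a planned Live Migration, the memory state is simply lost, so recovery amounts to a cold boot on the standby node. A quick sketch of what that downtime is made of – all of the numbers are assumptions:

```python
# What unplanned-failover downtime is made of (all numbers assumed).
# There is no memory to hand over, so the dominant cost is the reboot.

heartbeat_interval_s = 1     # assumed cluster heartbeat period
missed_beats_to_fail = 5     # assumed misses before the node is declared dead
arbitration_s = 2            # assumed time to decide where the VM restarts
vm_boot_s = 90               # assumed OS boot plus service startup in the VM

downtime_s = (heartbeat_interval_s * missed_beats_to_fail
              + arbitration_s + vm_boot_s)
print(f"expected unplanned-failover downtime: ~{downtime_s} s, "
      f"~{vm_boot_s / downtime_s:.0%} of it the reboot itself")
```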

Disaster Recovery

How do I recover my virtual machines in the event of a complete site loss? The good news is that virtualization makes this process a whole lot easier, since a virtual machine is just a file that can be picked up and moved to another server. Up to this point VMware and Microsoft are pretty similar in their availability features and functionality, but here is where Microsoft really shines. VMware offers Site Recovery Manager, which is a fine product, but it only supports SRM-certified array-based replication solutions. Also, the failover and failback process is not trivial and can take the better part of a day to do a complete round trip from the DR site back to the primary data center. It does have some nice features like DR testing, but in my experience Microsoft has the much better solution when it comes to disaster recovery.

Microsoft’s Hyper-V DR solution is Windows Server Failover Clustering in a multi-site cluster configuration (see video demonstration). In this configuration the performance and behavior are the same as a local area cluster, yet it can span data centers. What this means is that you can actually move your virtual machines across data centers with little to no perceivable downtime. Failback is the same process: just point and click to move the virtual machine resource back to the primary data center. While there is no built-in “DR testing”, I think it is preferable to be able to do an actual DR test in just a minute or two with no perceivable downtime. The other thing I like about WSFC multi-site clusters is that the replication options include not only array-based replication vendors, but also host-based replication vendors. This gives you a wide range of replication solutions in all price ranges and does not require that you upgrade your existing storage infrastructure.

Fault Tolerance

Fault tolerance basically eliminates the need to reboot a virtual machine in the event of an unexpected failure. VMware has the edge here in that it offers VMware FT. There are a few other 3rd party hardware and software vendors that play in this space as well. There are plenty of limitations and requirements when it comes to implementing FT systems, but if you need to ensure that a hardware component failure results in zero downtime vs. the minute or two it takes to boot up a VM in a standard HA configuration, then this is an option you may want to consider. You probably want to make sure that your existing servers are already chock full of hot standby CPUs, RAM, power supplies, etc., and that you have redundant paths to the network and storage; otherwise you may be throwing good money after bad. Fault tolerance is great for protection from hardware failures, but what happens if your application or the virtual machine’s operating system is behaving badly? That is when you need application-level clustering, as described below.
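
The general idea behind this class of products is lockstep execution via record and replay: the primary records every non-deterministic input and streams it to a secondary, which replays the identical sequence and therefore holds identical state at all times. The sketch below is not any vendor’s actual implementation, just a toy illustration of the principle:

```python
from queue import Queue

# Toy illustration of lockstep fault tolerance via record/replay
# (not any vendor's actual implementation). The primary records each
# non-deterministic input; the secondary replays the same sequence,
# so its state always matches and a failover needs no reboot.

log_channel: Queue = Queue()   # stands in for the link between the two hosts

def primary_step(state: int, external_input: int) -> int:
    """Execute one step and record the input for the secondary to replay."""
    log_channel.put(external_input)
    return state + external_input

def secondary_step(state: int) -> int:
    """Replay the recorded input; state stays identical to the primary's."""
    return state + log_channel.get()

p_state = s_state = 0
for tick in range(3):
    p_state = primary_step(p_state, external_input=tick * 7)
    s_state = secondary_step(s_state)

assert p_state == s_state  # secondary could take over with zero lost state
print(f"both replicas at state {p_state}; failover would need no reboot")
```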

Application Availability

Everything I have discussed up to this point really only takes into consideration the health of your physical servers and your virtual machines as a whole. This is all well and good, however, what happens if your virtual machine blue screens? Or what if that latest SQL service pack broke your application? In those cases, none of these solutions are going to do you one bit of good. For those most critical applications, you really must cluster at the application layer. What this means is that you must look into clustering solutions that run within the OS on the virtual machine vs. within the hypervisor. In the Microsoft world this means MSCS/WSFC or 3rd party clustering solutions. Your storage options, when clustering within the virtual machine, are limited in scope to either iSCSI targets or host-based replication solutions. A demonstration of SQL Server being clustered within a Hyper-V VM using SteelEye DataKeeper Cluster Edition is available here. Currently, VMware really does not have a solution to this problem and would defer to solutions that run within the virtual machine for application layer monitoring.
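
A trivial example of the difference: to the hypervisor, the VM below looks healthy – powered on and heartbeating – but only a probe aimed at the application itself notices that SQL Server has stopped answering. The hostname is hypothetical, and a real cluster agent would run an actual test query rather than a bare connection check:

```python
import socket

def sql_server_answers(host: str, port: int = 1433, timeout_s: float = 3.0) -> bool:
    """Application-level probe: is anything accepting connections on the
    default SQL Server port? A real cluster agent would go further and
    run a test query against the database."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

# "sql-vm.example.local" is a hypothetical hostname for illustration.
if not sql_server_answers("sql-vm.example.local"):
    print("The hypervisor sees a healthy, running VM, but the application")
    print("is down -- only application-layer clustering catches this case.")
```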

Summary

With the advent of virtualization, it is really not a question of whether you need availability, but rather which availability option will help meet your SLA and/or DR requirements. I hope this information helps you make sense of the options available to you.
