Every time I read a blog post, or open a magazine article about virtualization and disaster recovery I see the same thing….VMware has a more robust DR solution than Microsoft. Well, I’d like to challenge that assumption. From the view where I sit, this is actually one of the areas where Microsoft has a major competitive advantage at the moment. Here is how I see it.
VMware Site Recovery Manager
This is an optional additional add on that rides on the back of Array based replication solutions. While the recovery point objective is good due to the array based replication, the RTO is measured in hours, not minutes. Add in the fact that moving back to the primary data center is a very manual procedure which basically requires that you re-create your jobs in the opposite direction; the complete end to end recovery operation of failover and failback could take the better part of a day or longer.
Microsoft Multi-Site Cluster
Virtual machine HA clustering is included with the free version of Hyper-V Server 2008 R2, as well as with Windows Server 2008 Enterprise and Datacenter editions. In order to do multi-site clusters, it requires array based replication or host based replication solutions that integrate with Windows Server Failover Clustering. With a multi-site cluster, failover is measured in minutes (just about the time it takes to start a VM) and can be used with array based replication solutions such as EMC SRDF CE or HP MSA CLX or the much less expensive host based replication solutions such as SteelEye DataKeeper Cluster Edition.
Not only is failover quick with Hyper-V multi-site clusters, measured in just a few minutes, failback is also quick and seamless as well. Add in support for Live Migrations or Quick Migration across Data Centers, I think this is one area that Microsoft actually has a much more robust solution than VMware. Maybe it does not included automated DR tests, but when you consider you can failover and failback all in under 10 minutes, maybe an actual DR test performed monthly would give you a much better indication of what to expect in an actual disaster?
If you want a Hyper-V solution more like SRM, then there is an option there as well, it is called Citrix Essential for Hyper-V. But much like SRM, it is an optional add-on feature and really doesn’t even match the RPO and RTO features that you can achieve with basic multi-site clusters for Hyper-V.
What do you think? Am I wrong or is there something I just don’t get? From my view, Hyper-V is heads and shoulders above vSphere in terms of disaster recovery features.
There has recently been a lot of press heralding VMware’s limited support for vMotion across Data Centers, or “long-distance vMotion” as I have seen it called. The details of the solution can be found on Cisco’s website here. While I think that is just great, I’d like to remind people that Microsoft Hyper-V has this same functionality today and has a lot less requirements and restrictions than VMware’s long-distance vMotion.
Where VMware has VMwareHA, vMotion and Site Recovery Manager (SRM) to take care of virtual machine availability, Microsoft provides the same functionality with Windows Server Failover Clustering and in fact in some cases goes beyond what VMware can provide in terms of virtual machine availability as I described in a previous post.
What I’d like to focus on today is Microsoft’s competitive offering to “long-distance vMotion”. To achieve the same functionality in Hyper-V, you simply deploy a multi-site Hyper-V cluster using Windows Server Failover Clustering and your favorite host or storage based replication solution that is certified to work in a Windows Server 2008 multi-site cluster. By doing this, you can use your existing network infrastructure and your existing storage infrastructure to do Live Migrations across data centers. As far as requirements, they really are the same as any multi-site cluster, except I would recommend that you span your subnets to avoid client reconnection issues that occur when moving a virtual machine to a new subnet, as the clients could cache to old IP address until the TTL expires.
With the recent release of Microsoft Windows Server 2008 R2 and vSphere 4.0, I thought it was a good time to review some of the options available when considering the availability of your virtual servers and the applications running on them. I also will take this opportunity to describe some of the features that enable virtual machine availability. I have grouped these features into their function roles to help highlight their purpose.
Live Migration and VMware’s VMotion are both solutions that allow an administrator to move a virtual machine from one physical server to another with no perceivable downtime. The key thing to remember about this technology is that in order to move a virtual machine from one server to another without any downtime, the move must be a planned event. The reason that it must be a planned event is that the virtual machine’s memory must be synchronized between the servers before the actual switchover can occur. This is true of both Microsoft’s and VMware’s solutions. Also keep in mind that both of these technologies require the use of shared storage to hold the virtual hard disks (VMDK and VHD files), which limits Live Migration and VMotion to local area networks. This also means that any downtime planned for the storage array must be handled in a different way if you want to limit the impact to your virtual machines.
Microsoft’s Windows Server Failover Clustering and VMware’s High Availability (HA) are the solutions that are available to protect virtual machines in the event of unplanned downtime. Both solutions are similar in that they monitor virtual machines for availability and in the case of a failure the VMs are moved to the standby node. This recovery process requires that the machines be rebooted since there was no time to sync the memory before failover.
How do I recover my virtual machines in the event of a complete site loss? The good news is that virtualization makes this process a whole lot easier since a virtual machine is just a file that can be picked up and moved to another server. While up to this point VMware and Microsoft are pretty similar in their availability features and functionality, but here is where Microsoft really shines. VMware offers Site Recovery Manager which is a fine product, but is limited in support to only SRM-certified array-based replication solutions. Also, the failover and failback process is not trivial and can take the better part of a day to do a complete round trip from the DR site back to the primary data center. It does have some nice features like DR testing, but in my experience with Microsoft’s solution for disaster recovery they have a much better solution when it comes to disaster recovery.
Microsoft’s Hyper-V DR solution is Windows Server Failover Clustering in a multi-site cluster configuration (see video demonstration). In this configuration the performance and behavior is the same as a local area cluster, but yet it can span data centers. What this means is that you can actually move your virtual machines across data centers with little to no perceivable downtime. Failback is the same process, just point and click to move the virtual machine resource back to the primary data center. While there is no built in “DR Testing”, I think it is preferable to do an actual DR test in just the matter of a minute or two with no perceivable downtime. The other thing I like about WSFC multi-site clusters is that the replication options include not only array-based replication vendors, but also host-based replication vendors. This really gives you a wide range of replication solutions in all price ranges and does not require that you upgrade your existing storage infrastructure.
Fault tolerance basically eliminates the need to reboot a virtual machine in the event of an unexpected failure. VMware has the edge here in that it offers VMware FT. There are a few other 3rd party hardware and software vendors that play in this space as well. There are plenty of limitations and requirements when it comes to implementing FT systems, but if you need to ensure that a hardware component failure results in zero downtime vs. the minute or two it takes to boot up a VM in a standard HA configuration, then this is an option that you may want to consider. You probably want to make sure that your existing servers are already chock full of hot standby CPUs, RAM, power supplies, etc, and you have redundant paths to the network and storage, otherwise you may be throwing good money after bad. Fault tolerance is great for protection from hardware failures, but what happens if your application or the virtual machine’s operating system is behaving badly? That is when you need application level clustering as described below.
Everything I have discussed up to this point really only takes into consideration the health of your physical servers and your virtual machines as a whole. This is all well and good, however, what happens if your virtual machine blue screens? Or what if that latest SQL service pack broke your application? In those cases, none of these solutions are going to do you one bit of good. For those most critical applications, you really must cluster at the application layer. What this means is that you must look into clustering solutions that run within the OS on the virtual machine vs. within the hypervisor. In the Microsoft world this means MSCS/WSFC or 3rd party clustering solutions. Your storage options, when clustering within the virtual machine, are limited in scope to either iSCSI targets or host-based replication solutions. A demonstration of SQL Server being clustered within a Hyper-V VM using SteelEye DataKeeper Cluster Edition is available here. Currently, VMware really does not have a solution to this problem and would defer to solutions that run within the virtual machine for application layer monitoring.
With the advent of virtualization, it is really not a question of if you need availability, but more of a question of what availability option will help meet your SLA and/or DR requirements. I hope that this information helps you make sense of the availability options available to you.