It is official: I passed exam 70-652 today and I am now an MCTS: Windows Server Virtualization, Configuration. It was 11 years ago that I sat for my first NT 4 exam, and now, about a dozen exams later, I am once again updating my credentials to the latest and greatest. I think certifications are a good thing, but they certainly don’t replace real world experience and good Google skills when it comes to diagnosing a problem or planning a new project. I’ll keep you posted on my progress; hopefully I’ll be able to complete MCITP: Enterprise Administrator before my kids get out of school in June so I can enjoy the summer.
Microsoft has recently updated their Virtualization Continuity page with some good information…
Cross-Site Disaster Recovery Solutions
A reliable, rapid-recovery strategy can be time-consuming to implement and expensive to manage. Because of the complexity and cost, many companies simply don’t have comprehensive business continuity plans to protect their data and ensure application availability.
Virtualization has been a game changer for many companies. With virtualization based Site Recovery solutions, you can ensure higher availability and business continuity options. Windows Server provides support for a wide range of industry leading, shared storage solutions to deliver Quick and Live Migration. Combined with partner cross-site data management and replication technologies, Microsoft is offering complete Site Recovery solutions.
In summary, Microsoft Site Recovery solutions provide these key benefits:
- Bullet proof application and data availability across a range of applications
- Site-wide disaster recovery that can help you gain immediate and long-term operational and capital benefits
- Automated fail-over and fail back based on clustering and data resynchronization delivering superior application and data availability, for planned and unplanned downtime
Also, they have recently published a white paper entitled “Microsoft End-to-End Cross-Site Disaster Recovery Solutions“. This is a must read for anyone deploying SteelEye DataKeeper in a Cross-Site Disaster Recovery configuration.
Check out this podcast where Carryl Roy from Virtual Strategy Magazine interviews me about virtualization availability.
With the launch of Windows Server 2008 R2 and the interest in DataKeeper replication solutions for Hyper-V, I have been pretty busy – which is a good thing! Recently, I have been speaking with some Microsoft Gold Partners who are busy installing Hyper-V at their customer locations and looking at some of the disaster recovery options, including multi-site Hyper-V clusters with SteelEye DataKeeper Cluster Edition. Many of the questions are the same each time and the demonstration is always the same. I figured it may be beneficial to produce a video that addresses one of those questions specifically: the difference between Live Migration and Quick Migration and when to use one vs. the other. The video demonstrates both Live Migration and Quick Migration while discussing some of the things to consider, and it may be of interest to you.
Hyper-V pass-through disk performance vs. fixed size VHD files and dynamic VHD files in Windows Server 2008 R2
With the release of Windows Server 2008 R2, one of the enhancements was improved performance for dynamic VHD files. Prior to R2, writes to dynamically expanding VHD files could be 3x slower than writes to a fixed size VHD file due to limited metadata caching. With R2, Microsoft is claiming that the performance of dynamic VHD files is almost identical to that of fixed size VHD files.
Pass-through disks are another option when configuring a Hyper-V VM. According to my results, the performance of a pass-through disk is marginally better than that of VHD files. However, if you use pass-through disks you lose all of the benefits of VHD files such as portability, snapshots and thin provisioning. Considering these trade-offs, pass-through disks should really only be considered if you require a disk that is greater than 2 TB in size or if your application is I/O bound and you really could benefit from another 0.1 ms shaved off your average response time.
I figured rather than taking Microsoft’s word for it, I would put these different disk types to the test for myself. I set up a Hyper-V Windows Server 2008 R2 virtual machine running on top of Windows Server 2008 R2. For my hypervisor host, I used a Dell PowerEdge 1950 server attached to a Dell AX150 SAN and carved off three 10 GB LUNs for use in my test. In Hyper-V Manager, I added three new disks: one pass-through, one dynamic VHD and one fixed-size VHD. I then used IOMeter to test the performance of the disks. The test parameters and raw data can be found in this CSV file.
The charts below summarize my results. As you can see, on the extremes (Maximum/Minimum), the pass-through disk wins in most cases. However, on average, there is almost no difference between the performances of the three different disk types.
The benefit of thin provisioning, meaning building a VHD file or multiple VHD files with a combined size that is greater than the available disk space, and the portability of VHD files make dynamically expanding VHD files the obvious choice for most Windows Server 2008 R2 virtual machines.
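To make the thin provisioning idea concrete, here is a minimal Python sketch. The function name, the VHD sizes and the volume capacity are all hypothetical numbers I made up for illustration; the sketch simply compares the combined maximum size of a set of dynamic VHD files with the physical space that backs them.

```python
def overcommit_report(vhd_max_gb, vhd_used_gb, volume_capacity_gb):
    """Summarize thin-provisioning exposure for a set of dynamic VHD files.

    vhd_max_gb:  maximum (provisioned) size of each VHD, in GB
    vhd_used_gb: space each VHD currently occupies on disk, in GB
    """
    provisioned = sum(vhd_max_gb)
    used = sum(vhd_used_gb)
    return {
        "provisioned_gb": provisioned,
        "used_gb": used,
        # True when the VHDs could, in total, outgrow the physical volume
        "overcommitted": provisioned > volume_capacity_gb,
        "overcommit_ratio": provisioned / volume_capacity_gb,
        "free_gb": volume_capacity_gb - used,
    }


# Hypothetical example: three 100 GB dynamic VHDs on a 200 GB volume
report = overcommit_report([100, 100, 100], [20, 35, 10], 200)
print(report)
```

In this made-up case the volume is overcommitted by a factor of 1.5, which is fine as long as you keep an eye on the remaining free space as the VHD files grow.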
In summary, I’d strongly consider the use of dynamically expanding VHD files for your next Hyper-V deployment on Windows Server 2008 R2.
Integrating your storage replication with Failover Clustering
In Part 1 of this series, we took a look at the first steps required for building a multi-site cluster. We got to the point where we had a two node cluster that used a node and file share majority quorum, with no resources yet defined. In this section we will start where we left off and look at how your replication solution will integrate with your failover clustering. Because each vendor’s replication solution will be implemented differently, it is hard to have one document that describes them all. The important thing to remember is that you want to purchase a replication solution that integrates with failover clustering and is certified by Microsoft. Your choices are basically array based, appliance based or host based replication solutions. EMC makes both appliance-based and array-based replication solutions and seems to do a great job at both. EMC’s John Toner maintains a blog that is dedicated to Geographically Dispersed Clusters and if you are going the EMC route, I’m sure he could lead you in the right direction. All the major vendors have solutions; you will just need to contact them to get the details.
For this demonstration, I’m going to use a host based replication solution, SteelEye DataKeeper Cluster Edition, from my company, SteelEye Technology. It is so easy that I thought instead of doing a long article, I would just record the steps and share them with you in a video. One of the advantages of host based replication is that you can utilize your existing storage, whether it is just some local attached disks, iSCSI or an expensive SAN. Host based replication can replicate across any storage devices.
Here is a summary of what you will see in the video.
Launch the SteelEye DataKeeper MMC Snap-in
- Create a new DataKeeper job, define mirror end points, network, compression, etc.
Launch the Failover Cluster MMC Snap-in
- Create a Hyper-V resource
- Add a DataKeeper Volume Resource
- Edit the properties of the DataKeeper Volume resource to associate it with the mirror created earlier
- Make the Virtual Machine configuration dependent upon the new DataKeeper volume resource
That’s it! You are now done. Sit back and enjoy your new Hyper-V multi-site cluster.
In Part 3 of this series, we will tackle SQL 2008 multi-site clusters on Windows Server 2008 R2. There are a few more steps and some tips and tricks you will definitely need to know, so make sure you check back to get all of the details. In the meantime, if you need assistance, leave me a comment or contact me through SIOS and I’d be glad to help you out.
There has recently been a lot of press heralding VMware’s limited support for vMotion across Data Centers, or “long-distance vMotion” as I have seen it called. The details of the solution can be found on Cisco’s website here. While I think that is just great, I’d like to remind people that Microsoft Hyper-V has this same functionality today and has far fewer requirements and restrictions than VMware’s long-distance vMotion.
Where VMware has VMwareHA, vMotion and Site Recovery Manager (SRM) to take care of virtual machine availability, Microsoft provides the same functionality with Windows Server Failover Clustering and, in some cases, goes beyond what VMware can provide in terms of virtual machine availability, as I described in a previous post.
What I’d like to focus on today is Microsoft’s competitive offering to “long-distance vMotion”. To achieve the same functionality in Hyper-V, you simply deploy a multi-site Hyper-V cluster using Windows Server Failover Clustering and your favorite host or storage based replication solution that is certified to work in a Windows Server 2008 multi-site cluster. By doing this, you can use your existing network infrastructure and your existing storage infrastructure to do Live Migrations across data centers. As far as requirements, they really are the same as any multi-site cluster, except I would recommend that you span your subnets to avoid the client reconnection issues that occur when moving a virtual machine to a new subnet, as the clients could cache the old IP address until the TTL expires.
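The TTL issue is easier to see with a toy model. The Python sketch below is not how any real DNS client is implemented; the class name, the host name, the IP addresses and the 1200 second TTL are all assumptions chosen for illustration. It only shows why a client keeps using the old address after a virtual machine moves to a different subnet.

```python
class CachingResolver:
    """Toy client-side DNS cache: answers are reused until their TTL expires."""

    def __init__(self, lookup, ttl_seconds):
        self._lookup = lookup      # function: name -> current IP address
        self._ttl = ttl_seconds
        self._cache = {}           # name -> (ip, expires_at)

    def resolve(self, name, now):
        cached = self._cache.get(name)
        if cached and now < cached[1]:
            return cached[0]       # still fresh as far as the client knows
        ip = self._lookup(name)
        self._cache[name] = (ip, now + self._ttl)
        return ip


# Hypothetical DNS zone and addresses
dns = {"vm1": "10.1.0.50"}
client = CachingResolver(dns.__getitem__, ttl_seconds=1200)

first = client.resolve("vm1", now=0)      # cached for 1200 seconds
dns["vm1"] = "10.2.0.50"                  # VM moves to a subnet in the other site
stale = client.resolve("vm1", now=600)    # client still gets the old address
fresh = client.resolve("vm1", now=1200)   # TTL expired, new address is picked up
print(first, stale, fresh)
```

With a spanned subnet the virtual machine keeps its IP address when it moves, so there is nothing stale for the client to hold on to.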
Creating your cluster and configuring the quorum: Node and File Share Majority
Welcome to Part 1 of my series “Step-by-Step: Configuring a 2-node multi-site cluster on Windows Server 2008 R2”. Before we jump right in to the details, let’s take a moment to discuss what exactly a multi-site cluster is and why I would want to implement one. Microsoft has a great webpage and white paper that you will want to download to get you all of the details, so I won’t repeat everything here. But basically a multi-site cluster is a disaster recovery solution and a high availability solution all rolled into one. A multi-site cluster gives you the best recovery point objective (RPO) and recovery time objective (RTO) available for your critical applications. With Windows Server 2008 failover clustering, a multi-site cluster has become much more feasible thanks to the introduction of cross subnet failover and support for high latency network communications.
I mentioned “cross-subnet failover” as a great new feature of Windows Server 2008 Failover Clustering, and it is a great new feature. However, SQL Server has not yet embraced this functionality, which means you will still be required to span your subnet across sites in a SQL Server multi-site cluster. As of Tech-Ed 2009, the SQL Server team reported that they plan on supporting this feature, but they say it will come sometime after SQL Server 2008 R2 is released. For the foreseeable future you will be stuck with spanning your subnet across sites in a SQL Server multi-site cluster. There are a few other network related issues that you need to consider as well, such as redundant communication paths, bandwidth and file share witness placement.
All Microsoft failover clusters must have redundant network communication paths. This ensures that a failure of any one communication path will not result in a false failover and ensures that your cluster remains highly available. A multi-site cluster has this requirement as well, so you will want to plan your network with that in mind. There are generally two things that will have to travel between nodes: replication traffic and cluster heartbeats. In addition to that, you will also need to consider client connectivity and cluster management activity. You will want to be sure that whatever networks you have in place, you are not overwhelming the network or you will have unreliable behavior. Your replication traffic will most likely require the greatest amount of bandwidth; you will need to work with your replication vendor to determine how much bandwidth is required.
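Before that conversation with your vendor, a back-of-the-envelope estimate can help frame it. The sketch below is only an illustration in Python: the function name, the 45 GB of changed data and the 10 hour window are hypothetical, and the result is an average rate, not the peak rate your replication product may actually need.

```python
def required_mbps(changed_gb, window_hours, compression_ratio=1.0):
    """Average bandwidth, in megabits per second, needed to replicate
    changed_gb of data within window_hours.

    compression_ratio: 2.0 means the data shrinks to half its size on the wire.
    Uses decimal units (1 GB = 1000**3 bytes).
    """
    bits_on_wire = changed_gb * 8 * 1000**3 / compression_ratio
    seconds = window_hours * 3600
    return bits_on_wire / seconds / 1_000_000


# Hypothetical workload: 45 GB of changes spread over a 10 hour business day
print(required_mbps(45, 10))                          # 10.0
print(required_mbps(45, 10, compression_ratio=2.0))   # 5.0
```

Write activity is rarely spread evenly across the day, so treat a number like this as a floor and leave headroom for bursts and for the cluster heartbeats that share the link.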
With your redundant communication paths in place, the last thing you need to consider is your quorum model. For a 2-node multi-site cluster configuration, the Microsoft recommended configuration is a Node and File Share Majority quorum. For a detailed description of the quorum types, have a look at this article.
The most common cause of confusion with the Node and File Share Majority quorum is the placement of the File Share Witness. Where should I put the server that is hosting the file share? Let’s look at the options.
Option 1 – place the file share in the primary site.
This is certainly a valid option for disaster recovery, but not so much for high availability. If the entire site fails (including the Primary node and the file share witness), the Secondary node in the secondary site will not come into service automatically; you will need to force the quorum online manually. This is because it will be the only remaining vote in the cluster. One out of three does not make a majority! Now if you can live with a manual step being involved for recovery in the event of a disaster, then this configuration may be OK for you.
Option 2 – place the file share in the secondary site.
This is not such a good idea. Although it solves the problem of automatic recovery in the event of a complete site loss, it exposes you to the risk of a false failover. Consider this…what happens if your secondary site goes down? In this case, your primary server (Node1) will also go offline, as it is now the only vote left in the primary site and no longer has a majority. I can see no good reason to implement this configuration as there is too much risk involved.
Option 3 – place the file share witness in a 3rd geographic location
This is the preferred configuration as it allows for automatic failover in the event of a complete site loss and eliminates the possibility of a failure of the secondary site causing the primary node to go offline. By having a 3rd site host the file share witness you have eliminated any one site as a single point of failure, so now the cluster will act as you expect and automatic failover in the event of a site loss is possible. Identifying a 3rd geographic location can be challenging for some companies, but with the advent of cloud based utility computing like Amazon EC2 and GoGrid, it is well within the reach of all companies to put a file share witness in the clouds and have the resiliency required for effective multi-site clusters. In fact, you may consider the cloud itself as your secondary data center and just failover to the cloud in the event of a disaster. I think the possibilities of cloud based computing and disaster recovery configurations are extremely enticing and in fact I plan on doing a whole blog post on just that in the near future.
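The three options come down to simple vote counting, which can be sketched in a few lines of Python. This is a simplified model of majority voting that I wrote for illustration, not the actual failover clustering algorithm; the site names and the PLACEMENTS dictionary are hypothetical.

```python
def has_quorum(votes_online, total_votes):
    """A partition keeps quorum only with a strict majority of all votes."""
    return votes_online > total_votes // 2


# Where each of the three voters lives under each option (hypothetical sites)
PLACEMENTS = {
    "option1": {"PRIMARY": "site_a", "SECONDARY": "site_b", "WITNESS": "site_a"},
    "option2": {"PRIMARY": "site_a", "SECONDARY": "site_b", "WITNESS": "site_b"},
    "option3": {"PRIMARY": "site_a", "SECONDARY": "site_b", "WITNESS": "site_c"},
}


def survives_site_loss(placement, failed_site):
    """True if the voters outside failed_site still form a majority."""
    online = sum(1 for site in placement.values() if site != failed_site)
    return has_quorum(online, len(placement))


for option, placement in PLACEMENTS.items():
    for site in ("site_a", "site_b", "site_c"):
        print(option, "loses", site, "->", survives_site_loss(placement, site))
```

Only option 3 keeps a majority no matter which single site is lost. Option 1 loses quorum when the primary site fails, and option 2 loses quorum when the secondary site fails.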
Configure the Cluster
Now that we have the basics in place, let’s get started with the actual configuration of the cluster. You will want to add the Failover Clustering feature to both nodes of your cluster. For simplicity’s sake, I’ve called my nodes PRIMARY and SECONDARY. Adding the feature is accomplished very easily through the Add Features Wizard as shown below.
Figure 1 – Add the Failover Clustering Feature
Next you will want to have a look at your network connections. It is best if you rename the connections on each of your servers to reflect the network that they represent. This will make things easier to remember later.
Figure 2 – Change the names of your network connections
You will also want to go into the Advanced Settings of your Network Connections (hit Alt to see the Advanced Settings menu) on each server and make sure the Public network is first in the list.
Figure 3 – Make sure your public network is first
Your private network should only contain an IP address and Subnet mask. No Default Gateway or DNS servers should be defined. Your nodes need to be able to communicate with each other across this network, so add static routes if necessary.
Figure 4 – Private network settings
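The private network rules above can be written down as a quick sanity check. The Python sketch below uses only the standard `ipaddress` module; the function names and the example addresses are hypothetical, and it checks values you type in rather than reading the real settings from a server.

```python
import ipaddress


def check_private_nic(address_with_prefix, gateway=None, dns_servers=()):
    """Return a list of problems with a cluster private-network NIC config."""
    problems = []
    # Raises ValueError if the address or prefix is malformed
    ipaddress.ip_interface(address_with_prefix)
    if gateway:
        problems.append("default gateway should be empty")
    if dns_servers:
        problems.append("DNS servers should be empty")
    return problems


def same_subnet(a, b):
    """True if two interfaces such as '10.0.0.1/24' are on the same network."""
    return ipaddress.ip_interface(a).network == ipaddress.ip_interface(b).network


# Hypothetical private addresses for PRIMARY and SECONDARY
print(check_private_nic("10.0.0.1/24"))
print(check_private_nic("10.0.0.1/24", gateway="10.0.0.254", dns_servers=["10.0.0.10"]))
print(same_subnet("10.0.0.1/24", "10.0.1.1/24"))
```

When `same_subnet` returns False for the two private addresses, that is the case where static routes are likely to be needed for the nodes to reach each other.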
Once you have your network configured, you are ready to build your cluster. The first step is to “Validate a Configuration”. Open up the Failover Cluster Manager and click on Validate a Configuration.
Figure 5 – Validate a Configuration
The Validation Wizard launches and presents you the first screen as shown below. Add the two servers in your cluster and click Next to continue.
Figure 6 – Add the cluster nodes
A multi-site cluster does not need to pass the storage validation (see Microsoft article). To skip the storage validation process, click on “Run only the tests I select” and click Continue.
Figure 7 – Select “Run only tests I select”
In the test selection screen, unselect Storage and click Next
Figure 8 – Unselect the Storage test
You will be presented with the following confirmation screen. Click Next to continue.
Figure 9 – Confirm your selection
If you have done everything right, you should see a summary page that looks like the following. Notice that the yellow exclamation point indicates that not all of the tests were run. This is to be expected in a multi-site cluster because the storage tests are skipped. As long as everything else checks out OK, you can proceed. If the report indicates any other errors, fix the problem, re-run the tests, and continue.
Figure 10 – View the validation report
You are now ready to create your cluster. In the Failover Cluster Manager, click on Create a Cluster.
Figure 11 – Create your cluster
The next step asks whether or not you want to validate your cluster. Since you have already done this you can skip this step. Note that this will pose a bit of a problem later on when installing SQL Server, as setup requires that the cluster has passed validation before proceeding. When we get to that point I will show you how to bypass this check via a command line option in the SQL Server setup. For now, choose No and Next.
Figure 12 – Skip the validation test
Next, you must choose a name and an IP address for administering this cluster. This is the name that you will use to administer the cluster, not the name of the SQL cluster resource, which you will create later. Enter a unique name and IP address and click Next.
Note: This is also the computer name that will need permission to the File Share Witness as described later in this document.
Figure 13 – Choose a unique name and IP address
Confirm your choices and click Next.
Figure 14 – Confirm your choices
Congratulations! If you have done everything right you will see the following Summary page. Notice the yellow exclamation point; obviously something is not perfect. Click on View Report to find out what the problem may be.
Figure 15 – View the report to find out what the warning is all about
If you view the report, you should see a few lines that look like this.
Figure 16 – Error report
Don’t fret; this is to be expected in a multi-site cluster. Remember we said earlier that we will be implementing a Node and File Share Majority quorum. We will change the quorum type from the current Node Majority Cluster (not a good idea in a two node cluster) to a Node and File Share Majority quorum.
Implementing a Node and File Share Majority quorum
First, we need to identify the server that will hold our File Share witness. Remember, as we discussed earlier, this File Share witness should be located in a 3rd location, accessible by both nodes of the cluster. Once you have identified the server, share a folder as you normally would share a folder. In my case, I created a share called MYCLUSTER on a server named DEMODC.
The key thing to remember about this share is that you must give the cluster computer account read/write permissions at both the share level and the NTFS level. If you recall, back at Figure 13 I created my cluster and gave it the name “MYCLUSTER”. You will need to make sure you give the cluster computer account read/write permissions as shown in the following screen shots.
Figure 17 – Make sure you search for Computers
Figure 18 – Give the cluster computer account NTFS permissions
Figure 19 – Give the cluster computer account share level permissions
Now with the shared folder in place and the appropriate permissions assigned, you are ready to change your quorum type. From Failover Cluster Manager, right-click on your cluster, choose More Actions and Configure Cluster Quorum Settings.
Figure 20 – Change your quorum type
On the next screen choose Node and File Share Majority and click Next.
Figure 21 – Choose Node and File Share Majority
In this screen, enter the path to the file share you previously created and click Next.
Figure 22 – Choose your file share witness
Confirm that the information is correct and click Next.
Figure 23 – Click Next to confirm your quorum change to Node and File Share Majority
Assuming you did everything right, you should see the following Summary page.
Figure 24 – A successful quorum change
Now when you view your cluster, the Quorum Configuration should say “Node and File Share Majority” as shown below.
Figure 25 – You now have a Node and File Share Majority quorum
The steps I have outlined up until this point apply to any multi-site cluster, whether it is a SQL, Exchange, File Server or other type of failover cluster. The next step in creating a multi-site cluster involves integrating your storage and replication solution into the failover cluster. This step will vary depending upon your replication solution, so you really need to be in close contact with your replication vendor to get it right. In Part 2 of my series, I will illustrate how SteelEye DataKeeper Cluster Edition integrates with Windows Server Failover Clustering to give you an idea of how one replication vendor’s solution works.
Other parts of this series will describe in detail how to install SQL, File Servers and Hyper-V in multi-site clusters. I will also have a post on considerations for multi-node clusters of three or more nodes.
I downloaded the Windows Server 2008 R2 bits this afternoon and spent the rest of the afternoon installing it and testing the long awaited Live Migration. Well, as I have seen in the numerous demos, it works great! And to top it off, I did a Live Migration in a multi-site cluster with DataKeeper Cluster Edition and it all worked great as expected. Keep your eyes peeled for a new video that will show you how to set up a Hyper-V disaster recovery configuration using Windows Server 2008 R2 in a multi-site cluster with SteelEye DataKeeper. I’ll post the link just as soon as it is ready.
With the recent release of Microsoft Windows Server 2008 R2 and vSphere 4.0, I thought it was a good time to review some of the options available when considering the availability of your virtual servers and the applications running on them. I also will take this opportunity to describe some of the features that enable virtual machine availability. I have grouped these features by functional role to help highlight their purpose.
Live Migration and VMware’s VMotion are both solutions that allow an administrator to move a virtual machine from one physical server to another with no perceivable downtime. The key thing to remember about this technology is that in order to move a virtual machine from one server to another without any downtime, the move must be a planned event. The reason that it must be a planned event is that the virtual machine’s memory must be synchronized between the servers before the actual switchover can occur. This is true of both Microsoft’s and VMware’s solutions. Also keep in mind that both of these technologies require the use of shared storage to hold the virtual hard disks (VMDK and VHD files), which limits Live Migration and VMotion to local area networks. This also means that any downtime planned for the storage array must be handled in a different way if you want to limit the impact to your virtual machines.
Microsoft’s Windows Server Failover Clustering and VMware’s High Availability (HA) are the solutions that are available to protect virtual machines in the event of unplanned downtime. Both solutions are similar in that they monitor virtual machines for availability and in the case of a failure the VMs are moved to the standby node. This recovery process requires that the machines be rebooted since there was no time to sync the memory before failover.
How do I recover my virtual machines in the event of a complete site loss? The good news is that virtualization makes this process a whole lot easier since a virtual machine is just a file that can be picked up and moved to another server. Up to this point VMware and Microsoft are pretty similar in their availability features and functionality, but here is where Microsoft really shines. VMware offers Site Recovery Manager, which is a fine product but is limited in support to only SRM-certified array-based replication solutions. Also, the failover and failback process is not trivial and can take the better part of a day to do a complete round trip from the DR site back to the primary data center. It does have some nice features like DR testing, but in my experience Microsoft has a much better solution when it comes to disaster recovery.
Microsoft’s Hyper-V DR solution is Windows Server Failover Clustering in a multi-site cluster configuration (see video demonstration). In this configuration the performance and behavior are the same as a local area cluster, yet it can span data centers. What this means is that you can actually move your virtual machines across data centers with little to no perceivable downtime. Failback is the same process: just point and click to move the virtual machine resource back to the primary data center. While there is no built in “DR Testing”, I think it is preferable to do an actual DR test in just a matter of a minute or two with no perceivable downtime. The other thing I like about WSFC multi-site clusters is that the replication options include not only array-based replication vendors, but also host-based replication vendors. This really gives you a wide range of replication solutions in all price ranges and does not require that you upgrade your existing storage infrastructure.
Fault tolerance basically eliminates the need to reboot a virtual machine in the event of an unexpected failure. VMware has the edge here in that it offers VMware FT. There are a few other 3rd party hardware and software vendors that play in this space as well. There are plenty of limitations and requirements when it comes to implementing FT systems, but if you need to ensure that a hardware component failure results in zero downtime vs. the minute or two it takes to boot up a VM in a standard HA configuration, then this is an option that you may want to consider. You probably want to make sure that your existing servers are already chock full of hot standby CPUs, RAM, power supplies, etc, and you have redundant paths to the network and storage, otherwise you may be throwing good money after bad. Fault tolerance is great for protection from hardware failures, but what happens if your application or the virtual machine’s operating system is behaving badly? That is when you need application level clustering as described below.
Everything I have discussed up to this point really only takes into consideration the health of your physical servers and your virtual machines as a whole. This is all well and good; however, what happens if your virtual machine blue screens? Or what if that latest SQL service pack broke your application? In those cases, none of these solutions are going to do you one bit of good. For those most critical applications, you really must cluster at the application layer. What this means is that you must look into clustering solutions that run within the OS on the virtual machine vs. within the hypervisor. In the Microsoft world this means MSCS/WSFC or 3rd party clustering solutions. Your storage options, when clustering within the virtual machine, are limited in scope to either iSCSI targets or host-based replication solutions. A demonstration of SQL Server being clustered within a Hyper-V VM using SteelEye DataKeeper Cluster Edition is available here. Currently, VMware really does not have a solution to this problem and would defer to solutions that run within the virtual machine for application layer monitoring.
With the advent of virtualization, it is really not a question of if you need availability, but more of a question of which availability option will help meet your SLA and/or DR requirements. I hope that this information helps you make sense of the options open to you.