Using HP SAN Virtualization Services Platform (SVSP) in a vSphere stretched cluster
Recently I was involved in evaluating the integration of HP’s StorageWorks SVSP solution with a vSphere stretched cluster design.
Hewlett-Packard’s SVSP is a product I classify as a “storage enabler”. Storage enablers add a layer between one or more physical SANs and the hosts accessing LUNs, and that layer provides features the underlying SANs do not offer on their own. Examples include aggregating storage across dissimilar SANs, LUN replication, LUN cloning, and synchronous mirroring.
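To illustrate the idea (this is not SVSP’s actual implementation), here is a minimal Python sketch of what such a layer does conceptually: it presents a single virtual LUN to the host and maps each block to one of several back-end LUNs, which may live on dissimilar arrays. The array and LUN names are made up for the example.

```python
class BackendLun:
    """A slice of capacity on a physical array (e.g., an XP24000)."""
    def __init__(self, array, lun_id, size_blocks):
        self.array, self.lun_id, self.size_blocks = array, lun_id, size_blocks
        self.blocks = {}

class VirtualLun:
    """One LUN presented to the host, concatenated across back-end LUNs."""
    def __init__(self, backends):
        self.backends = backends

    def _locate(self, block):
        # Simple concatenation: walk the back-end LUNs until the block fits.
        for be in self.backends:
            if block < be.size_blocks:
                return be, block
            block -= be.size_blocks
        raise IndexError("block beyond virtual LUN size")

    def write(self, block, data):
        be, offset = self._locate(block)
        be.blocks[offset] = data

    def read(self, block):
        be, offset = self._locate(block)
        return be.blocks.get(offset)

# A virtual LUN backed by capacity from two dissimilar (hypothetical) arrays.
vlun = VirtualLun([BackendLun("XP24000-siteA", 0, 1000),
                   BackendLun("EVA-siteA", 7, 500)])
vlun.write(1200, b"payload")   # lands on the second back-end LUN
print(vlun.read(1200))
```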
This particular customer is using vSphere’s HA feature as a DR solution. Their plan was to combine vSphere HA with geographically dispersed storage and the sync-mirror feature of SVSP to automate recovery in the event that one of the sites hosting half of the cluster goes down.
The sync mirror feature takes an active LUN and synchronizes the data at the block level to a passive mirror of that LUN.
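Conceptually, a synchronous mirror acknowledges a write only after both the active and passive copies have committed it; if the mirror link is down, writes are journaled on the active side and replayed when the link returns, which is the re-sync behavior discussed under Test 2 below. A minimal Python sketch of that write path, with made-up structures, looks like this:

```python
class SyncMirroredLun:
    """Active/passive LUN pair with a simple write journal for link outages."""
    def __init__(self):
        self.active = {}     # block -> data on the active LUN
        self.passive = {}    # block -> data on the passive mirror
        self.journal = []    # writes made while the mirror link was down
        self.link_up = True

    def write(self, block, data):
        self.active[block] = data
        if self.link_up:
            self.passive[block] = data          # ack only after both copies commit
        else:
            self.journal.append((block, data))  # journal the change for later

    def resync(self):
        """Replay the journal once the mirror link is re-established."""
        self.link_up = True
        while self.journal:
            block, data = self.journal.pop(0)
            self.passive[block] = data          # this replay is the I/O-heavy re-sync

lun = SyncMirroredLun()
lun.write(1, b"a")
lun.link_up = False    # simulate the broken ISL between sites
lun.write(2, b"b")     # journaled, not mirrored
lun.resync()
assert lun.passive == lun.active
```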
In this specific case, the customer had redundant dark fiber connections between sites with two XP24000s, one at each site.
After evaluating this proposed configuration, I found some significant problems with the approach. Would it provide a DR solution? Yes, but a more appropriate question is: will it provide a reliable and stable DR solution at a reasonable cost?
As I worked with the customer and did my own research, I found no evidence that VMware officially supports stretched clusters, which makes their actual supportability questionable. You do not want to call VMware for support and be surprised by a demand that you redesign your cluster before support will continue.
Only recently have major players such as NetApp, EMC, and Cisco begun looking at how geographically dispersed datacenters can be configured to support core vSphere features like vMotion. Those efforts show that the functionality is being examined and tested. I believe full support for this type of DR solution is coming, but we are not quite there yet; with the high-speed WAN/Internet links available today, I expect it to become an increasingly common topic of discussion.
While working on this project, I performed a series of tests in order to document findings, determine actual DR success, and capture specific data which would be used at a later date to recommend future solutions.
To outline my findings and give an example of the tests I performed, I’ve shared some details below:
Test 1 – Failure testing – vSphere HA functionality was tested and verified to work properly with both local and remote storage during hard shutdowns of ESX hosts. Breaking the Fibre Channel ISL between sites A and B for an extended period, however, broke any VM whose storage was hosted remotely. VMs were able to recover from storage disconnects of less than 2 minutes, but anything longer than that and the VMs died a slow death. Even under the 2-minute mark, data corruption could still occur on a production box trying to read/write data to the SVSP LUN. As of this writing, vSphere has no mechanism to recognize that a VM has lost access to its storage and compensate for it. Multipathing cannot help here because the sites are separate (i.e., it is not active/active storage).
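As a rough illustration of how this kind of stall can be observed from inside a guest (not the exact tooling used in these tests), the following Python sketch writes and fsyncs a small block every second and flags any write that stalls beyond a threshold; the file path and threshold are arbitrary.

```python
import os, time

PROBE_FILE = "/var/tmp/io_probe.dat"   # arbitrary path on the VMDK under test
STALL_THRESHOLD = 5.0                   # seconds before we call a write "stalled"

def probe_once():
    """Write and fsync one block, returning how long the I/O took."""
    start = time.time()
    with open(PROBE_FILE, "wb") as f:
        f.write(b"x" * 4096)
        f.flush()
        os.fsync(f.fileno())            # force the write through to the LUN
    return time.time() - start

while True:
    try:
        elapsed = probe_once()
        if elapsed > STALL_THRESHOLD:
            print(f"write stalled for {elapsed:.1f}s")   # storage path likely broken
    except OSError as exc:
        print(f"write failed outright: {exc}")
    time.sleep(1)
```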
Test 1 Findings – I recommend against hosting storage remotely. Keep active storage at each site, with each VM accessing its storage locally. This can be guaranteed with a split cluster design or by turning DRS off.
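One simple way to verify the “storage stays local” rule is to compare each VM’s running site against its datastore’s site. The mappings below are hypothetical and would in practice come from your inventory (or the vSphere API); this is only a sketch of the check.

```python
# Hypothetical inventory: which site each host and each datastore lives at.
host_site = {"esx01": "A", "esx02": "A", "esx03": "B", "esx04": "B"}
datastore_site = {"svsp_lun_01": "A", "svsp_lun_02": "B"}

# Hypothetical VM placement: (running host, datastore backing its VMDKs).
vms = {
    "web01": ("esx01", "svsp_lun_01"),
    "db01":  ("esx02", "svsp_lun_02"),   # running at site A, storage at site B
    "app01": ("esx03", "svsp_lun_02"),
}

for vm, (host, ds) in vms.items():
    if host_site[host] != datastore_site[ds]:
        print(f"{vm}: runs at site {host_site[host]} but its storage is at "
              f"site {datastore_site[ds]} -- remote I/O, at risk on an ISL break")
```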
Test 2 – Performance testing – During the testing we monitored “Total I/O per Second” and “Average I/O Response Time (ms)”. When the SVSP sync-mirror link is broken, changes to the active SVSP LUN have to be journaled, and once communication with the passive LUN is re-established, the pair has to re-sync. I/O performance drops for all VMs with VMDKs residing on the active SVSP LUNs while the sync-mirror LUN re-syncs. During the re-sync process, VMs with storage hosted at the active site saw a performance drop of 8%, and those hosted remotely a drop of 20%.
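Those drop figures come straight from comparing the monitored counters before and during the re-sync window. A small Python sketch of that comparison, using made-up sample values, might look like this:

```python
from statistics import mean

def pct_drop(baseline, during):
    """Percentage drop of the mean of `during` relative to the mean of `baseline`."""
    return (mean(baseline) - mean(during)) / mean(baseline) * 100

# Made-up samples of "Total I/O per Second" taken before and during the re-sync.
local_iops_baseline,  local_iops_resync  = [1000, 980, 1010], [930, 915, 920]
remote_iops_baseline, remote_iops_resync = [950, 940, 960],   [760, 755, 770]

print(f"local VMs:  {pct_drop(local_iops_baseline,  local_iops_resync):.0f}% IOPS drop")
print(f"remote VMs: {pct_drop(remote_iops_baseline, remote_iops_resync):.0f}% IOPS drop")
```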
Let’s look deeper though as to what this means from a VMware design aspect.
Problem #1 – Limits on cluster size and HA functionality at risk.
HA functionality is controlled by primary and secondary roles. The primary roles are assigned to the first five hosts added to the cluster, and they can change whenever a host-level HA event takes place, such as putting a host into maintenance mode or adding/removing a host from the cluster. The entire process is automated by vSphere; currently there is no supported way of controlling it. Because of this, stretched clusters should be limited to a maximum of 8 nodes (four per site). Go beyond that and place five or more nodes at one site, and it becomes possible for all of your primary HA nodes to end up at a single site. If that site goes down, the surviving site would have no primaries and no HA functionality.
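The eight-node limit follows from simple counting: with at most four hosts per site, five primaries can never all sit at one site. A throwaway Python check of various splits (my own illustration, not a VMware tool) makes the point:

```python
PRIMARY_COUNT = 5   # vSphere HA elects at most five primary nodes per cluster

def all_primaries_can_colocate(hosts_per_site):
    """True if any single site has enough hosts to absorb every primary role."""
    return max(hosts_per_site) >= PRIMARY_COUNT

for split in [(4, 4), (5, 5), (6, 4), (8, 8)]:
    print(f"split {split}: all primaries at one site possible? "
          f"{all_primaries_can_colocate(split)}")
# split (4, 4): False -> a site failure always leaves at least one primary
# split (5, 5): True  -> one site failure could take out every primary
```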
Problem #2 – DRS has no site affinity awareness.
Currently DRS has no way of knowing on which side of the cluster a virtual machine resides. DRS could move VMs between sites, negatively affecting both the bandwidth of the inter-site links and the applications running inside those VMs.
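To see why this matters, consider a naive balancer that, like DRS, picks the least-loaded host with no notion of sites. The sketch below is purely illustrative (not DRS’s actual algorithm, and the host names and loads are invented); it shows a VM whose storage lives at site A being moved to a host at site B purely on load:

```python
# Hypothetical hosts with current CPU load (%), tagged with a site DRS can't see.
hosts = [
    {"name": "esx01", "site": "A", "load": 85},
    {"name": "esx02", "site": "A", "load": 80},
    {"name": "esx03", "site": "B", "load": 30},
]

def pick_host_like_drs(hosts):
    """Load-only placement: no awareness of which site a host belongs to."""
    return min(hosts, key=lambda h: h["load"])

vm = {"name": "db01", "storage_site": "A"}
target = pick_host_like_drs(hosts)
print(f"{vm['name']} moved to {target['name']} at site {target['site']}")
if target["site"] != vm["storage_site"]:
    print("VM now does all of its I/O across the inter-site link")
```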
My recommendation to the client was to abandon the stretched cluster design and implement a proven DR solution such as SRM. In this specific environment, SRM is a perfect fit for the level of high availability they were looking for. With active and passive storage at both sites, the local active storage at each site would replicate to the passive storage at the remote site. This configuration allows the cluster to be split, giving better resource utilization across all hosts, direct insight into and control over where specific VMs are running, and a solution fully supported by VMware.
Finally, implementing this type of design with SRM means only 50% of your total VMs fail over in the case of a site failure, reducing the recovery complexity of any DR event.