I was recently at a customer's site doing an implementation of a FlexPod. The client wanted to do iSCSI booting of their ESXi 5.0 servers, as there was no local storage, nor was the environment capable of Fibre Channel. This experience was quite an interesting one, as we were originally unable to get the iSCSI booting to work correctly.
What I experienced was an error during the install process of ESXi 5.0. We were able to see the NetApp LUN correctly during the installer, however when the installer hit 90%, we got the following error: “Expecting 2 bootbanks, found 0”. Doing an Alt-F12 during the install and watching the logs more closely, at ~90% the installer claims that there is no IP address and begins to look for DHCP, in addition claiming it can no longer see the disk. The odd thing is that during the configuration of the Service Profile and the iSCSI NIC, at no time did we choose DHCP; we chose Pooled. Since there is no DHCP server in that subnet, the NIC doesn't pick up an address and thus loses connectivity to the LUN.
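If you're debugging a similar failure, a few sanity checks from the installer's shell (Alt-F1) can confirm whether the iSCSI NIC has dropped its address mid-install. This is just a sketch; the exact adapter and interface names (e.g. vmhba33, vmk0) are examples and will differ on your system:

    # List iSCSI adapters; the boot adapter should appear here
    esxcli iscsi adapter list
    # Check the VMkernel interface IPv4 config; if this shows DHCP
    # with no address, the installer has dropped the pooled/static IP
    esxcli network ip interface ipv4 get
    # List active iSCSI sessions; an empty list means the path to the
    # NetApp LUN is gone, which matches the bootbank failure above
    esxcli iscsi session list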
The installer would then error out, and the only option was to reboot. Oddly, ESXi would actually boot from the iSCSI LUN afterwards, but it was an incomplete install: none of the settings specified during setup were taken, and the VMware Tools ISOs were not available to any VMs on the host; in fact, they didn't appear to be loaded on the host at all.
To make a long story short, we called Cisco TAC and were told: “Currently, there are NO UCS blades that are officially supported to work for iSCSI Boot with ESXi 5.0; the latest version to be supported is ESXi 4.1”. We tried ESXi 4.1, and it worked perfectly. I had also spoken with a colleague who was going through the same issues; they stated they had fixed it, but as it turned out it was a “lucky fix”. So something changed with the ESXi installer from 4.1 to 5.0 in the way it handles the NICs.
When I returned on site after the Christmas break, we tried the “lucky fix” with no success. We then looked into the ESXi installer logs much more closely, trying to figure out what exactly was going on. Decrypting the sometimes vague messages, the installer appeared to be resetting the first NIC and looking for a management IP, NOT the iSCSI IP. We then looked at the Service Profile configuration, more specifically the “vNIC/vHBA Placement Policy”. We had set a specific placement policy so that certain NICs would always enumerate the same way, e.g. vmnic1, vmnic2, etc. In that placement policy we had the vHBA of the server first, followed by the iSCSI Overlay vNIC, then the rest of the normal vNICs. We moved the Overlay vNIC to the bottom of the order and re-ran the installer, and it worked perfectly this time. Watching the logs, the installer was still resetting that first NIC and looking for DHCP, but this time the Overlay vNIC wasn't the first one, so the iSCSI adapter maintained its IP and thus its connectivity to the disk. We were able to repeat this fix on every other server.
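To illustrate the change (the device names here are placeholders, not the customer's actual configuration), the placement order went from something like this:

    Failing order (Overlay vNIC first among the NICs):
      1. fc0      (vHBA)
      2. iscsi-a  (iSCSI Overlay vNIC)  <- reset by the installer, IP lost
      3. eth0
      4. eth1

    Working order (Overlay vNIC moved to the bottom):
      1. fc0      (vHBA)
      2. eth0     <- installer resets this one instead
      3. eth1
      4. iscsi-a  (iSCSI Overlay vNIC keeps its IP, LUN stays visible)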
So there is still an issue Cisco & VMware need to work out; whether the eventual fix is a firmware update or simply a procedural how-to is yet to be determined.
In the meantime, I've created a “How-to” of my own. This document covers the iSCSI configuration portion of a FlexPod install, dealing with the UCS portion as well as the NetApp. It assumes you know how to navigate the GUIs. I used FilerView for this doc, but the premise is the same if you're using System Manager or the CLI.
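For the CLI-inclined, the NetApp side of the boot LUN setup looks roughly like the following on Data ONTAP 7-Mode. This is a minimal sketch; the volume, LUN, and igroup names and the initiator IQN are examples, so substitute your own (the IQN must match the one defined in the blade's Service Profile):

    # Make sure the iSCSI service is running on the filer
    iscsi start

    # Create a volume to hold the boot LUNs (assumes aggregate aggr1)
    vol create vol_esxi_boot aggr1 100g

    # Create a 10 GB boot LUN for the first host, typed for VMware
    lun create -s 10g -t vmware /vol/vol_esxi_boot/esxi01_boot

    # Create an iSCSI igroup containing the host's initiator IQN
    igroup create -i -t vmware esxi01 iqn.1992-08.com.cisco:ucs-host:1

    # Map the LUN to the igroup as LUN ID 0, which boot-from-SAN expects
    lun map /vol/vol_esxi_boot/esxi01_boot esxi01 0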
FlexPod iSCSI Boot-Fixed