Hello,
We have Dell FX2 chassis, eight FC640 servers in total. The networking exits the rear of the chassis, with 4 SFP+ ports per server: two are used for network traffic and two for storage traffic. Four-port gigabit NICs installed in the chassis PCIe slots provide two ports for ESXi management plus two "TAP" networks per ESXi host, which are helpful for mirroring traffic to virtual appliances such as web filtering and network monitoring.
We had a situation Monday evening where one blade lost its connection to shared storage. What was odd is that according to ESXi, the 10 gig interfaces were all up at 10000 Mbps full duplex; on the switch side (a Brocade TI24X), however, the port showed down. There are two of these switches and they are cross-connected, so I'm not sure why the host couldn't reach storage via the second NIC. The storage network on this distributed vSwitch has dvUplink2 active and dvUplink1 standby; the vMotion network is the opposite, with dvUplink1 active and dvUplink2 standby. The teaming settings are Route based on originating virtual port, failure detection set to Link status only, and both Notify switches and Failback set to Yes. I'm wondering if it didn't fail over because the server somehow thought the link was still up. Perhaps it always sees the link as up in a chassis, because the NIC in the server blade connects through the chassis backplane to the 16 SFP+ ports (this is a 4 blade, 2U chassis).
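For reference, here is a minimal read-only pyVmomi sketch that dumps the teaming/failover settings for each distributed portgroup, so the active/standby uplinks and the failure detection mode can be double-checked from the API side. The vCenter hostname and credentials below are placeholders, not real values.

```python
# Read-only dump of teaming/failover policy per distributed portgroup.
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim
import ssl

ctx = ssl._create_unverified_context()  # lab only; use proper certs in production
si = SmartConnect(host="vcenter.example.local", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.dvs.DistributedVirtualPortgroup], True)
for pg in view.view:
    team = getattr(pg.config.defaultPortConfig, "uplinkTeamingPolicy", None)
    if team is None:
        continue
    order = team.uplinkPortOrder
    print(pg.name)
    print("  load balancing :", team.policy.value)            # e.g. loadbalance_srcid
    print("  active uplinks :", list(order.activeUplinkPort))
    print("  standby uplinks:", list(order.standbyUplinkPort))
    print("  beacon probing :", team.failureCriteria.checkBeacon.value)
    print("  notify switches:", team.notifySwitches.value)
    print("  failback       :", not team.rollingOrder.value)   # failback = rollingOrder off
view.DestroyView()
Disconnect(si)
```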
So I was reading about changing failure detection from Link status only to Beacon probing. However, I've seen documentation that recommends at least THREE NICs, and not all connected to the same switch. We are already using all of our 10 gig NICs, so that's not an option. Is there any way beacon probing would work reliably with just two NICs?
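In case it helps the discussion, here is roughly what flipping a portgroup from Link status only to beacon probing would look like via pyVmomi, reusing the session from the snippet above. "dvPG-Storage" is a placeholder name, not the real portgroup, and this is a sketch, not a recommendation.

```python
# Hedged sketch: enable beacon probing on one dvPortgroup. With only two
# uplinks a failed beacon can't indicate which uplink is actually bad.
from pyVmomi import vim

def enable_beacon_probing(si, pg_name="dvPG-Storage"):
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.dvs.DistributedVirtualPortgroup], True)
    pg = next(p for p in view.view if p.name == pg_name)
    view.DestroyView()

    spec = vim.dvs.DistributedVirtualPortgroup.ConfigSpec()
    spec.configVersion = pg.config.configVersion      # must match current config
    port_cfg = pg.config.defaultPortConfig            # reuse existing port settings
    port_cfg.uplinkTeamingPolicy.failureCriteria.checkBeacon = vim.BoolPolicy(value=True)
    spec.defaultPortConfig = port_cfg
    return pg.ReconfigureDVPortgroup_Task(spec)
```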
Could both uplinks be combined in a port channel? I'm thinking of eliminating the two separate switch "islands" (Brocade TI24X) with something that can stack or MLAG and do LACP between them.
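On the port channel idea: if the two islands were replaced with something that can stack or MLAG, a static port channel would need both uplinks active with Route based on IP hash on the ESXi side (LACP proper is configured as a LAG on the vDS itself, which is a separate setup and not shown here). A rough sketch of that portgroup change, reusing "pg" from the previous snippet and the uplink names mentioned above:

```python
# Rough sketch only: portgroup teaming for a static port channel across an
# MLAG pair - both uplinks active, IP-hash load balancing.
spec = vim.dvs.DistributedVirtualPortgroup.ConfigSpec()
spec.configVersion = pg.config.configVersion
port_cfg = pg.config.defaultPortConfig
teaming = port_cfg.uplinkTeamingPolicy
teaming.policy = vim.StringPolicy(value="loadbalance_ip")      # Route based on IP hash
teaming.uplinkPortOrder.activeUplinkPort = ["dvUplink1", "dvUplink2"]
teaming.uplinkPortOrder.standbyUplinkPort = []
spec.defaultPortConfig = port_cfg
pg.ReconfigureDVPortgroup_Task(spec)
```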
This is jumbo frame 10 gig Ethernet for NFS and vMotion. These switches are not connected to the rest of the network, except for an OOB management port for SNMP / CLI access.
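Since jumbo frames are in play, here is a quick read-only check (same pyVmomi session as above) that the jumbo MTU is set consistently on the ESXi side, i.e. the vDS maxMtu and each vmkernel port's MTU; the Brocade ports would have to be verified separately.

```python
# Read-only MTU check on the vSphere side: vDS maxMtu plus the MTU of every
# vmkernel interface (management, NFS, vMotion). Reuses "content" from above.
dvs_view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.dvs.VmwareDistributedVirtualSwitch], True)
for dvs in dvs_view.view:
    print(dvs.name, "maxMtu =", dvs.config.maxMtu)
dvs_view.DestroyView()

host_view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True)
for host in host_view.view:
    for vnic in host.config.network.vnic:          # vmk interfaces
        print(host.name, vnic.device, "mtu =", vnic.spec.mtu)
host_view.DestroyView()
```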
The resolution was to hard power off the host; HA then took over and powered the affected VMs back on across the other working hosts.
The next day I replaced the SFP+ module on the chassis side and got link back up, but I was only able to ping random IPs on the storage network. I had to power off the ESXi host and remove the blade from the chassis to fully de-energize it. After reinsertion and boot-up, the ESXi server appears fine and we are now running non-critical VMs on it without issues.