Anyone having issues with Hyper-V hosts lately? Lazy NICs 'n virtual switches?

YeOldeStonecat

Thinking some weird update crept into some...but then again...I have Hyper-V hosts that have run for many months straight without an update...unless there was one of those stealthy ones Microsoft forces in occasionally even when updates are disabled.

Just had a number of Hyper-V hosts, over the past few weeks, end up with lazy virtual switches and even the local NIC of the host (like the one I'm working on now) not passing DNS requests. An IP connection is there...I can ping public IP addresses but can't resolve external names. And naturally the guests...the same.

Yeah DNS set to many different things, flushed cache, restarted DNS services (both)...it's like the whole host needs a reboot (which of course is disruptive).
 
Having TONS of Hyper-V issues on hosts running Server 2012 R2... Lately it is NOT uncommon for two dozen VM servers, out of a few hundred VMs, to simply stop responding!

Thus far Dell blames the network and Microsoft... Microsoft blames the network... Cisco says they have nothing to do with it.
We replaced the Cisco Fiber Channel switches with Dell ones... same crap.

Our Hardware:
Dell Blades in a Dell Chassis
Compellent Fiber Channel SAN
Emulex Host-Bus-Adapters
10G Broadcom Network Adapters


We are experiencing two (2) issues on impacted VMs: the network dropped/unreachable, AND/OR the storage/SAN unreachable.

I think the issue is VMQ (Virtual Machine Queue) in my case. See the links below, plus a quick check/disable sketch after them:
http://www.aidanfinn.com/?p=16876
http://www.reddit.com/r/sysadmin/comments/2k7jn5/after_2_years_i_have_finally_solved_my_slow/
http://serverfault.com/questions/278860/why-are-my-hyperv-vms-randomly-losing-connectivity

http://blogs.emulex.com/implementer...r-vms-losing-network-connectivity-workaround/
https://www.linkedin.com/pub/mark-jones/8/230/511
http://www.hyper-v.nu/archives/tag/vmq/
http://blogs.technet.com/b/scvmm/archive/2013/01/08/virtual-networking-in-vmm-2012-sp1.aspx

https://technet.microsoft.com/en-us/library/gg162704(v=ws.10).aspx

http://www.cc.gatech.edu/~lingliu/papers/2010/anwer-sigcomm-visa2010.pdf

http://up2v.nl/2014/06/16/hyper-v-2...ork-connections-be-carefull-with-emulex-nics/

http://www.hyper-v.nu/archives/hvre...-the-vmq-issue-with-emulex-and-hp/#more-24351

http://tweaks.com/windows/67075/increase-network-performance-of-hyperv-virtual-machines/

http://darrenmyher.com/2014/05/06/f...achine-performance-problem-on-server-2012-r2/
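
For a quick check on 2012 R2, VMQ can be viewed and switched off from PowerShell rather than digging through each driver's Advanced tab. A rough sketch (the adapter names are examples; check Get-NetAdapter for yours):

    # Show which physical NICs have VMQ turned on (see the Enabled column)
    Get-NetAdapterVmq

    # Disable VMQ on a single adapter
    Disable-NetAdapterVmq -Name "NIC1"

    # Or disable it on every Broadcom adapter in one shot
    Disable-NetAdapterVmq -InterfaceDescription "*Broadcom*"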



There is NOTHING in the network switches indicating an interface cycling, failing to negotiate duplex, etc.


On a Fiber Channel switch I see some strangeness from many days ago, but NOTHING explaining our current Hyper-V errors.

2015 Mar 27 14:03:53 Cisco-3 %PORT-2-IF_DOWN_LINK_FAILURE: %$VSAN 3%$ Interface fc1/10 is down (Link failure)
2015 Mar 27 14:05:52 Cisco-3 %PORT-2-IF_DOWN_LINK_FAILURE: %$VSAN 3%$ Interface fc1/10 is down (Link failure)
2015 Mar 27 14:05:52 Cisco-3 %PORT-5-IF_UP: %$VSAN 3%$ Interface fc1/10 is up in mode F
2015 Mar 27 14:09:56 Cisco-3 %PORT-4-IF_SFP_WARNING: Interface fc1/10, Low Rx Power Warning
2015 Mar 27 14:09:56 Cisco-3 %PORT-3-IF_SFP_ALARM: Interface fc1/10, Low Rx Power Alarm
2015 Mar 27 14:19:55 Cisco-3 %PORT-4-IF_SFP_WARNING: Interface fc1/10, Low Rx Power Warning cleared
2015 Mar 27 14:19:55 Cisco-3 %PORT-3-IF_SFP_ALARM: Interface fc1/10, Low Rx Power Alarm cleared




The strange thing is that moving an impacted VM to a different Hyper-V host (i.e. via Live Migration) suddenly causes it to start working again as if nothing happened!

CHECK INTO VMQ
 
Hmmm.
Oddly enough, it's recent for us too...like in the past month. A colleague of mine had an issue a few months ago with another client, but I hadn't experienced any with mine. His was a smaller client, with a smaller server using the default HP NICs, a model based on Broadcoms. I've generally avoided those, opting to install Intel NICs...so I silently said to myself, "that's what you get for not going with Intel NICs."

But there I was yesterday with an issue on a Hyper-V host with Intel NICs and all that going wonky.
And this morning, again, with another client...although that one has the native HP NICs.

It's like...DNS is simply failing to pass on the NIC.
I can ping via IP and get replies...
but ping via host name...nada...it won't resolve.
Bounce DNS
Flush
Register
Change forwarders
Heck, even manually punch in Google's or OpenDNS (rough commands below).
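
Roughly what that list maps to in an elevated prompt (the "Ethernet" alias is an example; Dnscache can refuse a restart on some builds):

    # Flush the local resolver cache and re-register in DNS
    ipconfig /flushdns
    ipconfig /registerdns

    # Bounce both DNS services (DNS Server on the DC, DNS Client locally)
    Restart-Service -Name DNS -Force
    Restart-Service -Name Dnscache -Force

    # Manually punch in Google / OpenDNS on the NIC
    Set-DnsClientServerAddress -InterfaceAlias "Ethernet" -ServerAddresses "8.8.8.8","208.67.222.222"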

...the client I worked on this morning was actually the client of one of our other techs...his other half works there...and he's on the case now, since he came in about 45 minutes ago. So I dunno the status of it yet; I punted it over to him to take over once he walked in.

Maybe it's ISP related, and they're not letting port 53 traffic through. //shrugs

But that server I had yesterday, on the Intel NICs...freaked me out...it's been running like a champ since I put it in last summer. Couldn't get traffic to pass outside the vSwitch; the vSwitch just wasn't settling in on the physical NICs.
 
Now that you mention it, yeah, I'm seeing similar strangeness too.

I haven't fully diagnosed the issues I'm having yet, and I think the symptoms I'm seeing may be a little different from what you're seeing, Stonecat, so this may be something unrelated.

I thought I was looking at a local DNS/DC misconfiguration (and haven't yet ruled that out) since, in my case, the issue seems to go away if the VMs are configured to get their DNS data from the router or a public DNS server.

What I'm seeing is more like a NIC metric issue; almost like Windows doesn't know which network interface to use, even if there's only one. I can ping domain names and they correctly resolve to IP addresses, but browsing web pages (both internal and external) works intermittently. Yet, if I change the TCP/IP DNS configuration, web pages load instantly every time.
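
If it is a metric thing, a quick sanity check is to list the interfaces and see which one Windows prefers (lower metric wins); "Ethernet" below is just an example alias:

    # List IPv4 interfaces in preference order
    Get-NetIPInterface -AddressFamily IPv4 |
        Sort-Object InterfaceMetric |
        Format-Table InterfaceAlias, InterfaceMetric, ConnectionState

    # Pin the metric instead of leaving it on automatic
    Set-NetIPInterface -InterfaceAlias "Ethernet" -InterfaceMetric 10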

Thinking about it, these issues started about the same time my (TCP/IP related) BSOD 139 issues started (the cause of which I'm still trying to pinpoint). Probably just coincidental ...
 
That is very strange indeed. Have you checked your firewall settings? I am NOT saying anyone in their right mind is going to block DNS or port 53, but I HAVE seen a Palo Alto firewall blacklist a couple of DNS servers because open recursion falsely flagged them as the source of a DNS amplification attack.

All I am saying is strange stuff happens.

I would use NSLOOKUP, i.e.:

    nslookup hostname                  (forward lookup)
    nslookup ip_address                (reverse lookup)
    nslookup hostname dns_server_ip    (verify against a different DNS server)
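
Resolve-DnsName does the same job on 2012 R2 if you'd rather stay in PowerShell; -DnsOnly skips the local cache and hosts file:

    # Query a specific DNS server directly, bypassing the resolver cache
    Resolve-DnsName www.microsoft.com -Server 8.8.8.8 -DnsOnly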


From there, perhaps use Wireshark to capture the data coming across the network. Another option is to mirror a port and capture from that.
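
If you do capture, a filter keeps the trace down to just DNS; in Wireshark/tshark capture-filter (BPF) syntax:

    udp port 53 or tcp port 53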


Make certain the FIRMWARE is up to date.

Confirm your DNS server IP addresses are set correctly and that the servers are reachable (i.e. if they're on a different subnet, the gateway must be correct, etc.).
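
Test-NetConnection covers the reachability part (the IP below is an example; it probes TCP 53, and most DNS rides UDP, but it proves the path and that something is listening):

    # Check that the configured DNS server answers on port 53
    Test-NetConnection -ComputerName 192.168.1.10 -Port 53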
 
Whelp NetWizz....gonna give your suggestion a shot.
Happened yesterday...AGAIN...new server. I deployed it about a month ago, set it down next to the old server. Had it running for about a week, then I ran my guest installs...the 2012 R2 Essentials...joined the domain, promoted it, began the "migration" process of taking over the roles...and then let her sit for over a week. Ran the SBS08 Exchange to O365 migration last weekend. Did a little more fiddling on the Essentials server...and then one night she went "offline"...plus the Hyper-V host. So I went onsite yesterday, bounced her...she came back. I built the second guest...the terminal server...went home to work on it some more...and it went offline AGAIN. Going onsite shortly...after reading your links and doing some more Google-Fu this morning, the first thing I'm gonna do is whack the VMQ.
As of today...we move their MicroEdge LOB app from SBS to the new Essentials server...and I begin moving over their other "stuff" this weekend...shared drives, printers, folder redirects, etc. So...she can't go offline again after today...she's about to be put into production.

http://alexappleton.net/post/77116755157/hyper-v-virtual-machines-losing-network

Thanks again Netwizz.
 
Our system is still screwed up too... We bought a second, new Compellent SAN, a new blade chassis, and a couple of Dell Fiber Channel switches (made by Brocade)...

I have had the engineers in here from the start... I doubt the results will be any different, but we are still having all sorts of trouble. Personally, I would like our government agency to dump Hyper-V completely in favor of VMware, but there are licensing and $$$ issues and everything else.

If this does not fix it, I think we need to have our on-staff attorneys look at our Enterprise Agreement with Microsoft and see if there is any way to back out gracefully, preferably with a refund.
 
I have several Hyper-V deployments and so far I have not seen any issues like what you are describing and hopefully it stays that way.

Have you run any packet captures to see what is happening to DNS-specific traffic?
 
Going to change things..."spread the load." I've never had issues sharing a single NIC with a single vSwitch carrying management and one or two servers...for the small setups. I just kicked in two more NICs (she has a 4-port onboard)...putting management on all 3...and each of the two servers on its own NIC.
Each vSwitch has a hard-coded IP address in the network range...subnet, gateway, first DNS set to an external/public DNS (Google)...and second DNS set to the IP of the DC. I usually set them up that way...but trying to spread things out...1x server per NIC/vSwitch now.
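
Roughly what that layout looks like in PowerShell, for anyone replicating it (switch, NIC, and VM names are examples):

    # One vSwitch per physical NIC, management OS allowed on each
    New-VMSwitch -Name "vSwitch2" -NetAdapterName "NIC2" -AllowManagementOS $true
    New-VMSwitch -Name "vSwitch3" -NetAdapterName "NIC3" -AllowManagementOS $true

    # Park each guest on its own switch
    Connect-VMNetworkAdapter -VMName "Essentials" -SwitchName "vSwitch2"
    Connect-VMNetworkAdapter -VMName "TerminalServer" -SwitchName "vSwitch3"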
 
And that didn't do it....2/3 of the way through the database copy with the MicroEdge people, the network fell asleep. GRRRR
Disabled TCP Offload on the guest NIC now...we'll see if that helps.
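
From inside the guest, that's along the lines of (assuming the default "Ethernet" alias):

    # Turn off large-send and checksum offloads on the guest NIC
    Disable-NetAdapterLso -Name "Ethernet"
    Disable-NetAdapterChecksumOffload -Name "Ethernet"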
 
Well, VMQ may still have been it. On the other, unused NICs...it was still enabled. I disabled it...as well as kicked in 2x more vSwitches.
So I had 3x NICs...and 3x vSwitches. I disabled the 4th NIC.
I moved the first VM guest to vswitch 2
I moved the second VM guest to vswitch 3.
Management enabled on all interfaces...and all alone on interface 1.

So far...so good (hate saying that, it will be cursed now).
Disabled TCP Offload on vSwitch 2...the heavy Essentials server.
Noticed throughput across her was not very impressive with VMQ disabled, though.
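
From what I've read, that's expected: on 2012 R2, vRSS in the guests depends on VMQ being enabled on the host NIC, so with VMQ off the vSwitch receive path funnels through a single core. Quick way to check the per-adapter state (PowerShell on the host):

    # Confirm current VMQ and RSS state per adapter
    Get-NetAdapterVmq | Format-Table Name, Enabled
    Get-NetAdapterRss | Format-Table Name, Enabled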
 