Help interpreting ESXTOP numbers

HCHTech

I'm troubleshooting some performance issues on an aging ESXi host and need some help figuring out what ESXTOP is trying to tell me. Anyone here comfortable enough with this to take a crack at it? I'll save the screencaps for an affirmative response.

TIA!
 
Leave it to VMware to rename an ancient and well-loved *nix tool... But they did add disk and NIC performance stats to it, so there's that.

What about it is confusing? The only thing I can think of that might get wonky if you're used to Windows is the CPU "load", because that's not a percentage. It's LOAD! Load counts the processes that are running or waiting for CPU time, so a load of 4 on a four-core machine means the CPUs are exactly saturated. A load of 10 on a quad core indicates a problem, because we've got 4 running and 6 waiting in line... the CPU is overwhelmed. But a load of 10 on a 24-core system just means the CPUs are a bit busy.

Load is a bucket more accurate than a CPU percentage meter, but it does require some understanding to interpret.
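
If it helps to see the arithmetic, here's a quick Python sketch (nothing ESXi-specific, it just reads the load average on any Linux/Unix guest; the thresholds are only rules of thumb I use, not anything official):

```python
# load_check.py - rough load-vs-cores check for a Linux guest (illustrative only)
import os

def load_status():
    one_min, five_min, fifteen_min = os.getloadavg()  # same numbers `top` shows
    cores = os.cpu_count()
    if one_min <= cores:
        verdict = "CPUs keeping up"
    elif one_min <= cores * 2:
        verdict = "busy - some processes waiting for CPU"
    else:
        verdict = "overwhelmed - run queue far exceeds core count"
    return one_min, cores, verdict

if __name__ == "__main__":
    load, cores, verdict = load_status()
    print(f"1-min load {load:.2f} on {cores} cores: {verdict}")
```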
 
The complaints are just slowness. They are only coming from the users of the "R1" and "R3-Win10" machines, explained below. Obviously, this is dependent on load, so after-hours or over a weekend when I look and test, everything works fine for me. So these captures are taken in the middle of a workday.

This is a PowerEdge R720 with 2x Xeon E5-2670s (20 physical cores), 128GB RAM, and two disk arrays: a RAID 10 of spinners with 10TB capacity, and a RAID 1 of SSDs with 1TB capacity.

The following VMs are on the array of spinners:

SB-SERVER2 = Server 2012 Domain Controller
APP-SERVER2 = Server 2012 Application Server
CompuVM = Win10 machine housing the Ubiquiti controller & similar
vCenter65 = management interface
MetroXP = WinXP workstation for accessing historical LOB app - normally powered off
PowerChute = small Linux machine for managing the dual UPSes

The following VMs are on the array of SSDs:

R1 = Win10 workstation currently in use by an employee at a satellite office
R1-Win10 = Win10 workstation powered on, but no current user
R2-Win10 = Win10 workstation powered on, but with no current user
R3-Win10 = Win10 workstation currently in use by a 2nd employee at a satellite office

Both of the "idle" workstation vms are intended to other users at the satellite office (who are currently remoting into physical machines), but I don't want to proceed with this until I solve the performance issues, or deem them unsolvable.

Ok, here are the caps. For the CPU and Memory screens, I have included both "full" displays and "VM only" displays.

Obvious observations: The CPU "%WAIT" column seems like the canary in this coal mine, but "%RDY" doesn't look like I would expect it to if processing power were in short supply. There has been no over-allocation of resources.

I would also expect that the disk array of spinners would be a clear bottleneck, but the disk screens don't seem to bear that out...

CPU Full Screen: (screenshot attached)
CPU - VMs Only: (screenshot attached)
Memory, Full Screen: (screenshot attached)
Memory, VMs Only: (screenshot attached)
Disk Adapter: (screenshot attached)
Disk Device: (screenshot attached)
Disk - VM: (screenshot attached)
Network: (screenshot attached)
Power Management: (screenshot attached)
 
DQLEN of 192? You're disk I/O bound... CPU and RAM are fine; the disks aren't keeping up.

Well, at least that is where I thought the problem would lie; I just didn't know how to interpret that number. What is a "normal" DQLEN value, then? Or at least one that doesn't cause you to suspect a problem...

The other disk listed there has a DQLEN value of 64, is that also suspect? The last one listed is the SD-card RAID-1 that the host OS runs on, but I'm ashamed to admit I'm not sure what the first entry in that screen is even referring to...the one that starts with mpx. It's showing a queue length of 31.

Also, I need to confirm which array the 192 DQLEN value is for. I'll have to check if the device name string is listed somewhere in the management interface to figure that out. The complaints are coming from users whose workstation VMs are on the SSD array; there are 45 physical workstations in the place using the DC & AppServer that are not complaining at all. Maybe the bottleneck is the RAID card?
 
SQLEN is Disk Queue Length...

Windows has a similar metric; it's how many filesystem requests the volume in question is behind.

But yeah, you've got two drives there that need help; one is a bit slow, the other is crying in a corner from the beating.
 
DQLEN is Disk Queue Length...

FTFY. Both of those entries are drive arrays, not single drives. I'll comb through the management interface to see if I can confirm which array has the 192 and which has the 64... and what the heck the first line even means, since I don't have any other disk arrays. I also notice on the disk adapter screen that everything is on one channel on the RAID card? I think? Wow, I would have done that differently had I been the one to build it.

Also, I have more reading to do. Here is a short explanation of those values from some random ESXi dude on the internet (so it has to be right, right?):
  • DQLEN–this is the configured queue depth limit for the datastore. This value is identified by looking at the configured HBA queue depth limit, which is generally 32 (QLogic FC is the exception at 64, or Software iSCSI which is 128). If there is more than one VM on the datastore then this value is the minimum of the HBA device queue depth OR the Disk.SchedNumReqOutstanding (which is a per-device setting, which defaults to 32). Whichever is smaller.
  • ACTV–this is the number of slots currently in use by the workload going to the datastore. This value will never exceed DQLEN
  • QUED–this value is populated if the workload exceeds what DQLEN allows. If ACTV = DQLEN, anything over and beyond that will be queued to wait to be sent to the device. This shows how many I/Os are currently waiting to be sent. When this value is above zero, you are going to start seeing latency in the guest. This latency will not be reflected on the FlashArray! The timer for latency on the FlashArray starts as soon as it is submitted, so in other words, the timer on the FlashArray starts when the I/O enters the ACTV state. If you see latency difference between the array and in ESXi, it is usually because I/Os are building in QUED. If there is a no difference between the ESXi latency and the FlashArray, but a difference between those two and the virtual machine, you are queuing in the virtual machine.
So...DQLEN is the CONFIGURED queue length, not the ACTIVE queue length. At least as I read it.
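
Trying to turn that quote into something concrete for myself - this is just my reading of it, with made-up numbers and helper names of my own (effective_dqlen, queue_state), not anything pulled from my host:

```python
# queue_math.py - my reading of how esxtop's DQLEN / ACTV / QUED relate (illustrative only)

def effective_dqlen(hba_queue_depth: int, sched_num_req_outstanding: int, vms_on_datastore: int) -> int:
    """DQLEN is a configured ceiling: the HBA device queue depth, capped by
    Disk.SchedNumReqOutstanding when more than one VM shares the datastore."""
    if vms_on_datastore > 1:
        return min(hba_queue_depth, sched_num_req_outstanding)
    return hba_queue_depth

def queue_state(outstanding_ios: int, dqlen: int):
    """ACTV can never exceed DQLEN; anything beyond the ceiling piles up in QUED,
    and QUED > 0 is where guest-visible latency starts."""
    actv = min(outstanding_ios, dqlen)
    qued = max(0, outstanding_ios - dqlen)
    return actv, qued

if __name__ == "__main__":
    # Hypothetical numbers, not my host's: a 64-deep device queue, four VMs on the
    # datastore, the default SchedNumReqOutstanding of 32, and 40 I/Os in flight.
    dqlen = effective_dqlen(hba_queue_depth=64, sched_num_req_outstanding=32, vms_on_datastore=4)
    actv, qued = queue_state(outstanding_ios=40, dqlen=dqlen)
    print(f"DQLEN={dqlen} ACTV={actv} QUED={qued}")  # QUED of 8 -> guests would feel latency
```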
 
Your numbers seem to indicate that too, as the queue numbers are high but the percentages are low...

So if that's just a configured limit, we're back to WTF is going on, because none of these load numbers seem particularly bad.
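
For reference, these are the rough thresholds I eyeball esxtop numbers against (folklore numbers, and the helper below is mine, not anything esxtop ships with). And as I understand it, %WAIT includes idle time, which is why it can look scary when nothing is actually wrong:

```python
# esxtop_rules_of_thumb.py - folklore thresholds for eyeballing esxtop output (illustrative only)

def cpu_contention(rdy_pct: float, num_vcpus: int) -> str:
    """%RDY is reported for the whole VM group, so normalize per vCPU.
    Roughly: under ~5% per vCPU is fine, ~5-10% is worth watching, over ~10% hurts."""
    per_vcpu = rdy_pct / max(num_vcpus, 1)
    if per_vcpu < 5:
        return "fine"
    if per_vcpu < 10:
        return "watch it"
    return "CPU contention"

def disk_pressure(qued: int, gavg_ms: float) -> str:
    """QUED > 0 means I/Os are stacking up behind DQLEN; GAVG/cmd is roughly the
    latency the guest sees (sustained ~20-30+ ms tends to feel slow)."""
    if qued > 0 or gavg_ms > 25:
        return "storage is struggling"
    return "storage looks OK"

if __name__ == "__main__":
    print(cpu_contention(rdy_pct=12.0, num_vcpus=4))   # 3% per vCPU -> "fine"
    print(disk_pressure(qued=0, gavg_ms=8.0))          # -> "storage looks OK"
```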
 
If it's just happening to a subset of the VMs, I'd lean towards a problem with those instances, not an issue with the underlying platform. R1 and R3 are both remotes, so one thing I'd look at is networking. But first I'd try to get them to be more specific on the symptoms. What exactly is slow, and when? Are both R1 and R3 at the same remote office? Also, have you confirmed the underlying VM configs are all identical? Have them create a simple log of problems including date and time. I like to run a continuous ping from remote to destination for a while to check for latency - just make a script for them to launch and pipe the output to a file (something like the sketch below). Depending on the situation I'll do both IPs and FQDNs if needed.
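
Something along these lines works for the ping logging - the target address and log path are placeholders, obviously:

```python
# ping_log.py - continuous ping with timestamps, appended to a log file (placeholder target/path)
import platform
import subprocess
import time
from datetime import datetime

TARGET = "192.168.1.10"        # placeholder: IP or FQDN of the machine they RDP into
LOGFILE = "ping_log.txt"       # placeholder: where to drop the log
COUNT_FLAG = "-n" if platform.system() == "Windows" else "-c"

with open(LOGFILE, "a") as log:
    while True:
        result = subprocess.run(
            ["ping", COUNT_FLAG, "1", TARGET],
            capture_output=True, text=True
        )
        stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        # Grab the reply (or timeout) line and tag it with a timestamp
        reply = next((l for l in result.stdout.splitlines() if "time" in l.lower()), "no reply")
        log.write(f"{stamp}  {reply.strip()}\n")
        log.flush()
        time.sleep(1)
```

They just launch it before they start work, and you end up with a latency/loss timeline you can line up against their complaint log.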
 
Are R1 and R3 at the SAME satellite location? Because if they are... that seems to indicate an ISP issue there honestly. RDP HATES intermittent frame loss, turns into random 1-3 second lag clicking on stuff and generally drives users up a wall.
 
Just to clarify, both workstation VMs are being accessed by folks at the same satellite location. There are two other employees there as well, but they access physical machines at the main office and do NOT complain of slowness. The two sites are connected by a full-time VPN tunnel and everyone is just using RDP.

So... because the folks RDPing into physical machines don't complain, I'm guessing the culprit is more likely to be the VMs themselves (or the host) than the internet connection. There is so much I don't know about the values displayed by ESXTOP (and how they interplay) that I'm still far from confident the answer isn't staring me in the face.

All four of the workstation VMs have exactly the same resource allocations, and all four are on that same SSD array. When I first started poking around on this problem a couple of weeks ago, I found that those VMs were inadvertently configured to have their swap files located on the spinning-disk array - I guess that's the default. I changed that configuration for all 4 machines to put the swap file on the SSD array and was sure that would solve the problem, but the reports I'm getting are that it didn't help at all.

One thing I could try is setting up one of the complainers on one of the currently-unused workstation VMs to see if it is any better than the VM they are currently using.

I'm going to start getting some real numbers to the complaints. I'll develop a list of tasks and clock them on different VMs at different times of day and after-hours. That will give me some real data to work with at least...
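
Probably something like this to keep the timings honest (a rough sketch; the task list is a placeholder until I settle on the real tests):

```python
# task_timer.py - crude stopwatch for clocking the same tasks on different VMs (illustrative)
import csv
import time
from datetime import datetime

TASKS = [
    "Open the LOB application",
    "Run the end-of-day report",
    "Open a large document from the file share",
]  # placeholder task list

with open("timings.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for task in TASKS:
        input(f"Press Enter to START: {task}")
        start = time.perf_counter()
        input("Press Enter when it FINISHES")
        elapsed = time.perf_counter() - start
        writer.writerow([datetime.now().isoformat(timespec="seconds"), task, f"{elapsed:.1f}"])
        print(f"{task}: {elapsed:.1f} s")
```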
 