vKoolAid: 2009

Tuesday, November 17, 2009

vSphere performance overview charts working and printing

I was getting the IE "Navigation to the webpage was canceled" message for the performance overview window after upgrading to vSphere.
The fix was to check the URL in the following file and change it to a DNS name that resolves from every client:
C:\Program Files\VMware\Infrastructure\VirtualCenter Server\extensions\com.vmware.vim.stats.report\extension.xml
(thanks to: http://communities.vmware.com/thread/233471).
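If you want to eyeball the file for the usual culprit, a quick scan for URLs pointing at a bare short hostname (or localhost) does the job. This is a minimal Python sketch, not a VMware tool: I'm not reproducing the real element layout of extension.xml here, so it just regex-scans the raw text, and the name suspect_urls is made up for illustration.

```python
import re
from urllib.parse import urlsplit

# Hypothetical helper: the actual XML structure of extension.xml is not
# shown in this post, so we scan the raw text for http(s) URLs instead.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def suspect_urls(xml_text):
    """Return URLs whose host is unlikely to resolve from every client."""
    suspects = []
    for url in URL_RE.findall(xml_text):
        host = urlsplit(url).hostname or ""
        # A host with no dots ("localhost", "vcserver") only resolves on the
        # server itself or inside one DNS search domain.
        if "." not in host:
            suspects.append(url)
    return suspects
```

Feed it the file contents, e.g. suspect_urls(open(path).read()), and change any hit to a fully qualified name.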

Next I discovered that no, there is no print button. BUT! You can use IE's own print function, which is available from the right-click menu in a few small whitespace areas. Here's one:

The second chart down is Memory (MBps). Right-click in the empty whitespace between this graph and the left side of the frame. One of the IE menu options is Print, another is Print Preview. There may be a spot on the first graph or elsewhere on the page, but this one is a sure thing. Total hack, brought to you by me without Google's help.

Thursday, October 1, 2009

vmware vMA 4

It would be nice if vMA worked with VirtualCenter 2.5. I was sad
to get an error message today. It looks like targeting ESX 3.5 hosts
directly (even if they are managed by a 2.5 VirtualCenter) works.

Thursday, July 30, 2009

VMWare Rap

The obvious winner in a contest (that's still running):
http://www.youtube.com/watch?v=zBQgfWdrUCs

The vid reminded a co-worker of this one (save yourself and click 7 minutes in):
http://video.google.com/videoplay?docid=4915875929930836239

Thursday, July 16, 2009

VCP4 beta

Time allowed: 4 hrs 15 minutes. Time taken: 3 hrs 25 minutes with no
review. 270 questions.
Things I didn't expect to be covered in such detail: NFS, SAN boot,
Data Recovery (DataRec), maximums
Things that were missing: vZones, other add-ins coming soon (Chargeback, etc.)

Surprisingly few really stupid questions of the "On such-and-such screen,
what is your configured ability for xyz? 1/2/4/8?" variety.

Thursday, May 21, 2009

Doh! I can't delete my datastore!

This is for ESX 3.5; I'm not sure whether it works for ESX 4.

Run on one host:
ls -l /vmfs/volumes
Find the datastore ID (the target of the symlink named after the datastore).
Run on all hosts:
vmware-cmd -l | grep datastoreID

In my case, it was a template that still had a reference to the datastore.
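If you capture the output of both commands, the lookup can be scripted. Here's a minimal Python sketch that works on the captured text, assuming the usual ls -l symlink format (name -> ID) and datastore names without spaces; both helper names are hypothetical.

```python
def datastore_ids(ls_output):
    """Map datastore name -> ID from captured `ls -l /vmfs/volumes` output."""
    ids = {}
    for line in ls_output.splitlines():
        # Friendly names are symlinks: "... san_lun_07 -> 49a1b2c3-..."
        if " -> " in line:
            left, target = line.split(" -> ", 1)
            # Assumes the datastore name contains no spaces.
            name = left.split()[-1]
            ids[name] = target.strip()
    return ids


def vms_on_datastore(vmware_cmd_output, ds_id):
    """Return registered paths from captured `vmware-cmd -l` output that mention ds_id."""
    return [l.strip() for l in vmware_cmd_output.splitlines() if ds_id in l]
```

Collect the vmware-cmd -l output from every host first, then run vms_on_datastore over each one with the ID you got from datastore_ids.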

There may be a more elegant solution using RCLI so you don't need to
run the command on all your hosts, but this was quick and dirty.

Wednesday, March 4, 2009

Old Dell hardware fluke

raises its ugly head. Updating the PERC controller on a 2850 makes
the local VMFS volume inaccessible, even on 3.5 U3. Luckily the
workaround commands from 3.0 are still valid. KB article
1001577: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1001577

Now, I thought I had updated to the latest PERC firmware... back to the
software update repos...

Tuesday, February 10, 2009

stupid easy PS vm disk info

There are lots of fancy ways to get ESX disk info from PowerShell.
Some that I found gave "interesting" results, so this is my KISS
version:
Get-VM | Get-HardDisk | Export-Csv c:\temp\disk.csv

I strip out the columns I don't need and then use Excel's Data -> Text to Columns
to split the [datastore] vm_name\vmdiskname.vmdk path into pieces.
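The text-to-columns step can also be scripted. A minimal Python sketch, assuming the CSV has a Filename column holding the [datastore] folder/disk.vmdk path (check your own header row) and that Export-Csv's leading #TYPE line may be present; the helper names are made up.

```python
import csv
import re

# "[datastore] folder/disk.vmdk" -> datastore name + the rest of the path
DISK_PATH_RE = re.compile(r"^\[(?P<datastore>[^\]]+)\]\s*(?P<rest>.*)$")


def split_disk_path(path):
    """Split "[datastore] vm_name/disk.vmdk" into (datastore, folder, vmdk)."""
    m = DISK_PATH_RE.match(path.strip())
    if not m:
        return None
    # Accept either separator: the post shows a backslash, datastore paths use "/".
    parts = re.split(r"[\\/]", m.group("rest"))
    return m.group("datastore"), "/".join(parts[:-1]), parts[-1]


def disks_from_csv(csv_text, column="Filename"):
    """Yield (datastore, folder, vmdk) tuples from an Export-Csv dump."""
    lines = csv_text.splitlines()
    # Export-Csv writes a "#TYPE ..." line first unless -NoTypeInformation is used.
    if lines and lines[0].startswith("#TYPE"):
        lines = lines[1:]
    for row in csv.DictReader(lines):
        parsed = split_disk_path(row.get(column) or "")
        if parsed:
            yield parsed
```

Usage: list(disks_from_csv(open(r"c:\temp\disk.csv").read())) gives one tuple per disk, ready to sort by datastore.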

raw data good.

"By Design"

To follow up on "Stumped stumped stumped": there is definitely some funny memory management going on behind the scenes that doesn't show up in the balloon driver or shared memory stats. Here's part of the email thread; the top is a much smarter colleague giving his guess about the behavior:
----
I have to say that I don't quite get it. Without balloon driver activity, there is no induced memory pressure on a guest. It would have to be...

Ah ha! (maybe). What they're (barely) saying seems to require dynamic behavior from the active memory algorithm. We've talked about how the "working set" cannot be precisely defined. The active flag for a page could be tied to a decision threshold proportional to the real memory versus granted memory ratio or something similar. So if you populate a 16GB server with only a 1GB guest, any page that was ever used will remain active. As you start adding guests (or allocating more guest memory), the threshold lowers, and some pages that were counted as active now expire based on some ranking attribute involving age, frequency, and/or patterns of past use. And all this would happen even if there were memory to spare, because the hypervisor starts preparing for heavier use preemptively.
Good theory?

This isn't what you describe but could explain some of the observed behavior.

Are you saying that newer processors with hardware-assisted VT (http://communities.vmware.com/docs/DOC-9150) don't perform page sharing? Or just that page sharing doesn't really start happening until the guest ratio kicks up? Or that it depends on something we don't understand yet? [[ Depends on something else or guest ratio... ]]

I love this quote from doc 9150: "However, TLB misses are much more expensive in a nested paging environment[*], so workloads that over-subscribe the TLB are potentially still good candidates for binary translation without hardware assistance."

[*] from the AMD-V + RVI feature set that the AMD Opteron 2300s in our R805 have, enabled by monitor.virtual_mmu = "hardware".
http://www.amd.com/us-en/0,,3715_15781,00.html?redir=SWOP08 has a little more history about support in VMware than I'd seen.

That is mysterious and ineffable. So hardware VT is sometimes bad, but good luck figuring out when. Down the rabbit hole we go.

-A
-----
My email to VMware support to close the case:

Melori (and other support),
It seems like this might be a wild goose chase. I spent yesterday and today trying to recreate the memory differences and couldn't. I de-populated one of the old servers, and it now shows the same memory values on two different sets of cloned guests as the new (fairly empty) servers.
It looks like any server with a high guest ratio handles memory differently than one with only a few guests. The metaphor I keep coming back to is shoving multiple pillows into a pillow sack.
They compress without losing their ability to work, but without page sharing (so it seems... the pages-shared numbers didn't fluctuate).
So it does seem to be "by design", but it's harder for capacity prediction models to work with.

Monday, January 12, 2009

Stumped stumped stumped

I've been running around in circles for months now about how our Linux
(RHEL5) VMs are gobbling up memory. We've got older Intel 64-bit
dual-socket dual-core machines with HT and new shiny AMD 64-bit NUMA RVI
dual-socket quad-core machines. Memory usage on the AMD boxen is through
the roof, as in 80-90% utilized when left to its own devices, while the
Intel boxes stay around 50-60%. RVI is supposed to save the world by
dumping one layer of memory virtualization. Virtualized MMU (hardware
page table virtualization) should 'just work'. The host should see
roughly what the guest sees.

So how come some guest cron jobs take most of the available (guest)
physical memory and then shove it into cache? The host still thinks the
memory is needed, so it's not readily available for other guests. I ran
across the wonders of /proc/sys/vm/drop_caches, and an echo 1 into it
brings my physical memory usage from 450MB to 90MB (and dumps the
buffers / caches). The host still shows the memory in use by the guest
(esxtop, VC stats), but it's not. No swapping, no idea what the hell is going on.
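For reference, echo 1 > /proc/sys/vm/drop_caches (run sync first) frees clean page cache inside the guest, and my understanding is that the host can't tell which guest pages are just cache. Inside the guest, /proc/meminfo shows the split between "used" and "used minus buffers/cache". A minimal Python sketch, assuming the standard "Key: value kB" layout; the helper names are made up.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of integer values (kB for most keys)."""
    info = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        fields = value.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def used_mb(info):
    """Return (used incl. buffers/cache, used excl. buffers/cache) in MB."""
    used = info["MemTotal"] - info["MemFree"]
    # Buffers and Cached are reclaimable by the guest, but the host still
    # sees those guest-physical pages as touched.
    app_used = used - info.get("Buffers", 0) - info.get("Cached", 0)
    return used // 1024, app_used // 1024
```

Run used_mb(parse_meminfo(open("/proc/meminfo").read())) before and after the drop_caches echo; the first number should fall toward the second.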

I would love to blame Red Hat and their shitty setup (come on,
Bluetooth running on a minimal install?), but a Debian box has a
similar issue. SELinux? It's not as horrible with SELinux off, but
that isn't the root cause either.

But my real question with allathis is: Why the hell can't I find
anyone else with beef on how this is working? I have google kungfu
skillz and can't find anything that talks about this as a performance
problem (or not a problem).

yech.