Tuesday, February 10, 2009

"By Design"

To follow up on "stumped stumped stumped": there is definitely some funny memory management going on behind the scenes that doesn't show up in the balloon driver or shared memory stats. Here's some of the email thread; at the top, a much smarter colleague gives his guess at the behavior:
I have to say that I don't quite get it. Without balloon driver activity, there is no induced memory pressure on a guest. It would have to be...

Aha! (maybe). What they're (barely) saying seems to require dynamic behavior from the active-memory algorithm. We've talked about how the "working set" cannot be precisely defined. The active flag for a page could be tied to a decision threshold proportional to the real-memory-versus-granted-memory ratio, or something similar. So if you populate a 16GB server with only a 1GB guest, any page that was ever used will remain active. As you start adding guests (or allocating guest memory), the threshold lowers, and some pages that were counted as active now expire based on some ranking attribute involving age, frequency, and/or patterns of past use. And all this would happen even if there were memory to spare, because the hypervisor starts preparing for heavier use preemptively.
Good theory?
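To make the theory concrete, here's a toy sketch of it in Python. Everything here is invented for illustration (the function, the "age" ranking attribute, and the shape of the cutoff) -- ESX's actual active-memory algorithm is exactly the thing we don't know:

```python
def active_pages(pages, real_mem_gb, granted_mem_gb):
    """Toy model of the theory: pages stay 'active' while they beat an
    age cutoff proportional to the real:granted memory ratio."""
    # 16 GB real, 1 GB granted -> huge cutoff: anything ever touched
    # stays active. Grant more memory -> the cutoff drops -> older
    # pages expire, even with memory to spare. (Scale is arbitrary.)
    cutoff = real_mem_gb / granted_mem_gb
    # 'age' stands in for the unknown ranking attribute
    # (age, frequency, and/or patterns of past use).
    return [p for p in pages if p["age"] <= cutoff]

pages = [{"id": i, "age": i} for i in range(10)]  # ages 0..9

# One small guest on a big host: everything stays active.
lightly_loaded = active_pages(pages, real_mem_gb=16, granted_mem_gb=1)

# Pile on guests: cutoff 16/8 = 2, so only the freshest pages survive.
heavily_loaded = active_pages(pages, real_mem_gb=16, granted_mem_gb=8)
```

That would reproduce what we saw: the active count shrinking as the guest ratio climbs, with no balloon driver or page-sharing activity in sight.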

This isn't what you describe but could explain some of the observed behavior.

Are you saying that newer processors with hardware-assisted VT (http://communities.vmware.com/docs/DOC-9150) don't perform page sharing? Or just that page sharing doesn't really start happening until the guest ratio kicks up? Or that it depends on something we don't understand yet? [[ Depends on something else, or the guest ratio... ]]

I love this quote from doc 9150: "However, TLB misses are much more expensive in a nested paging environment[*], so workloads that over-subscribe the TLB are potentially still good candidates for binary translation without hardware assistance."

[*] from the AMD-V + RVI feature set, which the AMD Opteron 2300s in our r805 have, enabled by monitor.virtual_mmu = "hardware".
http://www.amd.com/us-en/0,,3715_15781,00.html?redir=SWOP08 has a little more history about support in VMware than I'd seen.
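There's concrete arithmetic behind "much more expensive," assuming 4-level page tables on both the guest and host sides (a common configuration, not something stated in the doc). In a nested (2D) walk, each of the guest's page-table reads is a guest-physical address that itself needs a full host-level walk, plus one more host walk for the final data address:

```python
def walk_refs(guest_levels=4, host_levels=4):
    """Memory references to service one TLB miss under nested paging.

    Each of the guest's page-table reads needs a full host-level walk,
    and the final guest-physical data address needs one more.
    """
    return (guest_levels + 1) * (host_levels + 1) - 1

# Nested 4x4 walk: 24 memory references per TLB miss.
# Native walk (no host levels): just the 4 guest levels.
nested = walk_refs()                    # 24
native = walk_refs(host_levels=0)       # 4
```

A 24-vs-4 worst case per TLB miss is why a workload that thrashes the TLB can lose with the hardware MMU even though every other path gets faster.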

That is mysterious and ineffable. So hardware VT is sometimes bad but good luck figuring out when. Down the rabbit hole we go.

My email to VMware support to close the case:

Melori (and other support),
It seems like this might be a wild goose chase. I've been trying to recreate the memory differences yesterday and today and can't do it. I de-populated one of the old servers, and it's showing the same memory values on two different sets of cloned guests as the new (fairly empty) servers.
It looks like any server with a high guest ratio will handle memory differently than one with a low number of guests. The metaphor I keep coming back to is shoving multiple pillows into a pillowcase:
they compress without losing their ability to work, but without page sharing (so it seems; the shared-pages numbers didn't fluctuate).
So it does seem to be "by design," but more difficult for capacity-prediction models to work with.


  1. "By default, ESX automatically runs 32-bit VMs (Mail, File, and Standby) with BT [i.e., no hardware virtualization features], and runs 64-bit VMs (Database, Web, and Java) with AMD-V + RVI."


  2. To add, here's a communities thread that talks about why most 32-bit VMs don't get the hardware MMU by default: http://communities.vmware.com/thread/194219

    Funny that this all seems to be coming out now after months of not knowing what the hell was going on.
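For my own reference, the per-VM override mentioned in the footnote above lives in the .vmx file. As I understand it (from our support thread, not from official docs), "automatic" is the default that produces the BT/AMD-V split quoted in item 1:

```
# Force hardware MMU (AMD-V + RVI) for this VM.
# "automatic" is the default; "software" would force shadow paging.
monitor.virtual_mmu = "hardware"
```

Worth remembering if a TLB-heavy 64-bit guest ever needs to be pushed back to software MMU for testing.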