The key thing I needed to understand here seems to be that the virtual-to-physical address lookup table (the page table) is itself stored in memory, and is subject to the same L1/L2/RAM constraints as any other memory access. "Misses" here will affect the performance of prefetchnta. How to resolve that is going to take some thought.
I've looked at the data access numbers before. The problem is knowing whether a given number is "good" or "bad." What constitutes a "large" number?
Over a 300 second run (after a 1 minute warmup), I'm seeing:
DC Accesses: 326,574
DC Misses: 34,866
DTLB L1M L2H: 7,489
DTLB L1M L2M: 22,446
Misalign accesses: 200,655
Ret Inst: 1,532,894
DC Refills L2/NB: 34,843
Does this confirm our theory?
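One way to make raw counts like these easier to judge is to normalize them against each other rather than looking at absolute values. The sketch below does that arithmetic for the figures posted above. It assumes all counters were collected over the same run with comparable sampling, so that the ratios are meaningful; the struct and function names are made up for illustration. With these inputs the data cache miss rate works out to roughly 10.7% of accesses, and about three quarters of the L1 DTLB misses also miss in the L2 DTLB.

```cpp
#include <cassert>
#include <cstdio>

// Counter values as posted (300-second run). Names are illustrative only.
struct Counters {
    double dc_accesses;
    double dc_misses;
    double dtlb_l1m_l2h;  // L1 DTLB miss, L2 DTLB hit
    double dtlb_l1m_l2m;  // L1 DTLB miss, L2 DTLB miss
    double misaligned;
    double retired_inst;
};

// Returns numerator as a percentage of denominator.
double percent(double numerator, double denominator) {
    assert(denominator > 0.0);
    return 100.0 * numerator / denominator;
}

// Prints a few ratios that are easier to compare between runs than raw counts.
void print_ratios(const Counters& c) {
    const double l1_dtlb_misses = c.dtlb_l1m_l2h + c.dtlb_l1m_l2m;
    std::printf("DC misses / DC accesses:         %.2f%%\n",
                percent(c.dc_misses, c.dc_accesses));
    std::printf("L1 DTLB misses / DC accesses:    %.2f%%\n",
                percent(l1_dtlb_misses, c.dc_accesses));
    std::printf("L2 DTLB misses / L1 DTLB misses: %.2f%%\n",
                percent(c.dtlb_l1m_l2m, l1_dtlb_misses));
    std::printf("Misaligned / DC accesses:        %.2f%%\n",
                percent(c.misaligned, c.dc_accesses));
    std::printf("DC accesses / retired inst:      %.2f%%\n",
                percent(c.dc_accesses, c.retired_inst));
}
```

Calling `print_ratios(Counters{326574, 34866, 7489, 22446, 200655, 1532894})` prints the ratios for this run. Comparing the same ratios before and after a change (for example, with and without the hash table) is likely to say more than any single absolute number.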
While removing the hash table is possible (actually a single #define), the overall throughput of the app drops drastically without it. As much trouble as it causes, having it is better than not having it.
Partitioning it may prove more practical. The processor in this box is dual core. For the moment, I am locking the affinity to a single core, which means the other core's L1/L2 cache is sitting almost entirely unused (well, except for that pesky OS).
My last attempt to split my app into 2 threads was based on doing the hash table additions/lookups on one core, and everything else on the other core. Unfortunately, my best effort here ended up cutting the performance in half.
It appeared the problem was due to communicating work between the two threads.
Traditional methods (WaitForSingleObject) are ill-suited to ~3,000,000 finds and ~1,000,000 adds each second. My best results came from spinning on memory locations with CmpXchg instructions, but that was obviously not good enough.
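For comparison, here is a minimal sketch of one commonly used alternative to spinning with CmpXchg on a shared location: a single-producer/single-consumer ring buffer. This is not what the app above did, and it is not AMD-specific. Each index is written by only one thread, so plain atomic loads and stores are enough and no locked read-modify-write instruction is needed. The `Request` type, the `SpscRing` name, and the 64-byte cache line size used for padding are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical work item; the real find/add payload is not shown in the post.
struct Request {
    std::uint64_t key;
    bool is_add;  // true = add, false = find
};

// Single-producer/single-consumer ring. Exactly one thread may call
// try_push and exactly one other thread may call try_pop.
template <typename T, std::size_t Capacity>
class SpscRing {
    static_assert(Capacity > 0 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Producer side. Returns false if the ring is full.
    bool try_push(const T& item) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == Capacity) return false;
        slots_[head & (Capacity - 1)] = item;
        // Release: the slot write becomes visible before the new head.
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    // Consumer side. Returns false if the ring is empty.
    bool try_pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) return false;
        out = slots_[tail & (Capacity - 1)];
        // Release: the slot read completes before the slot is handed back.
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

private:
    // Assumed 64-byte cache lines: keep the two indices on separate lines
    // so the producer and consumer do not invalidate each other's line
    // on every update.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
    alignas(64) T slots_[Capacity];
};
```

In a setup like the one described, the main thread would call `try_push` with find/add requests and the hash-table thread would loop on `try_pop`. Handing over requests in batches rather than one at a time may further reduce the cache line traffic between the two cores, though whether that is enough at several million operations per second would have to be measured.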
I did some googling at the time on the best way to do this type of communication, and my approach seemed to be typical. I may try this again with my new understanding of DTLBs and see if I can do better. Unless you know of any AMD-specific tricks or whitepapers, that type of question may be more appropriate for a general programming forum.
Any insight on the numbers above would be appreciated.