Archives Discussions

LGS · ‎04-10-2008

I am trying to improve the performance of my app. Looking with CodeAnalyst, I see that the biggest time used in my app comes in a hash table lookup. This hash table is relatively large (1.7 gig on a 2 gig system), so it is not surprising that much time would be spent here.

What is surprising is what happened when I added a prefetchnta call in an attempt to ensure the correct hash node would be available when needed. While the "cmp reg/mem" instruction for checking the hash node now takes much less time, most of that time is now showing up in the prefetchnta instruction.

Isn't prefetchnta essentially an asynchronous operation? I expected that the prefetch would continure to run while I performed other operations, not block until the memory is retrieved. But CodeAnalyst shows a huge hunk of time being spent there.

I'm on an AMD Turion64 x2.

pdrongowski · ‎04-11-2008

Hi LGS --

That indeed is quite a large hash table! If the access pattern into the hash table is random, then the program would need to start access to the next entry (via prefetch) while working on the current entry. The amount of benefit from prefetch depends upon the amount of useful overlap between the prefetch and work to be done. With a hash table that size, it's possible that the program is adversely affected by DTLB misses -- there are only so many pages that a program can touch before a page entry must be loaded into the DTLBs. This penalty can be substantial. Page faults could also be an issue.

Due to a sampling phenomenon called skid, time- and event-based samples do not fall on the instruction that actually causes the delay or event. The samples usually fall within the same neighborhood (code region.) With parallel, out-of-order execution, this neighborhood can be as big as 72 instructions.

Hope this helps you to make more progress -- pj

LGS · ‎04-12-2008

Hey pj, thanks for the response.

I have been busy moving as much work as I can in between the prefetch and the access of the hash table entry. It has been helping, but I'm running out of things to move. I have already put the prefetch call at the earliest moment when I know what entry I'm going to need, and proceeding with the next operation isn't really viable here. If there is a hash hit (which there is ~73% of the time), that drastically changes what happens next.

I don't believe page faults are the issue. Using task manager shows (essentailly) zero page faults from my app. I suppose it could be DTLBs, but since I have no idea what that is, I'm going to have to do some research before I can comment on it. Always something new to learn.

As for skid causing CA to misreport, yes, I understand. It was a bit confusing at first to see huge numbers reported for register-to-register moves, but I realized it had to be something like that (although I didn't know the term).

However, I don't believe that is the problem here. Without using prefetch I get:

Address Line Trace Source Code Bytes Timer samples
0x140001871 cmp r9,[rdx+rcx*8] 4C 3B 0C CA
0x140001875 jnz $+0000027ch (0x140001af1) 0F 85 76 02 00 00 173325

That cmp is checking to see if the hash table entry is REALLY the value I want. With using prefetch I get:

Address Line Trace Source Code Bytes Timer samples
0x14000154a prefetchnta [rdi+rdx*8] 0F 18 04 D7 1029
0x14000154e mov al,sil 40 8A C6 93921

Address Line Trace Source Code Bytes Timer samples
0x140001636 cmp r9,[rdx+rcx*8] 4C 3B 0C CA
0x14000163a jnz $+00000294h (0x1400018ce) 0F 85 8E 02 00 00 74542

This is out of a run with 305,996 timer samples. When you are dealing with numbers that are such a significant percentage of the overall total, they tend to stand out a bit. The 93,921 (or anything like it) wasn't there until I added the prefetch call. My next largest was on the order of 10,000. I'll give this more thought (maybe it's something about calculating the hash table entry?), but I'm not convinced.

Do you have any other insight you can offer into how prefetch works? Based on my limited understanding, even if there WERE page faults being caused, that should cause delays on the cmp instruction, not the prefetch. prefetch merely initiates the retrieval of the cache line. Or am I misunderstanding how all this is supposed to go?

LGS · ‎04-14-2008

Ok, I've done some reading on DTLBs. While I always knew there had to be some mapping between address spaces and physical RAM, I guess I assumed it was done thru some deep, dark secret inside the processor itself. Learned something new today.

So, presumably 1.8 gig of memory is going to have a LOT of DTLBs (1 per 4k page?). Enough to challenge my L1/L2 even without my millions of hash table entries pushing them out. Could this be the source of the delays on the prefetchnta instruction? Maybe Prefetch is async *after* it reads the DTLB to discover which (physical) cache line to retrieve. And since the DTLB frequently needs to be retrieved from RAM, there's a measurable delay on the prefetch instruction. Does that sound right?

The only way I found to shrink the size of the DTLB table is to use what Windows calls LargePages. However, the largest chunk I could allocate with VirtualAlloc(MEM_LARGE_PAGES) was only ~58 meg (apparently due to fragmentation?). And I could only do 1 allocation. Since I need ~1.8 GIG, this seems like a non-starter.

This leaves me with 2 things I can think to try:

1) Of the 19 bytes in my hash table entry, 1 of them gets written to during each hash read (whether it is a hit or a miss). I'm considering moving that to a separate 95M allocation (1 byte per hash table entry) and doing a separate PREFETCHW to read that data. I have some vague idea that this approach might produce fewer dirty cache lines. On the other hand, if my theory about what's causing the prefetch delays is correct, this will probably actually make things worse.

2) Most of my hash table entries get completely replaced pretty frequently. As part of replacing them, I suspect the new entry is going to end up passing (pointlessly) thru the L2 cache (pushing out more valuable data). My experiments to address this with MOVNTQ haven't yet been successful. It seems the write time involved in going straight to memory is so huge, it dwarves any benefit from a reduction in cache thrashing. Is there a better way to approach this?

Sometimes during optimization, you reach a point where the only way to continue to improve is with a re-design. Since ~50% of my time is being spent on these two instructions, I may already be there. So I guess what I'm looking for at this point is:

a) Is my understand of the bottleneck I've run into (essentially) correct?
b) Are there any other obvious things I can try to improve cache performance?

LGS · ‎04-15-2008

I'm using 64 bit vista ultimate.

Thanks for the link to 7max. I'm looking at it, but since the last release date was in 2005, I'm guessing they haven't done much vista testing. Also, while I haven't delved deeply into the code, it looks like they may just be hooking memory allocation calls and replacing them with VirtualAlloc(MEM_LARGE_PAGES). As I mentioned, that call doesn't seem to help much on my system. I've posted a message in their forum (I'm the first person to do so in 2008). We'll see what happens.

Optimizing this has been a learning experience, that's for sure. Thanks for your help.

avk · ‎04-15-2008

LGS: I suppose that you are using Win64? If yes, did you try 7-max utility? It can be useful for your needs (I hope), because it can provide a large memory pages for your application.

About 2 things you can think to try:
1) If I was you, I would try to do that

.
2) The MOVNT* instructions are only good in special cases which are described in the AMD Optimization Manual (I mean 'Block Prefetch Technique').

Well, I think that it is very difficult to optimize a huge lookup table with random access. Good luck

!

pdrongowski · ‎04-15-2008

Hi LGS --

Your analysis seems valid. A prefetch operation out to memory certainly requires the physical address which requires DTLB access. The Software Optimization Guide recommends prefetch for loops that stride through arrays. Caches and TLBs favor
programs with regular stride, sequential access. Is there a way to change your overall algorithm to eliminate the hash table
approach? (I acknowledge that would probably be a major rewrite of your app! 😞

One way to determine if DTLB misses are a major issue is to use event-based profiling and measure the DTLB misses. The
"Investigation data access" profiling configuration measures the relevant events. The two DTLB miss events are:
L1 DTLB miss and L2 DTLB hit
L1 DTLB miss and L2 DTLB miss
There are two events since the data TLB is organized into two levels. I suspect that due to the huge size of the hash table,
there are a large number of the second kind of DTLB miss which misses both levels. These are quite expensive since the
Page Table Walker is invoked to retrieve the page mapping information from L2 cache or memory -- probably memory in
this case.

Interesting problem!

-- pj

P.S. Using large pages may help a little, but 1.7 GB is a huge footprint... Can
hash table / data be partitioned in some way?

LGS · ‎04-16-2008

The key bit I needed to understand here seems to be the fact that the address space->physical lookup table is stored in memory, and is subject to the same L1/L2/RAM restraints as any other type of memory allocation. "Misses" here will affect the performance of prefetchnta. How to resolve it is going to take some thought.

I've looked at the data access numbers before. The problem is knowing whether a given number is "good" or "bad." What constitutes a "large" number?

Over a 300 second run (after a 1 minute warmup), I'm seeing:

DC Accesses: 326,574
DC Misses: 34,866
DTLB L1M L2H: 7,489
DTLB L1M L2M: 22,446
Misalign accesses:200,655
Ret Inst: 1,532,894
DC Refills L2/NB: 34,843

Does this confirm our theory?

While removing the hash table is possible (actually a single #define), the overall thruput of the app drops drastically without it. As much trouble as it causes, having it is better than not having it.

Partitioning it may prove more practical. The processor in this box is dual core. For the moment, I am locking the affinity to a single core, which means I've got an entire L1/L2 cache almost entirely unused (well, except for that pesky OS).

My last attempt to split my app into 2 threads was based on doing the hash table additions/lookups on one core, and everything else on the other core. Unfortunately, my best effort here ended up cutting the performance in half.

It appeared the problem was due to communicating work between the two threads.

Traditional methods (WaitForSingleObject) are ill-suited to ~3,000,000 finds and ~1,000,000 adds each second. My best results came from spinning on memory locations with CmpXchg instructions. But obviously not good enough.

I did some googling at the time on the best way to do this type of communication, and my approach seemed to be typical. I may try this again with my new understanding of DTLBs and see if I can do better. Unless you know of any AMD-specific tricks or whitepapers, that type of question may be more appropriate for a general programming forum.

Any insight on the numbers above would be appreciated.

Archives Discussions

Another prefetch question