danielp
Journeyman III

Multiple prefetch sample from AMD64 Opt Manual

In the Software Optimization Guide for AMD64 Processors (page 107), there is a multiple-prefetch sample where the loop starts with:

prefetchw [eax+256] ; Four cache lines ahead
prefetch [ebx+256] ; Four cache lines ahead
prefetch [ecx+256] ; Four cache lines ahead

But the values of eax, ebx and ecx don't change inside the loop. Is this an error, or is this how we are supposed to use it?
avk
Adept III

danielp: Of course, this is an error in the document. Somebody at AMD forgot to put the biased index EDX, with the appropriate scale factor and offset, inside the square brackets of the PREFETCH operands.
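
For readers following along, here is a rough C sketch of what a prefetch address that moves with the loop index looks like. It is not the manual's code: the kernel `a[i] = b[i] * c[i]`, the 64-byte line size, and the name `mul_prefetch` are assumptions made for illustration, and `__builtin_prefetch` is the GCC/Clang builtin standing in for PREFETCH/PREFETCHW.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_DOUBLES 8            /* assumed 64-byte cache line / sizeof(double) */
#define AHEAD (4 * LINE_DOUBLES)  /* four cache lines ahead = 256 bytes */

/* Hypothetical kernel: a[i] = b[i] * c[i].
 * The prefetch addresses are computed from the moving index i,
 * so they advance with the loop instead of staying at base+256. */
static void mul_prefetch(double *a, const double *b, const double *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t i = 0; i < n; i += LINE_DOUBLES) {
        /* One prefetch per array per cache line; skip it near the end
         * so no address past the arrays is ever formed. */
        if (i + AHEAD < n) {
            __builtin_prefetch(&a[i + AHEAD], 1, 3); /* rw=1: will be written */
            __builtin_prefetch(&b[i + AHEAD], 0, 3); /* rw=0: read only */
            __builtin_prefetch(&c[i + AHEAD], 0, 3);
        }

        /* Process the current cache line's worth of elements. */
        size_t end = (i + LINE_DOUBLES < n) ? i + LINE_DOUBLES : n;
        for (size_t j = i; j < end; ++j)
            a[j] = b[j] * c[j];
    }
}
```

Whether this is any faster than the loop with no prefetch at all depends on the CPU, so it has to be measured.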

Thanks for the reply.

That's what I thought. There is only one small problem. The original code (with this "error") runs faster than the one with the corrected offset.
avk
Adept III

Well, let's try to solve that small problem. Please describe exactly which prefetch parameters and which CPU you are using.

prefetch(w) [e(a|b|c)x+edx*8+ARR_SIZE+(64|128|192|256)]

amd64 3800

Timed using Agner Fog's testing framework (http://www.agner.org/optimize)
avk
Adept III

Argh... That code sample runs even faster without any software prefetch at all! I think hardware prefetch is at work in this case. The loop's access pattern is easily predicted by the hardware, so I think there is no need for software prefetch. Did you try running the code sample without the prefetch instructions?
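
A minimal way to run that with/without comparison, sketched in C. Everything here is an assumption for illustration: the kernel `a[i] = b[i] * c[i]`, the 64-byte line size, the helper names, and the use of `clock()` as a coarse stand-in for a cycle-accurate timing framework.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

#define AHEAD 32  /* doubles; 256 bytes = four 64-byte lines (assumed line size) */

typedef void (*kernel_fn)(double *, const double *, const double *, size_t);

/* Baseline: no software prefetch, the hardware prefetcher is on its own. */
static void mul_plain(double *a, const double *b, const double *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] * c[i];
}

/* Same loop with one software prefetch per array per cache line. */
static void mul_prefetch(double *a, const double *b, const double *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if ((i & 7) == 0 && i + AHEAD < n) {
            __builtin_prefetch(&a[i + AHEAD], 1, 3);
            __builtin_prefetch(&b[i + AHEAD], 0, 3);
            __builtin_prefetch(&c[i + AHEAD], 0, 3);
        }
        a[i] = b[i] * c[i];
    }
}

/* CPU seconds spent on `reps` passes of `fn` over n elements,
 * or -1.0 if the arrays could not be allocated. */
static double time_kernel(kernel_fn fn, size_t n, int reps)
{
    assert(n > 0);
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    if (!a || !b || !c) {
        free(a); free(b); free(c);
        return -1.0;
    }
    for (size_t i = 0; i < n; ++i) {
        b[i] = (double)i;
        c[i] = 0.5;
    }

    clock_t t0 = clock();
    for (int r = 0; r < reps; ++r)
        fn(a, b, c, n);
    clock_t t1 = clock();

    volatile double sink = a[n - 1];  /* keep the result observable */
    (void)sink;

    free(a); free(b); free(c);
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Call `time_kernel(mul_plain, n, reps)` and `time_kernel(mul_prefetch, n, reps)` with arrays much larger than the cache and compare the two numbers. Which one wins depends on the processor and its hardware prefetcher, so treat the result as specific to the machine it was measured on.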

Yep, it appears to be the case. It runs faster without the prefetch instructions.

The doc (AMD64 Opt Manual, page 105) has this to say:

"In some cases, using prefetch instructions on processors with hardware prefetching may slightly reduce performance. In these cases, it may be necessary to remove the prefetch instructions. All current AMD Athlon 64 and AMD Opteron processors have hardware prefetching mechanisms."

Another mystery explained.

Thanks
avk
Adept III

Well, it seems that AMD documents #25112 and #40546 give an unsuccessful example of using software prefetch. Nevertheless, a good result is achievable. All we need is some experimentation and some time (and money) to do it.