
Reduced cache hit rate when I put a piece of code under a loop!!

Hi,

Card:     7970
Catalyst: 13.4
APP SDK:  2.8
OS:       Kubuntu 12.04 x64

Code snippet:

//for (i = 0; i < 25; i++)
    encrypt();

When I comment out the loop, the cache hit rate (measured with CodeXL 1.1) is 99%. But as soon as I uncomment it, the cache hit rate drops to 23% and the kernel execution time increases by 50 times, when it should only increase by about 25 times. The function encrypt() is too large to fit into the instruction cache, yet with no loop the cache hit rate is still 99%. As soon as I use more than one iteration, the cache hit rate drops to 23% and the performance penalty is 2x, where x is the number of iterations.
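For reference, the overall structure being profiled looks roughly like this. The names, block size, and the stand-in body of encrypt() below are placeholders for illustration only; the real function is far larger.

/* Illustrative sketch only: the real encrypt() is much bigger than this. */
void encrypt(uchar out[16], __global const uchar *in)
{
    for (int j = 0; j < 16; j++)   /* stand-in body; the real function does the full cipher */
        out[j] = in[j];
}

__kernel void run(__global const uchar *in, __global uchar *out)
{
    uchar buf[16];
    __global const uchar *blk = in + get_global_id(0) * 16;

    //for (int i = 0; i < 25; i++)   /* commented out: 99% cache hit      */
        encrypt(buf, blk);           /* uncommented: 23% hit, ~2*25x slower */

    for (int j = 0; j < 16; j++)
        out[get_global_id(0) * 16 + j] = buf[j];
}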

Regards,

Sayantan

0 Likes
8 Replies
himanshu_gautam
Grandmaster

Called once, the function encrypt() might simply get inlined into the kernel. Several optimizations may then reduce the number of variables needed, resulting in high performance. A big function called over multiple iterations is unlikely to be inlined, which would require a lot of variable fetching and stack management.

Anyway, it is interesting, and I will seek some expert advice.

Can you check the performance once again using the "-cl-opt-disable" flag when compiling the kernel?
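A minimal sketch of passing the flag on the host side; 'program' and 'device' are placeholders for objects created earlier with clCreateProgramWithSource() and clGetDeviceIDs(), and <stdio.h>/<stdlib.h>/<CL/cl.h> are assumed to be included:

cl_int err = clBuildProgram(program, 1, &device, "-cl-opt-disable", NULL, NULL);
if (err != CL_SUCCESS) {
    /* dump the build log if compilation fails */
    size_t log_size;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    char *log = (char *)malloc(log_size);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
    printf("build log:\n%s\n", log);
    free(log);
}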

"-cl-opt-disable" flag does improve the performance around 5-6% when the function is looped but definitely not enough to eliminate 50% performance loss due to cache hit.

Thanks for the reply.

Regards,

Sayantan

0 Likes

Can you improve the code a bit? I was getting a lot of errors because of the goto statement. Also, you seem to be running the kernel only once. For profiling purposes, run it over, say, 100 iterations and average the results.
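Something along these lines on the host side, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE; 'queue', 'kernel', 'gsize', and 'lsize' are placeholders from your existing setup:

const int RUNS = 100;
cl_ulong total_ns = 0;

for (int r = 0; r < RUNS; r++) {
    cl_event evt;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, &lsize, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong start, end;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
    total_ns += end - start;          /* accumulate per-launch GPU time */
    clReleaseEvent(evt);
}

printf("average kernel time: %.3f ms\n", (total_ns / (double)RUNS) * 1e-6);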

0 Likes

Generally, goto statements produce only warnings, which are mostly harmless. Besides, the kernel I attached doesn't have any goto statement. Also, the compiler seems to auto-inline all the functions, which I really don't want. Is there any way to reduce the code length (ISA length)?
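For example, would something like the Clang/GCC noinline attribute be honored here, or is it ignored? This is just a guess on my part; I don't know whether the OpenCL compiler in Catalyst 13.4 supports it. The signature below is a placeholder:

/* Guess only: noinline is a Clang/GCC extension; unclear whether this compiler honors it. */
void __attribute__((noinline)) encrypt(uchar out[16], __global const uchar *in);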

Regards,

Sayantan

0 Likes

goto is not that big a problem, but the number of iterations is. Can you report your results after running the kernel multiple times? The cache-hit counter might be buggy (and in that case the issue should go to the CodeXL team), but we need to make sure that performance is indeed getting worse. In that case, it becomes an OpenCL compiler/runtime issue.

0 Likes

Hi,

CodeXL seems to be reporting correctly, because when the cache hit rate drops it is accompanied by an increase in fetch size and memory unit busy, which can only be explained by increased cache misses. I also ran the kernel 10 times inside a loop on the host side, and the performance counters were almost identical for each kernel call. This looks like a compiler problem to me.

Regards,

Sayantan

0 Likes

This code is not going to run well because it’s too large for the instruction cache. Even the code without the loop takes 35072 bytes, which is too large. Combine that with the fact that we can only get 4 waves per CU, due to the VGPR usage, and we can’t hide the latency of the I$ fetches. Perhaps with the user’s particular driver the code without the loop fits in the I$, but with the driver I am testing, both kernels are too large for the I$.

The developer should also be aware that, as far as I can see, adding the loop does nothing to the algorithm, since the first thing done in encrypt() is to set out[] to in[], which undoes all the previous computations. The compiler can’t see this.
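To sketch the point (the names below are placeholders, not the user's actual code):

/* Each iteration begins by overwriting out[] with in[], so the work of the
   previous iteration is discarded and only the final call's result survives. */
for (int i = 0; i < 25; i++) {
    for (int j = 0; j < 16; j++)
        out[j] = in[j];      /* first statement inside encrypt() */
    rounds(out);             /* placeholder for the rest of encrypt() */
}
/* For the iterations to build on each other, later calls would have to read
   the previous output instead of in[]. */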


Courtesy: Jeff Golds

0 Likes

Yes, I'm aware that the loop does nothing. But this was just a demo kernel; in the actual scenario there will be dependencies between the iterations.

Regards,

Sayantan

0 Likes