Change in work per workitem doesn't change amount of fetches/writes

Discussion created by Raistmer on Sep 22, 2010
Latest reply on Oct 4, 2010 by himanshu.gautam
How can this be?

I have a big, long kernel with a rather small execution domain.
When I unroll some of the inner loops and use more workitems with a smaller load on each, performance increases greatly.
But each workitem requires its own memory storage space, and a single buffer is limited in size to 128MB. So I tried to use 2 temporary buffers of 128MB each to increase the unroll factor.
But in this case performance did not increase; instead it decreased hugely.
Looking at the profiler data, I see that the number of read/write instructions per workitem remained almost the same as before (when the number of workitems was 2 times lower and each performed 2 times more work).
I use the code listed below to alternate between the memory buffers. But it looks like each thread goes down both paths. Can that be? Why doesn't the number of fetch and write instructions drop by a factor of 2?

EDIT: I use the third dimension of the execution domain for the unroll. It was 32 with a single temp buffer and 64 with 2 temp buffers.

.........
int z = get_global_id(2);
int TOffset2 = (32*y + z) * AdvanceBy;
if (z >= 32) {
    TOffset2 = (32*y + (z-32)) * AdvanceBy;
}
..........
__global float4* tmp_pot = tmp_PoT + ul_PoT + TOffset2 * (fft_len4);
if (z >= 32) {
    //R: other half will use secondary buffer. It's because of 128MB limit per buffer in current OpenCL ATi implementation
    tmp_pot = tmp2 + ul_PoT + TOffset2 * (fft_len4);
}
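
To make the index mapping in the fragment explicit, here is a minimal host-side sketch in plain C of the same arithmetic. The names `slot_t` and `select_slot` are hypothetical and not part of the original kernel; the sketch assumes 0 <= z < 64, as stated in the EDIT. It only shows which buffer and which `TOffset2` value each `z` maps to, and says nothing about how the compiler or hardware handles the two `if (z >= 32)` branches.

```c
/* Hypothetical helper (not from the original kernel): mirrors the
   buffer/offset selection of the fragment so it can be checked on the host. */
typedef struct {
    int buffer;  /* 0 = primary buffer (tmp_PoT), 1 = secondary buffer (tmp2) */
    int offset;  /* same value the kernel computes as TOffset2 */
} slot_t;

/* Assumes 0 <= z < 64: the lower half of z uses the primary buffer,
   the upper half reuses the same offsets in the secondary buffer. */
static slot_t select_slot(int y, int z, int advance_by)
{
    slot_t s;
    int local_z = (z >= 32) ? (z - 32) : z;  /* position within its own buffer */
    s.buffer = (z >= 32) ? 1 : 0;
    s.offset = (32 * y + local_z) * advance_by;
    return s;
}
```

For example, with `y = 1` and `advance_by = 4`, both `z = 5` and `z = 37` give offset 148; the first lands in the primary buffer and the second in the secondary one.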