Raistmer

Why is a kernel without a scatter stream slower than a kernel based on a scatter stream?

Discussion created by Raistmer on Jul 20, 2009
Latest reply on Aug 10, 2009 by Raistmer
kernels listed - some comments, please

In an attempt to speed up my Brook+ code, which is currently too slow, I am trying different variants of kernels that perform the same task.

There was a comment in another thread that performance degrades badly when a scatter stream with more than 8192 elements is used.
So I rewrote the kernel to avoid using a scatter array at all.
Unfortunately, this change seems to have only increased the app's CPU time (and elapsed time too).

What is wrong?

kernel void GPU_coadd_kernel54(float4 src[][], int size[], out float4 dest<>)
{
    int threadID = instance().y;
    int i = instance().x;
    float4 o;
    float4 i2;
    float4 i21;
    int ln = (size[threadID] + 3) >> 2;
    if (i >= ln) return; //R:thread unneeded
    //for(;i<ln;i++){
    i2 = src[threadID][2*i];
    i21 = src[threadID][2*i+1];
    o.x = i2.x + i2.y;
    o.y = i2.z + i2.w;
    o.z = i21.x + i21.y;
    o.w = i21.z + i21.w;
    dest = o;
    //}
}

kernel void GPU_coadd_kernel4(float4 src[][], int size[], out float4 dest[][])
{
    int threadID = instance().y;
    int i = 0;
    float4 o;
    float4 i2;
    float4 i21;
    int ln = (size[threadID] + 3) >> 2;
    for (; i < ln; i++) {
        i2 = src[threadID][2*i];
        i21 = src[threadID][2*i+1];
        o.x = i2.x + i2.y;
        o.y = i2.z + i2.w;
        o.z = i21.x + i21.y;
        o.w = i21.z + i21.w;
        dest[threadID][i] = o;
    }
}
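
For anyone who wants to check that the two kernels above really compute the same thing, here is a rough CPU sketch in plain C (not Brook+). The float4 struct, the function names, and the flat row-major layout with src_width and dest_width are my assumptions for illustration and are not part of the original code. It also assumes that a thread which returns early leaves its output element untouched, which may not match what the GPU does. It does not explain the timing difference by itself, but it makes the thread counts of the two variants explicit.

```c
#include <assert.h>

/* Stand-in for the Brook+ float4 type (assumption for this CPU sketch). */
typedef struct { float x, y, z, w; } float4;

/* Work done per output element in both kernels: pairwise sums. */
static float4 coadd_pair(float4 a, float4 b)
{
    float4 o;
    o.x = a.x + a.y;
    o.y = a.z + a.w;
    o.z = b.x + b.y;
    o.w = b.z + b.w;
    return o;
}

/* Emulates GPU_coadd_kernel4: one thread per row, loop inside the thread,
   indexed (scatter) write. Returns the number of emulated threads. */
int coadd_scatter(const float4 *src, const int *size, float4 *dest,
                  int rows, int src_width, int dest_width)
{
    int row, i;
    assert(rows >= 0 && src_width >= 0 && dest_width >= 0);
    for (row = 0; row < rows; row++) {
        int ln = (size[row] + 3) >> 2;
        for (i = 0; i < ln; i++)
            dest[row * dest_width + i] =
                coadd_pair(src[row * src_width + 2 * i],
                           src[row * src_width + 2 * i + 1]);
    }
    return rows;
}

/* Emulates GPU_coadd_kernel54: one thread per output element; threads with
   x >= ln exit early. Returns the number of emulated threads. */
int coadd_per_element(const float4 *src, const int *size, float4 *dest,
                      int rows, int src_width, int dest_width)
{
    int x, y;
    assert(rows >= 0 && src_width >= 0 && dest_width >= 0);
    for (y = 0; y < rows; y++) {
        int ln = (size[y] + 3) >> 2;
        for (x = 0; x < dest_width; x++) {
            if (x >= ln)
                continue; /* thread unneeded; output left untouched here */
            dest[y * dest_width + x] =
                coadd_pair(src[y * src_width + 2 * x],
                           src[y * src_width + 2 * x + 1]);
        }
    }
    return rows * dest_width;
}

/* Runs both variants on the same small input; returns 1 if outputs match. */
int coadd_variants_agree(void)
{
    enum { ROWS = 2, SRC_W = 8, DEST_W = 4 };
    float4 src[ROWS * SRC_W];
    float4 a[ROWS * DEST_W] = {{0}};
    float4 b[ROWS * DEST_W] = {{0}};
    int size[ROWS] = {16, 5};
    int k;

    for (k = 0; k < ROWS * SRC_W; k++) {
        src[k].x = (float)k;
        src[k].y = (float)k + 0.25f;
        src[k].z = (float)k + 0.5f;
        src[k].w = (float)k + 0.75f;
    }
    coadd_scatter(src, size, a, ROWS, SRC_W, DEST_W);
    coadd_per_element(src, size, b, ROWS, SRC_W, DEST_W);
    for (k = 0; k < ROWS * DEST_W; k++) {
        if (a[k].x != b[k].x || a[k].y != b[k].y ||
            a[k].z != b[k].z || a[k].w != b[k].w)
            return 0;
    }
    return 1;
}
```

The return values show the difference in launch shape: the scatter variant runs one thread per row, while the per-element variant runs rows * dest_width threads, some of which only evaluate the bounds check and exit.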
