Enqueue Kernel performance

Discussion created by kb1vc on Feb 16, 2010
Latest reply on Apr 25, 2011 by jdeguara

Sorry if this is a duplicate post; the posting widget crashed during my first attempt and I don't know what happened to that post.

I was trying to duplicate the offsetCopy experiment from the NVIDIA best practices guide. The result was a pretty poor bandwidth report for a simple copy kernel:

__kernel void offsetCopy(__global float *odata,
                         __global float *idata,
                         int offset)
{
    int xid = get_global_id(0) + offset;
    odata[xid] = idata[xid];
}

I was getting about 3 GB/s on my RV730 in an i7 (Nehalem workstation) Fedora 11 Linux box, running on the GPU.

So, I timed the overhead for firing up an empty kernel with no args.

char *empty_kernel_source =
    "__kernel void empty_kernel() \
     { \
     }";
And this is what I got from my test program:

[reilly@i7 launchtest]$ ./CLFW_launchtest
Local work group size = 32 num work items = 33554432
Kernel execution time: trial = 0  44.750576 ms  44750576  7.4981e+08 work_items / sec
Kernel execution time: trial = 1  44.743816 ms  44743814  7.49923e+08 work_items / sec
Kernel execution time: trial = 2  44.743969 ms  44743966  7.49921e+08 work_items / sec
Kernel execution time: trial = 3  44.743862 ms  44743862  7.49923e+08 work_items / sec

That's about 750M work items launched per second. If each work item moves just 4 bytes, that works out to a bandwidth of 3 GB/s.

I know about making each work item do more work, but I'm concerned that the SDK's launch performance is way out of line with OpenCL on NVIDIA and other platforms. I'm running SDK v2.0.

I'll upgrade to SDK v2.01 later this week, after I've got some other things out of the way. Has anyone else seen this behavior?


Tarball of the test code is available upon request.