

Enqueue Kernel performance


Sorry if this is a duplicate post; the posting widget crashed during my first attempt and I don't know what happened to the post.

I was trying to duplicate the offsetCopy experiment from the nVidia best practices guide. The measured bandwidth for this simple copy kernel was pretty poor:

__kernel void offsetCopy(__global float *odata,
                         __global float *idata,
                         int offset)
{
    int xid = get_global_id(0) + offset;
    odata[xid] = idata[xid];
}

I was getting about 3 GB/s running on the GPU, an RV730 in an i7 (Nehalem ws) Fedora 11 Linux box.

So, I timed the overhead for firing up an empty kernel with no args.

char * empty_kernel_source =
"__kernel void empty_kernel() \
{ \
}";

And this is what I got from my test program:

[reilly@i7 launchtest]$ ./CLFW_launchtest
Local work group size = 32 num work items = 33554432
Kernel execution time: trial = 0  44.750576 ms  44750576  7.4981e+08 work_items / sec
Kernel execution time: trial = 1  44.743816 ms  44743814  7.49923e+08 work_items / sec
Kernel execution time: trial = 2  44.743969 ms  44743966  7.49921e+08 work_items / sec
Kernel execution time: trial = 3  44.743862 ms  44743862  7.49923e+08 work_items / sec

That is about 750M work items launched per second. If each work item moves just 4 bytes, the bandwidth tops out at 750M × 4 bytes = 3 GB/s.
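For anyone who wants to redo that arithmetic with their own timings, here is a minimal sketch in plain C. The function names are made up for illustration, and it assumes the integer column in the output above is the kernel time in nanoseconds.

```c
/* Hypothetical helpers, not part of the test program above. */

/* Work items launched per second, given the launch size and the
   measured kernel time in nanoseconds (assumed unit). */
double work_items_per_sec(unsigned long long work_items,
                          unsigned long long elapsed_ns)
{
    return (double)work_items / ((double)elapsed_ns * 1e-9);
}

/* Bandwidth in GB/s (1 GB = 1e9 bytes) if every work item
   moves bytes_per_item bytes. */
double bandwidth_gb_per_sec(double items_per_sec, double bytes_per_item)
{
    return items_per_sec * bytes_per_item / 1e9;
}
```

With 33554432 work items and 44750576 ns, this gives roughly 7.5e8 work items per second and about 3 GB/s at 4 bytes per item. Passing 8 for bytes_per_item would count both the read and the write of each float.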

I know I could make each work item do more work, but I am concerned that the SDK launch performance is way out of line with OpenCL on nVidia and other platforms. I'm running SDK V2.0.
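To show what "making the work items larger" could look like, here is a rough sketch, not something I have timed. The kernel name chunkedCopy, the per_item parameter, and the helper chunked_global_size are all hypothetical. Each work item copies per_item elements, strided by the global size so that neighbouring work items still touch neighbouring addresses, and the helper rounds the global size up to a multiple of the local size.

```c
#include <stddef.h>

/* Hypothetical kernel source: each work item copies per_item floats. */
const char *chunked_copy_source =
    "__kernel void chunkedCopy(__global float *odata,       \n"
    "                          __global const float *idata, \n"
    "                          int n, int per_item)         \n"
    "{                                                      \n"
    "    int stride = get_global_size(0);                   \n"
    "    int gid = get_global_id(0);                        \n"
    "    for (int k = 0; k < per_item; ++k) {               \n"
    "        int i = gid + k * stride;                      \n"
    "        if (i < n) odata[i] = idata[i];                \n"
    "    }                                                  \n"
    "}                                                      \n";

/* Number of work items needed to cover n elements when each one
   handles per_item elements, rounded up to a multiple of local_size. */
size_t chunked_global_size(size_t n, size_t per_item, size_t local_size)
{
    size_t items = (n + per_item - 1) / per_item;
    return ((items + local_size - 1) / local_size) * local_size;
}
```

For the 33554432-element case with a local size of 32, using 16 elements per work item cuts the launch to 2097152 work items. Whether that actually helps on a given board is something only a measurement can tell.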

I'll upgrade to SDK V2.01 later this week, after I've got some other things out of the way.  Has anyone else seen this behavior? 


Tarball of the test code is available upon request.



4 Replies

What results do you get with offset 0?


I don't remember the results for 0 offset, but they were identical for all offsets because the kernel launch time swamps all other effects. Note that even with a kernel that does almost no work, we are moving 4 bytes per work item at best, which gives 3 GB/s. And that's what I measured for all offsets.

The problem here is the launch rate on my 4000 series board. The HD 5870 that I just installed is much, much faster: more than 5e9 work items launched per second.


Could you try the profiler to get the kernel time?


Is this still an issue? The original post reported a kernel execution time of 44 ms, but in my testing I am getting 3 ms. Is there a golden number where fast is fast enough?