4 Replies Latest reply on Apr 25, 2011 3:44 PM by jdeguara

    Enqueue Kernel performance


      Sorry if this is a duplicate post; the posting widget crashed during my first attempt and I don't know what happened to that post.

      I was trying to duplicate the offsetCopy experiment from the NVIDIA best practices guide, and got pretty poor bandwidth numbers for a simple copy kernel:

      __kernel void offsetCopy(__global float *odata,
                               __global float *idata,
                               int offset)
      {
          int xid = get_global_id(0) + offset;
          odata[xid] = idata[xid];
      }

      I was getting about 3 GB/s running on the GPU (an RV730) in an i7 (Nehalem workstation) Fedora 11 Linux box.

      So, I timed the overhead of launching an empty kernel with no args:

      char *empty_kernel_source =
          "__kernel void empty_kernel() \
           {                            \
           }";
      And this is what I got from my test program:

      [reilly@i7 launchtest]$ ./CLFW_launchtest
      Local work group size = 32 num work items = 33554432
      Kernel execution time: trial = 0  44.750576 ms  44750576  7.4981e+08 work_items / sec
      Kernel execution time: trial = 1  44.743816 ms  44743814  7.49923e+08 work_items / sec
      Kernel execution time: trial = 2  44.743969 ms  44743966  7.49921e+08 work_items / sec
      Kernel execution time: trial = 3  44.743862 ms  44743862  7.49923e+08 work_items / sec

      About 750M work items launched per second.  If each work item moves just 4 bytes, that works out to a bandwidth of about 3 GB/s.

      I know about making each work item do more work, but am concerned that the SDK's launch performance is way out of line with OpenCL on NVIDIA and other platforms.  I'm running SDK V2.0.

      I'll upgrade to SDK V2.01 later this week, after I've got some other things out of the way.  Has anyone else seen this behavior?


      Tarball of the test code is available upon request.