4 Replies Latest reply on Jan 6, 2010 5:44 AM by genaganna

    OpenCL optimal work group size

    bubu

      Is there a good way to know a priori what a good workgroup size is? Is bigger always better?

      In CUDA, we have an Excel spreadsheet that shows the occupancy of the multiprocessors and the shared memory usage. Currently, I have to test my kernel with several values (32, 64, 128, 256, 512) for the workgroup size and choose the one that runs fastest.

       

      Do you have a tool that shows how many cycles, memory locks, SIMD split-branching, cache usage, etc. a specific kernel incurs? That would be useful too.

       

      It would also be useful if the documentation explained how memory is cached, the cache sizes, bank conflicts, etc., as is done in the CUDA SDK (visually, graphically).

       

      thx

        • OpenCL optimal work group size
          genaganna

          Bubu,

                  CLProfiler has been released. It is supported on Windows only. See http://developer.amd.com/gpu/ATIStreamSDK/assets/ATI_Stream_SDK_Performance_Notes.pdf for more details.

           

                 There is currently no static tool for finding the optimal work group size.

                 For now, you can do the following:

                      1. Get workGroupSize from clGetDeviceInfo with CL_DEVICE_MAX_WORK_GROUP_SIZE

                      2. Get kernelWorkGroupSize from clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE

                      3. Take the minimum of the two values and use it as your optimal workGroupSize
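                 The three steps above can be sketched as follows. This is only a minimal Python illustration, not tied to any OpenCL binding: the two limits are passed in as plain integers, which in a real host program would come from clGetDeviceInfo (CL_DEVICE_MAX_WORK_GROUP_SIZE) and clGetKernelWorkGroupInfo (CL_KERNEL_WORK_GROUP_SIZE). The function name and the sample values are hypothetical.

```python
def pick_work_group_size(device_max: int, kernel_max: int) -> int:
    """Return the largest work group size allowed by both the device and the kernel."""
    if device_max < 1 or kernel_max < 1:
        raise ValueError("work group limits must be positive")
    return min(device_max, kernel_max)


# Hypothetical values; query the real ones from the OpenCL runtime.
device_max = 1024  # stands in for CL_DEVICE_MAX_WORK_GROUP_SIZE
kernel_max = 256   # stands in for CL_KERNEL_WORK_GROUP_SIZE

local_size = pick_work_group_size(device_max, kernel_max)
print(local_size)
```

                 The result is the largest size that is valid for this kernel on this device, so it can also serve as the upper end of a range of sizes to benchmark.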

           

           

           

            • OpenCL optimal work group size
              davibu

               

              Originally posted by: genaganna

                      CLProfiler has been released. It is supported on Windows only. See http://developer.amd.com/gpu/ATIStreamSDK/assets/ATI_Stream_SDK_Performance_Notes.pdf for more details.

               

              Can I ask whether it is supposed to work with Visual C++ Express Edition (the freely available one) as well? I'm a Linux user, and I installed Visual C++ for the sole purpose of trying the ATI profiler, but it wasn't recognized by my Express Edition installation.

               

              Originally posted by: genaganna

                    There is currently no static tool for finding the optimal work group size.

                    For now, you can do the following:

                          1. Get workGroupSize from clGetDeviceInfo with CL_DEVICE_MAX_WORK_GROUP_SIZE

                          2. Get kernelWorkGroupSize from clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE

                          3. Take the minimum of the two values and use it as your optimal workGroupSize

              It is a good procedure, but hand tuning and testing still give the best performance. For instance, in a small test I did (higher numbers are better):

              Workgroup size 8 => 890K samples/sec
              Workgroup size 16 => 1719K samples/sec
              Workgroup size 32 => 3373K samples/sec
              Workgroup size 64 => 6486K samples/sec (<= best result)
              Workgroup size 128 => 5515K samples/sec
              Workgroup size 256 => 5436K samples/sec (size suggested by clGetKernelWorkGroupInfo)
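
              A sweep like the one above can be automated. This is a minimal sketch, assuming a caller-supplied measure callable (hypothetical) that runs the kernel with a given workgroup size and returns its throughput, higher being better. Here a lookup in the figures reported above stands in for a real benchmark run.

```python
from typing import Callable, Iterable, Tuple


def best_work_group_size(
    candidates: Iterable[int],
    measure: Callable[[int], float],
) -> Tuple[int, float]:
    """Call measure() once per candidate and return (size, throughput) of the fastest."""
    results = [(size, measure(size)) for size in candidates]
    if not results:
        raise ValueError("no candidate sizes given")
    return max(results, key=lambda r: r[1])


# Stand-in for a real benchmark: throughput in K samples/sec from the test above.
reported = {8: 890, 16: 1719, 32: 3373, 64: 6486, 128: 5515, 256: 5436}

size, rate = best_work_group_size(reported.keys(), reported.__getitem__)
print(size, rate)
```

              In a real run, measure would enqueue the kernel with the given local size and time it, ideally averaging over several launches to smooth out noise.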

              P.S. As a side note, in my case the NVIDIA OpenCL driver can return a sub-optimal value from clGetKernelWorkGroupInfo(), leading to really bad performance on their hardware.