Is there a good way to know a priori a good size for the workgroup? Is bigger always better?
In CUDA, we have an Excel spreadsheet (the occupancy calculator) where you can see the occupancy of the multiprocessors and the shared memory usage. Currently, I have to test my kernel with several values (32, 64, 128, 256, 512) for the workgroup size and choose the one that runs fastest.
Do you have a tool that shows how many cycles, memory stalls, SIMD branch divergences, cache hits/misses, etc. a specific kernel incurs? That would be useful too.
It would also be useful if the documentation described how memory is cached, the cache sizes, bank conflicts, etc., with diagrams like those in the CUDA SDK.