I have been trying to optimize a few OpenCL kernels I am developing. The kernels have been running well up until now, and I keep a CPU implementation of each on the side so that I can compare results. I have queried my device, an AMD R9 290, for its maximum work-group size and divided up the work accordingly. I have also made my global work size a power of two. I am using the C++ OpenCL API and building the kernels with make_kernel(). The optimizations I made were solely meant to take advantage of the wavefront size, dividing the work so that as many work-items as possible are active within a work-group. When I set up my NDRange with the following API call:
cl::EnqueueArgs range_sliver_args = cl::EnqueueArgs( queue, cl::NullRange, cl::NDRange( dimensionX, dimensionY ), cl::NDRange( dimensionX / 16, dimensionX / 16 ) );
the code fails to produce any results. Note that dimensionX and dimensionY are both powers of two. The first argument is the command queue, the second the NDRange offset, the third the global NDRange, and the fourth the local NDRange. If I set the local NDRange to cl::NullRange, the kernel executes perfectly and produces the correct results. However, I would like to be able to adjust the local NDRange to test how the kernel performs. My kernel implementation does not depend on local IDs at all, so adjusting the local range is purely for performance testing. If anyone has any idea what the problem could be, I would appreciate any input.
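For reference, here is a minimal sketch of the launch constraints as I understand them from the OpenCL 1.x specification: each global dimension must be evenly divisible by the matching local dimension, and the product of the local dimensions must not exceed the value reported for CL_DEVICE_MAX_WORK_GROUP_SIZE. When either rule is broken, the enqueue should fail with CL_INVALID_WORK_GROUP_SIZE. The helper name localSizeIsValid and its parameters are hypothetical and not part of the OpenCL API, and the sketch leaves out the per-dimension limits from CL_DEVICE_MAX_WORK_ITEM_SIZES as well as any per-kernel limit.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper (not part of the OpenCL API): returns true when a 2D
// local size satisfies the two OpenCL 1.x launch rules described above.
// maxWorkGroupSize is assumed to be the value queried from the device via
// CL_DEVICE_MAX_WORK_GROUP_SIZE.
bool localSizeIsValid(const std::array<std::size_t, 2>& global,
                      const std::array<std::size_t, 2>& local,
                      std::size_t maxWorkGroupSize)
{
    std::size_t workItemsPerGroup = 1;
    for (std::size_t d = 0; d < 2; ++d) {
        // A zero-sized local dimension is never a valid launch.
        if (local[d] == 0) {
            return false;
        }
        // Rule 1: the global size must be a multiple of the local size.
        if (global[d] % local[d] != 0) {
            return false;
        }
        workItemsPerGroup *= local[d];
    }
    // Rule 2: the total work-items per group must fit the device limit.
    return workItemsPerGroup <= maxWorkGroupSize;
}
```

As an illustration with assumed numbers: for a 1024 x 1024 global range and a device limit of 256, a local range of 16 x 16 passes this check, while 1024 / 16 = 64 per dimension gives 64 x 64 = 4096 work-items per group and does not.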