I am getting some very weird hard lockups when running my OpenCL kernel on the GPU. On the CPU it runs correctly and produces the correct results. On the GPU it sometimes runs correctly and sometimes hangs the system so completely that a reboot is required. The strange thing is that this apparently depends on the global_work_size: sometimes the program does not crash with a larger global work size, yet hangs with a smaller one.
Overall, the kernel does one read from and five writes to global memory. The rest consists of arithmetic and bitwise operations on uint4 variables local to the kernel (I believe that makes them __private ones, not __local).
Since there is no SKA for Linux and my debugging abilities are limited, I tried commenting out code to find out what exactly causes the problem. However, the behavior is rather erratic and depends on the global work size. Basically, the hangs occur once a certain number of bitwise/arithmetic operations are performed on local variables.
My grid is 1-dimensional, and I pass NULL as the local_work_size parameter so that the OpenCL implementation chooses the most appropriate value depending on register pressure and the like. My theory is that for some reason the implementation does not calculate the register usage properly, so it picks a bad local work size, and that leads to the hard lockups (?!?).
Anyway, I did indeed work around the problem by providing a hardcoded local_work_size value (chosen so that global_work_size is divisible by it). Of course, performance dropped by about 20-30% because of that, but that's acceptable to me.
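For anyone who wants to avoid hardcoding the value, here is a rough sketch in plain C of two ways to get a local size that divides the global size. The function names pick_local_size and round_up_global are made up for this post and are not part of OpenCL, and the upper limit passed in is assumed to come from the device or kernel limits (for example, what clGetKernelWorkGroupInfo reports for CL_KERNEL_WORK_GROUP_SIZE).

```c
#include <stddef.h>

/* Hypothetical helper (not an OpenCL API): returns the largest value that is
 * <= max_local and divides global_size evenly, or 0 if either input is 0.
 * max_local is assumed to be a limit obtained from the device/kernel. */
size_t pick_local_size(size_t global_size, size_t max_local)
{
    if (global_size == 0 || max_local == 0)
        return 0;

    /* Start from the largest candidate and walk down to the first divisor. */
    size_t candidate = (max_local < global_size) ? max_local : global_size;
    while (global_size % candidate != 0)
        candidate--;            /* terminates at 1, which divides everything */
    return candidate;
}

/* Hypothetical helper (not an OpenCL API): pads global_size up to the next
 * multiple of local_size. Returns 0 if local_size is 0. */
size_t round_up_global(size_t global_size, size_t local_size)
{
    if (local_size == 0)
        return 0;
    return ((global_size + local_size - 1) / local_size) * local_size;
}
```

The first helper keeps the global size as it is, but can fall back to a tiny work-group (even 1) when the global size has no convenient divisors. The second keeps a fixed work-group size and pads the global size instead; in that case the kernel needs an extra argument with the real element count and a check such as "if (get_global_id(0) >= n) return;" so the padding work-items do not touch memory.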
I am using a Radeon HD4670 and I am wondering whether this is an OpenCL-related issue or a hardware one. I will buy a 6870 card in the next 1-2 weeks and will do some testing to see whether the problem can be reproduced on that hardware as well.
I can post the kernel code, which is only about 100-200 lines. However, the host code needed to properly set up all the parameters is much longer than that. I can try to write a simplified test case, though.