I'm having an issue with the AMD APP SDK 2.8 on a FirePro W5000 GPU. The GPU is hosted in a server running CentOS 6.3 with fglrx 8.98.2 installed. The issue is that if I access an array of floating point or long values in private memory (i.e. unqualified variables inside the kernel function) that is approximately 22400 bytes (the threshold varies by about 100 bytes each time I try), the kernel builds and launches, but then the device hangs. I'm fairly sure it's deadlocked, although I have only left it for up to a day. When I try to kill the job from the command line or using the PID, the error message "../../../thread/semaphore.cpp:87: sem_wait() failed" is displayed. The GPU remains unresponsive and only becomes available again after I reboot.
Other than updating the driver and the APP SDK (which I'm currently doing), is there something obvious that I'm missing? Is there a limit on the size of arrays that may be declared and used in private memory? If I just declare the private array, or declare it and write to it, the code runs fine. I've also tried declaring two arrays in private memory and copying from one to the other; this was also fine. The issue appears to be in copying values from a private array to a global one. Is this meant to be illegal in OpenCL 1.2?
I'm able to create and use far larger arrays on both AMD and Intel CPUs (using both the AMD APP SDK and the Apple one) and on CUDA GPUs. According to the specs reported by the W5000, the local memory size is 32768 and cache size is 16384, so I'm quite far off the first and well over the second if those are some sort of limit.
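To make the comparison concrete, here is the arithmetic as a short Python sketch. It assumes the reported sizes are in bytes, that a float is 4 bytes and that a long is 8 bytes (as in OpenCL C); the variable names are only illustrative.

```python
# Sizes taken from the description above; all assumed to be in bytes.
ARRAY_BYTES = 22400        # approximate private array size at which the hang appears
LOCAL_MEM_BYTES = 32768    # local memory size reported by the W5000
CACHE_BYTES = 16384        # cache size reported by the W5000

n_floats = ARRAY_BYTES // 4   # elements if the array holds 4-byte floats
n_longs = ARRAY_BYTES // 8    # elements if the array holds 8-byte longs

below_local = LOCAL_MEM_BYTES - ARRAY_BYTES   # bytes still free under the local memory size
above_cache = ARRAY_BYTES - CACHE_BYTES       # bytes past the cache size

print(n_floats, n_longs, below_local, above_cache)
```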
I've attached some test code which reproduces the error. The host file (AMD_test_program.py) requires PyOpenCL and NumPy to be installed, but if this makes it more difficult to diagnose the issue, I'm happy to rewrite it in C.
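For anyone reading without the attachment, the pattern described above can be sketched roughly as follows. This is not the attached code: the helper `make_kernel_source`, the kernel name `private_copy` and the argument `out` are hypothetical names chosen for illustration. The helper only builds the OpenCL C source string for a given array size, so it runs without a GPU.

```python
def make_kernel_source(n_bytes, ctype="float", elem_size=4):
    """Build OpenCL C source for a kernel that fills a private array
    and then copies it to a global buffer (the pattern that hangs)."""
    n_elems = n_bytes // elem_size
    return f"""
__kernel void private_copy(__global {ctype} *out)
{{
    {ctype} priv[{n_elems}];              /* unqualified, so it lives in private memory */
    for (int i = 0; i < {n_elems}; ++i)
        priv[i] = ({ctype})i;             /* write to the private array */
    size_t base = get_global_id(0) * {n_elems};
    for (int i = 0; i < {n_elems}; ++i)
        out[base + i] = priv[i];          /* copy from private to global memory */
}}
"""

# Source for the approximate size at which the hang was observed.
source = make_kernel_source(22400)
print(source)
```

The resulting string could then be handed to `pyopencl.Program(ctx, source).build()`, with `out` backed by a buffer large enough to hold `n_elems` values per work-item.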
OK, tested with the latest drivers for Linux (fglrx 13.15.3) and the 2.9 APP SDK. The above code now runs until the array is 32768 Bytes big, and then it fails to build with an error stating "Error:E012:Insufficient Local Resources!". This makes more sense as it matches the reported local memory size on the GPU.
However, I am still getting similar behaviour to before in the original code which prompted this post: when a kernel contains large private arrays, the code for the GPU builds, but the device becomes unresponsive after launch. I'll see if I can find an isolatable cause, but would appreciate any suggestions in the meantime.
As per the OpenCL spec, I don't think there is any limit on private memory. A GPU implementation first tries to store private memory in the register file if it fits; otherwise it spills to main memory. If that is the case here, you should still get the result (but it would be slower). The unresponsive behaviour may be due to some issue in your code.
Can you send your code in C? I would like to test it with fglrx 8.98.2 and SDK 2.8.