I'm having an issue with the AMD APP SDK 2.8 on a FirePro W5000 GPU. The GPU is hosted in a server running CentOS 6.3 with fglrx 8.98.2 installed. The issue is that if I access an array of floating point or long values in private memory (i.e. unqualified variables inside the kernel function) that is approximately 22400 bytes (the threshold varies by about 100 bytes each time I try), the kernel builds and launches, but then the device hangs. I'm fairly sure it's deadlocked, although I have only left it for up to a day. When I try to kill the job from the command line or using the PID, the error message "../../../thread/semaphore.cpp:87: sem_wait() failed" is displayed. The GPU remains unresponsive and only becomes available again after I reboot.
Other than updating the driver and the APP SDK (which I'm currently doing), is there something obvious that I'm missing? Is there a limit on the size of arrays that may be declared and used in private memory? If I just declare the private array, or declare it and write to it, the code runs fine. I've also tried declaring two arrays in private memory and copying from one to the other; this was also fine. The issue appears to be in copying values from a private array to a global one. Is this meant to be illegal in OpenCL 1.2?
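To make the pattern concrete, here is a minimal sketch of the kind of kernel I mean. It is not the attached test code: the kernel name `private_to_global`, the argument `out`, the array name `scratch`, and the helper `make_kernel_source` are all made up for illustration, and I'm assuming a float array (4 bytes per element). The helper only generates the OpenCL C source for a given private array size, so it runs without a GPU.

```python
# Illustrative sketch only: names are hypothetical, not taken from the attached code.
FLOAT_BYTES = 4  # sizeof(float) in OpenCL C


def make_kernel_source(private_bytes):
    """Return OpenCL C source for a kernel that fills a private float array
    of roughly `private_bytes` bytes and then copies it to global memory."""
    n = private_bytes // FLOAT_BYTES  # number of float elements
    return f"""
__kernel void private_to_global(__global float *out)
{{
    float scratch[{n}];              /* private (unqualified) array */
    size_t gid = get_global_id(0);
    for (int i = 0; i < {n}; ++i)    /* declare and write: runs fine for me */
        scratch[i] = (float)i;
    for (int i = 0; i < {n}; ++i)    /* private to global copy: the step that hangs */
        out[gid * {n} + i] = scratch[i];
}}
"""


# About 22400 bytes is where I see the hang; that is 5600 floats.
source = make_kernel_source(22400)
print(source)
```

The resulting string can be built with PyOpenCL via `pyopencl.Program(ctx, source).build()`, with `out` sized to hold one copy of the array per work-item. Varying the argument to `make_kernel_source` makes it easy to search for the size at which the hang starts.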
I'm able to create and use far larger arrays on both AMD and Intel CPUs (using both the AMD APP SDK and the Apple one) and on CUDA GPUs. According to the specs reported by the W5000, the local memory size is 32768 bytes and the cache size is 16384 bytes, so I'm well under the first and well over the second, if either of those is some sort of limit.
I've attached some test code which reproduces the error. The host file (AMD_test_program.py) requires PyOpenCL and NumPy to be installed, but if this makes it more difficult to diagnose the issue, I'm happy to rewrite it in C.