Our application launches a few OpenCL kernels in a loop, with each iteration waiting for the previous one to complete (clFinish). One of the kernels is quite complex and uses nearly 18 kB of private memory per work item. We had a very hard time making it work on the AMD platform (there were no significant problems on nVidia or Intel). The application ran fine for a few iterations of the loop, and then enqueuing the complex kernel suddenly started returning an "out of resources" error. Compilation and the first enqueue calls all succeeded. Finally, we tried replacing the __private memory buffers with per-work-item slices of a __global buffer (reducing __private usage to about 3 kB per work item), and it started working on AMD as well.
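To make the workaround concrete, here is a minimal host-side C sketch of the slicing arithmetic, not our actual kernel code. The function names and SCRATCH_BYTES_PER_ITEM are hypothetical, and the 15 kB figure is only an assumption derived from the numbers above (roughly 18 kB before minus 3 kB after). In the kernel itself, gid would come from get_global_id(0) and the buffer would be a __global pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical size: scratch data moved out of __private, per work item.
 * Assumed to be about 18 kB - 3 kB = 15 kB, based on the figures above. */
#define SCRATCH_BYTES_PER_ITEM ((size_t)15 * 1024)

/* Total size of the scratch buffer to allocate for a given number of
 * work items (the global work size of the complex kernel). */
size_t scratch_buffer_bytes(size_t global_work_size)
{
    return global_work_size * SCRATCH_BYTES_PER_ITEM;
}

/* Start of the slice that belongs to one work item. Each work item only
 * touches its own slice, so the slices never overlap. */
unsigned char *scratch_slice(unsigned char *scratch,
                             size_t global_work_size,
                             size_t gid)
{
    assert(gid < global_work_size);  /* gid must be a valid work-item index */
    return scratch + gid * SCRATCH_BYTES_PER_ITEM;
}
```

With this layout, the buffer grows linearly with the global work size, so 1024 work items would need about 15 MB of global memory under the assumed slice size.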
My question: is there a limit on private memory size? I'd like to know whether we have actually fixed the issue in our code (by reducing private memory usage) or only hidden one of the side effects of a bug that is still there.
All of this happened on Ubuntu Linux 12.04 with the following driver:
[    6.750882] <6>[fglrx] module loaded - fglrx 13.35.5 [Mar 12 2014] with 1 minors
When we tried on Windows 7, the graphics driver crashed every time.