Our application launches a few OpenCL kernels in a loop, each iteration waiting for the previous one to complete (clFinish). One of the kernels is quite complex and uses nearly 18 kB of private memory per work item. We had a very hard time making it work on the AMD platform (no significant problems with nVidia or Intel). The application ran OK for a few iterations of the loop, and then enqueuing the complex kernel suddenly started returning an "out of resources" error. Compilation and the first enqueue calls were all OK. Finally we tried replacing the __private memory buffers with per-work-item pieces of a __global buffer (reducing __private usage to about 3 kB per work item), and it started working on AMD as well.
My question: is there any private memory size limit? I'd like to know whether we have fixed the issue in our code (by reducing private memory usage) or only fixed a side effect of some bug which is still there.
All of this was happening on Ubuntu Linux (12.04) with the following driver:
[    6.750882] <6>[fglrx] module loaded - fglrx 13.35.5 [Mar 12 2014] with 1 minors
When we tried with Windows 7, the graphics driver always crashed.
When the private memory size exceeds the number of registers available per thread on the device, the rest of the private memory is automatically spilled to global memory space. With this behavior, your code should work with 18 kB of private memory on AMD devices as well.
We are trying to reproduce the issue to find out why the code is crashing. In the meantime, if you can share the code that is crashing, it would be a great help.
We wrote a kernel in which each work item has a private buffer of size 22 kB, and iterated the kernel 1000 times. The kernel works fine on our side.
Could you share your code (or better still, a bare minimum code that captures this error, so that a quick debugging can be done) with us?
Hi, we have been doing some simple math here (which we probably should have done before). The kernel was tested on an R9 290X, so I assume there can be up to 112640 work items resident in parallel (44 CUs x 40 wavefronts x 64 work items per wavefront). If each of our work items uses 18 kB of memory, this yields a total of ~2 GB. We were also doing some allocations, so I assume this may be the source of the problem: the first launch of the kernel uses one big buffer (triggering its allocation), then the second launch triggers allocation of another big buffer, causing the trouble (we need several sets of buffers to be able to interleave transfers with computations).
Does it make sense that a kernel launch may trigger a big buffer allocation, so that there is not enough memory left for the kernel's private memory and the enqueue call returns CL_OUT_OF_RESOURCES?
The code is closed source, so we cannot share it.
We didn't do the experiments, because we re-implemented the kernel so that it does not use private memory. In our case, we think the problem was the following: the application allocated a lot of memory and launched the first few kernels, which used a lot of private memory. This succeeded because there was still enough room in the GPU's RAM to accommodate the private memory. Then our application allocated more buffers in GPU global memory, which caused subsequent kernels using private memory to fail, because there was not enough memory left to meet their private memory requirements.
Now our application re-uses the global memory buffers also for storage of the variables that were originally in the private memory buffers. This is noticeably slower than before, but it is reliable.