Hello all, first post here.
I have a question/request for AMD engineers and am hoping to get some answers. I've been playing with OpenCL for a couple of months now and so far really like what I am seeing. However, there is one glaring flaw/limitation in the hardware of every major GPU today. The flaw is further compounded by a decision AMD made and I was hoping to get some clarity on it.
The issue: __local memory.
Every OpenCL document from either vendor stresses the importance of doing your processing in local, on-chip memory, then writing the results back to device memory when finished. Both in the literature and in practice, this is the single most important way to realize the benefits of GPU computing. Without proper use of local memory, the performance of the GPU is pretty much in line with the CPU; I've confirmed this in my own experiments. In short, GPU computing is almost entirely about using local memory, even more so than it is about parallelizing things. Local memory is *the* GPU computing issue.
1) Given the importance of local memory, why is the amount of it so incredibly small? Is it that expensive to manufacture? I would like to see at least 1MB per processor, yet we are stuck with a measly 64k. And only recently did AMD bump it up to that. I believe it was a much smaller value for quite a while, rendering it essentially useless.
2) Furthermore, if it's so critical, why does AMD limit the amount a kernel can allocate to 32k, even though each processor can support 64k? This seems like a somewhat arbitrary decision and I am having to put checks in my OpenCL code to allocate different amounts of memory depending on the card vendor.
3) Given 1 & 2, does AMD plan to increase the size of local memory at any point in the future? And in the meantime, does AMD plan to relax the 32k local memory limit and allow the programmer to use 48k or more per processor?
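On the vendor check mentioned in 2): one way to avoid branching on the vendor name is to size the buffer from whatever the device itself reports. Below is a minimal sketch in C of that idea, not what my actual code does. It assumes the byte count has already been read with clGetDeviceInfo and CL_DEVICE_LOCAL_MEM_SIZE. The helper name pick_local_floats and its reserve_bytes parameter are hypothetical, made up for illustration.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of OpenCL): decide how many floats to
 * stage in __local memory for one work-group.
 *
 * local_mem_bytes  - assumed to be the value the device reports for
 *                    CL_DEVICE_LOCAL_MEM_SIZE (queried via clGetDeviceInfo)
 * reserve_bytes    - local memory kept back for other __local variables
 * work_group_size  - number of work-items in the work-group
 *
 * Returns the largest float count that fits in the remaining space and
 * is a multiple of work_group_size, so every work-item gets an equal
 * share. Returns 0 if nothing fits or the inputs make no sense.
 */
size_t pick_local_floats(unsigned long long local_mem_bytes,
                         size_t reserve_bytes,
                         size_t work_group_size)
{
    if (work_group_size == 0 || local_mem_bytes <= reserve_bytes) {
        return 0;
    }

    /* Bytes left after the reserved portion is set aside. */
    size_t usable = (size_t)(local_mem_bytes - reserve_bytes);

    /* Whole floats that fit in the usable space. */
    size_t floats = usable / sizeof(float);

    /* Round down to a multiple of the work-group size. */
    return floats - (floats % work_group_size);
}
```

With 4-byte floats, a device reporting 32768 bytes, nothing reserved, and a work-group of 256 gives 8192 floats. The resulting byte count (floats * sizeof(float)) is what would be passed as the size to clSetKernelArg with a NULL value pointer for a __local kernel argument, so the same host code runs on a 32k card and a 48k card without a vendor check.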
If I had one recommendation for the major GPU vendors, it would be to do everything in their power to increase on-chip local memory, as it is the single most important thing in OpenCL programming. Doing so would have far more impact than any other enhancement.
Any info is appreciated, thanks.