
mfeemster
Journeyman III

A question/appeal to AMD engineers regarding local memory.

Hello all, first post here.

I have a question/request for AMD engineers and am hoping to get some answers. I've been playing with OpenCL for a couple of months now and so far I really like what I am seeing. However, there is one glaring limitation in the hardware of every major GPU today, and it is further compounded by a decision AMD made that I am hoping to get some clarity on.

The issue: __local memory.

Every OpenCL document from either vendor stresses the importance of using local, on-chip memory to do your processing, then writing the results back to device memory when finished. Both in the literature and in practice, this is the single most important way to realize the benefits of GPU computing. Without proper use of local memory, GPU performance is pretty much in line with the CPU; I've confirmed this in my own experimentation. In short, GPU computing is almost entirely about using local memory, even more so than it is about parallelizing things. Local memory is *the* GPU computing issue.
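To illustrate the pattern I mean (a minimal sketch of my own, not taken from any vendor document; the kernel name and the naive 1D box filter are just for demonstration): each work-item stages data into __local memory, the workgroup synchronizes, the computation reads from the fast on-chip copy, and only the final result goes back to global memory.

```c
// Stage a tile of input in __local memory, synchronize, compute from the
// fast on-chip copy, then write results back to global memory once.
__kernel void box_filter(__global const float *in,
                         __global float *out,
                         __local float *tile)   // sized by the host at launch
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    // 1) Each work-item copies one element from slow global memory.
    tile[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);   // wait until the whole tile is staged

    // 2) Neighbour accesses now hit on-chip local memory, not device memory.
    float sum = tile[lid];
    if (lid > 0)                     sum += tile[lid - 1];
    if (lid < get_local_size(0) - 1) sum += tile[lid + 1];

    // 3) Write the finished result back to global memory once.
    // (Tile edges are handled naively in this sketch.)
    out[gid] = sum / 3.0f;
}
```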

The questions:

1) Given the importance of local memory, why is the amount of it so incredibly small? Is it that expensive to manufacture? I would like to see at least 1MB per processor, yet we are stuck with a measly 64k. And only recently did AMD bump it up to that. I believe it was a much smaller value for quite a while, rendering it essentially useless.

2) Furthermore, if it's so critical, why does AMD limit the amount a kernel can allocate to 32k, even though each processor can support 64k? This seems like a somewhat arbitrary decision and I am having to put checks in my OpenCL code to allocate different amounts of memory depending on the card vendor.
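For reference, here is roughly what those checks reduce to in my host code once I query the limit instead of hard-coding it per vendor (a sketch, assuming the usual CL/cl.h header location; the helper name and the headroom value are arbitrary, and error handling is omitted):

```c
#include <stdio.h>
#include <CL/cl.h>

/* Query the device's actual local memory size at run time rather than
   branching on the vendor string. */
size_t usable_local_bytes(cl_device_id dev)
{
    cl_ulong local_mem = 0;
    clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem), &local_mem, NULL);
    /* Leave headroom: the runtime may reserve a few bytes for itself. */
    return (size_t)(local_mem - 1024);
}

/* Usage: pass the result to clSetKernelArg for a __local argument, e.g.
   clSetKernelArg(kernel, 2, usable_local_bytes(dev), NULL);            */
```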

3) Given 1 & 2, does AMD plan to increase the size of local memory at any time in the future? And in the meantime, does AMD plan to relax the 32k local memory limit and allow for the programmer to use 48k or more per processor?

If I had one recommendation for the major GPU vendors, it would be to do everything in your power to increase on-chip local memory, as it is the single most important factor in OpenCL programming. Doing so would have far more impact than any other enhancement.

Any info is appreciated, thanks.

4 Replies
LeeHowes
Staff

1) How much cache do you get per core on a CPU? SRAM isn't free; it already takes up a huge percentage of the area of a chip. In this case we have 256kB of registers, 64kB of allocatable SRAM, 16kB of L1 cache and another few kB of instruction cache per core. We then put 32 cores on the chip. That's a fair amount of memory over the chip as a whole. Then you add the L2 cache (512kB, I think, off the top of my head) and it's comparable to the amount of cache you get on an 8-core CPU die of a similar process generation.
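To make those figures concrete, here is the arithmetic they imply (taking the per-core numbers above at face value; the L2 figure is, as said, a rough recollection):

\[
\begin{aligned}
\text{per core:}\quad & 256\,\mathrm{kB} + 64\,\mathrm{kB} + 16\,\mathrm{kB} \approx 336\,\mathrm{kB}\\
\text{32 cores:}\quad & 32 \times 336\,\mathrm{kB} \approx 10.5\,\mathrm{MB}\\
\text{with L2:}\quad & 10.5\,\mathrm{MB} + 512\,\mathrm{kB} \approx 11\,\mathrm{MB}\ \text{of on-chip SRAM}
\end{aligned}
\]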

2) It reduces contention over ports and reduces the amount of area dedicated to the wiring needed to access it. There are tradeoffs to any design. Given that a wavefront is mapped permanently to a SIMD unit, its access needs for LDS aren't going to change from one port to another. Equally, given that you need multiple workgroups per CU to occupy the machine anyway, and workgroups can't share LDS, in the general case you really won't benefit from using half of the machine's compute resources as your trade for being able to access double the LDS per workgroup. There will be counterexamples, but apparently not enough to justify the extra area taken up by the circuits to deal with it.

3) Maybe. Caches tend to increase in size with process generation. It is not a bad assumption, but it is a roadmap question that we couldn't comment on.

A lot of developers would rather drop LDS entirely and rely on L1 cache - no question is without its tradeoffs.


Thanks Lee, I appreciate the info.

I understand you guys can't comment on future architectures, but it's pretty clear that memory bandwidth is the main bottleneck with GPU computing. So if there were one recommendation/request I could make, it would be to focus on improving memory latency and/or increasing LDS/cache sizes.


Memory bandwidth is the problem with CPU computing as well; it is not unique to GPU computing. It is the case with virtually any computing paradigm.

In GPGPU, the programmer has direct control over which variables to cache and when. This is a great deal of flexibility. When working with a CPU, the programmer does not have the flexibility of dictating what goes into the cache; the CPU usually handles this at run time.

However, the x86 architecture does offer cache management instructions. Using them will make the code look very complex and possibly unreadable.
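For illustration, here is roughly what those instructions look like via the usual SSE compiler intrinsics (a sketch, not production code; it assumes dst is 16-byte aligned, n is a multiple of 4, and the prefetch distance of 64 floats is an arbitrary choice):

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: prefetch, streaming stores */

/* Copy while steering the cache by hand: prefetch source lines ahead of
   use, and write the destination with non-temporal stores that bypass
   the cache instead of polluting it. */
void copy_with_cache_hints(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        /* Ask the hardware to start fetching a line we'll need soon. */
        _mm_prefetch((const char *)(src + i + 64), _MM_HINT_T0);
        __m128 v = _mm_loadu_ps(src + i);   /* load 4 floats */
        _mm_stream_ps(dst + i, v);          /* store around the cache */
    }
    _mm_sfence();  /* make the streamed stores globally visible */
}
```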


Isn't memory bandwidth the main benefit of GPU computing for a lot of applications? Discrete GPUs have enormous memory bandwidth available.

Or do you mean the communication over the PCI Express bus? In that case, APUs are how we are solving that problem. The tradeoff, of course, is that the actual memory bandwidth available to the GPU is lower on an APU than on a discrete card, so it all depends on what your needs are.
