
ankhster
Adept II

Workgroup Allocations

I've been trying to work out how to reliably access the local memory that I reserve for each workgroup through a kernel argument, but I'm getting somewhat ambiguous results. I've tried searching for articles relating to this, to no avail.

I'm working with a large data set that produces some 1944 bytes (486 uints) per work item, rounded up to a 2048-byte boundary, so I'm looking to limit the number of work items per workgroup to 16 to avoid overflowing local memory.
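
To make the sizes concrete, this is roughly how I'm setting it up (the kernel and argument names here are just illustrative): the local buffer is sized on the host as work items per group times the padded per-item size.

    /* Host side: reserve 16 * 2048 = 32768 bytes of local memory for the kernel's
       __local argument (argument index 2 in this illustrative signature). */
    size_t local_bytes = 16 * 2048;
    clSetKernelArg(kernel, 2, local_bytes, NULL);

    /* Kernel side: each of the 16 work items gets a 2048-byte (512-uint) slice. */
    __kernel void process(__global const uint *in,
                          __global uint *out,
                          __local uint *scratch)   /* 32768 bytes, passed from the host */
    {
        __local uint *mine = scratch + get_local_id(0) * 512;   /* 512 uints = 2048 bytes */
        /* ... fill mine[0..485] with the 486 uints produced for this work item ... */
    }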

I know that I have 32768 bytes of local memory available to each workgroup on a 7970, and I can access it without problems in workgroups 0 to 31. My question is: what happens when I have 2048 workgroups, and how is the reserved memory addressed?

My belief was that workgroup 32 (and 64, 96, 128, etc.) would end up accessing the same local memory as workgroup 0, i.e. group_id & 31, but I cannot seem to establish whether or not this is the case.

While I would only have 16 work items, referencing local memory by local_id << 11 (a 2048-byte stride), it could also be possible to use 256 work items, referencing local memory by (local_id & 15) << 11 and using atomic adds.
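
For the 256-work-item variant, this is the kind of thing I mean (illustrative only; the real accumulation is elided):

    /* 256 work items share 16 slots of 2048 bytes (512 uints) each;
       16 work items map onto each slot, so updates go through atomic_add. */
    __kernel void accumulate(__global const uint *in, __local uint *scratch)
    {
        const size_t lid = get_local_id(0);
        __local uint *slot = scratch + ((lid & 15) << 9);   /* (lid & 15) * 512 uints = 2048-byte stride */

        if (lid < 16)                        /* one work item per slot clears the first counter */
            slot[0] = 0;
        barrier(CLK_LOCAL_MEM_FENCE);

        atomic_add(&slot[0], in[get_global_id(0)]);
    }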

Any clarifications and insight to how I can best tackle this problem would be greatly appreciated.

1 Solution

32k per 16-item workgroup is going to *seriously* underutilise the device. On a GCN GPU the best you can expect from that is 1/8th of peak (assuming that 2x32k can fit in LDS at once, which it may not if any is used by the compiler), because most of the time the ALUs would be idle, even with perfect memory fetching.


Local memory is allocated per workgroup. As a workgroup is issued to the device its local memory is allocated, and as the workgroup completes it is freed. So the local memory for workgroups > 32 doesn't exist until earlier groups complete (and when they do, their allocations no longer exist).
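
In kernel code that means each workgroup simply indexes its own __local buffer from 0; there is no group_id & 31 wrapping to apply, and no way to reach another group's allocation. A minimal sketch (names are illustrative):

    __kernel void per_group(__global uint *out, __local uint *scratch)
    {
        const size_t lid = get_local_id(0);

        /* scratch[0..] is this workgroup's own allocation, whether this is
           group 5 or group 2000; it is indexed identically in every group. */
        scratch[lid] = (uint)(get_group_id(0) * get_local_size(0) + lid);
        barrier(CLK_LOCAL_MEM_FENCE);

        out[get_global_id(0)] = scratch[lid];
    }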
