
ankhster
Adept II

Workgroup Allocations

I've been trying to work out how to reliably access the local memory for each workgroup that I reserve through an argument parameter, but I'm getting somewhat ambiguous results. I've tried searching for articles relating to this, to no avail.

Working with a large data set that produces some 1944 bytes (486 uints) per work item, rounded up to a 2048-byte boundary, I'm looking to limit the number of work items per workgroup to 16 to prevent overflow.

I know that I have 32768 bytes available to each workgroup on a 7970, which I can access without problem for workgroups 0 to 31. My question is: what happens when I have 2048 workgroups, and how is the reserved memory addressed?

I believed that addressing workgroup 32 (and 64, 96, 128, etc.) would access the local memory of workgroup 0, i.e. group_id & 31, but I cannot seem to establish whether or not this is the case.

While I would only have 16 work items, where I'm looking at referencing local memory by local_id << 11, it could be possible to use 256 work items, referencing local memory by (local_id & 15) << 11 and using atomic adds.
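
To make the two schemes concrete, here is a minimal kernel sketch (the kernel and names are illustrative, not my actual code). It assumes the 32KB __local buffer is passed as a kernel argument; a byte offset of local_id << 11 corresponds to local_id << 9 when indexing a uint pointer.

__kernel void addressing_sketch(__global uint *out, __local uint *lds)
{
    uint lid = get_local_id(0);

    // Zero the 16 sub-bank base slots this sketch touches.
    if (lid < 16)
        lds[lid << 9] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Scheme 1: 16 work items per group, each owning one 2048-byte
    // (512-uint) sub-bank, so plain stores are safe:
    //     lds[lid << 9] = value;

    // Scheme 2: 256 work items per group, 16 items sharing each
    // sub-bank, so updates must be atomic:
    atomic_add(&lds[(lid & 15) << 9], (uint)get_global_id(0));

    barrier(CLK_LOCAL_MEM_FENCE);
    if (lid == 0)
        out[get_group_id(0)] = lds[0];   // keep the result observable
}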

Any clarifications and insight to how I can best tackle this problem would be greatly appreciated.

13 Replies
binying
Challenger

"I've been trying to work out how to reliably access local memory for each workgroup that I reserve through an argument parameter, but getting somewhat ambiguous results."

--Could you post a minimal kernel that shows this issue?


Certainly...

I've arranged an example that initializes the local memory for the first 32 groups, then atomically stores the global_id according to the group_id and local_id. Each group's 32KB of local memory (for the first 32 groups) is split into 16 sub-banks of 512 uints (2048 bytes) each. The sub-bank base is derived by (lid << 9), as there are 16 work items in each group. I'm adding an offset of (gid >> 5) so that the first 512 numbers are stored at offset 0 of each sub-bank and the second 512 numbers at offset 1.

Here is the host code with the essentials - setup, error checking and tear-down omitted for brevity.

currentThreads = 1024;
localUnits = 16;

outDevice = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(cl_uint) * 8192 * 32, NULL, NULL);    // 1MB global memory
outBuffer = (cl_uint*) malloc(sizeof(cl_uint) * 8192 * 32);                                         // 1MB host memory

err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &outDevice);
err |= clSetKernelArg(kernel, 1, sizeof(cl_uint) * 8192, NULL);    // 32KB local memory

err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &currentThreads, &localUnits, 0, NULL, NULL);
err = clEnqueueReadBuffer(commands, outDevice, CL_TRUE, 0, sizeof(cl_uint) * 8192 * 32, outBuffer, 0, NULL, NULL);
clFinish(commands);

for(x = 0; x < 32; x++)            // Group local memory
{
    printf("Group-SubBank\n");
    for(y = 0; y < 16; y++)        // Sub-bank
    {
        printf("%02X-%02X: ", x, y);
        for(z = 0; z < 8; z++)     // Offset
            printf("%u\t ", outBuffer[((x * 8192) + (y * 512) + z)]);
        printf("\n");
    }
    printf("\n");
}

for(x = 0; x < 262144; x++)    // This looks for numbers derived from groups 32 to 63 but finds none
    if(outBuffer[x] > 511)
        printf("%u\n", x);

Here is the kernel.

#define DATA_SIZE 512 // 512 uints = 2048 bytes

__kernel void memTest(__global uint *outBuffer2,
                      __local uint *outBuffer)
{
    int lid = get_local_id(0);
    int gid = get_group_id(0);
    int gbl = get_global_id(0);
    uint x;

    if((lid == 0) && (gid < 32))
        for(x = 0; x < 8192; x++)    // 8192 uints = 32768 bytes
            outBuffer[x] = 0;

    barrier(CLK_LOCAL_MEM_FENCE);    // Wait for initialization

    atomic_add(outBuffer + (lid << 9) + (gid >> 5), gbl);

    printf("Gbl: %u, Grp: %u, Loc: %u - Writing group %u, absolute %u, sub-bank %u, position %u\n", gbl, gid, lid, gid & 0x1F, (lid << 9) + (gid >> 5), lid, (gid >> 5));    // Make sure it looks ok

    barrier(CLK_LOCAL_MEM_FENCE);    // Wait for completion

    if((lid == 0) && (gid < 32))     // Write to global memory
        for(x = 0; x < 8192; x++)
            outBuffer2[gid * 8192 + x] = outBuffer[x];
}

The first 512 numbers are stored in their correct places - in the correct offset of the correct sub-bank of the correct group local memory. However, the numbers 512 to 1023 are nowhere to be seen, as verified by the last loop of the host program.

Hope this helps to explain what I'm trying to do.


I don't quite understand what you are doing, but you have (gid < 32) in your if statements; wouldn't this stop workgroups 32-63 from writing to the output buffer? (Also, your buffer is not large enough.) You have a global size of 1024 and a local size of 16, so you will have workgroups 0 to 63, but then you stop at gid < 32.

Also, (lid << 9) would result in (for local IDs 0 to 15) 0, 512, 1024, 1536, 2048, 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680. So the first 32 workgroups (0-31) would add gbl to the same places in outBuffer, 32 times over? Then if you allowed gid >= 32 it would go to 1, 513, etc. for groups 32-63... Many of the places wouldn't be used at all; am I calculating something wrong? Did you want (lid*gid)+lid instead?

Then each workgroup calculates 16 values, but at the end you are writing to global memory with a loop that goes up to 8192? Of course, to go beyond 32 workgroups (up to 64) with (gid * 8192 + x), you need to allocate double the memory for the output array.

Am I understanding your program wrong?

yurtesen wrote:

I don't quite understand what you are doing, but you have (gid < 32) in your if statements; wouldn't this stop workgroups 32-63 from writing to the output buffer? (Also, your buffer is not large enough.) You have a global size of 1024 and a local size of 16, so you will have workgroups 0 to 63, but then you stop at gid < 32.

From previous results, I mistakenly thought that local data for workgroups >= 32 might be mapped to a corresponding workgroup's memory via a bitwise AND mask.

yurtesen wrote:

Also, (lid << 9) would result in (for local IDs 0 to 15) 0, 512, 1024, 1536, 2048, 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680. So the first 32 workgroups (0-31) would add gbl to the same places in outBuffer, 32 times over? Then if you allowed gid >= 32 it would go to 1, 513, etc. for groups 32-63... Many of the places wouldn't be used at all; am I calculating something wrong? Did you want (lid*gid)+lid instead?

No, it was an example using the data range that I expect to use while only storing a small subset of the data, so that only a few places need be displayed for each sub-bank to verify. I'll expand on this in my next post.

yurtesen wrote:

Then each workgroup calculates 16 values, but at the end you are writing to global memory with a loop that goes up to 8192? Of course, to go beyond 32 workgroups (up to 64) with (gid * 8192 + x), you need to allocate double the memory for the output array.

Am I understanding your program wrong?

You understood correctly, and I was mistaken in my understanding of how the local memory operates, as previously mentioned.


If you printf outBuffer[x] before "outBuffer2[gid * 8192 + x] = outBuffer[x];", is the output what you expected?



binying wrote:

If you printf outBuffer[x] before "outBuffer2[gid * 8192 + x] = outBuffer[x];", is the output what you expected?

The printf debug line in the kernel outputs what I expect. More on this in my next post.


Thanks for looking into this.

I've tried manipulating memory within workgroups >= 32 in my main program, but I've been getting spurious results. I retried with this example (removing the gid < 32 condition), increasing the parameters (see below) to be similar to my main program, and it worked as expected. Doh! I guess I need to go back to my main program and look at it some more.

host:

currentThreads = 32768;

outDevice = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(cl_uint) * 8192 * 2048, NULL, NULL);
outBuffer = (cl_uint*) malloc(sizeof(cl_uint) * 8192 * 2048);

err = clEnqueueReadBuffer(commands, outDevice, CL_TRUE, 0, sizeof(cl_uint) * 8192 * 2048, outBuffer, 0, NULL, &read_event);

// snip

for(x = 0; x < 2048; x++)          // Group local memory
{
    printf("Group-SubBank\n");
    for(y = 0; y < 16; y++)        // Sub-bank
    {
        printf("%04X-%02X: ", x, y);
        for(z = 0; z < 64; z++)    // Offset
            printf("%04X\t ", outBuffer[((x * 8192) + (y * 512) + z)]);
        printf("\n");
    }
    printf("\n");
}

kernel:

if(lid == 0)
    for(x = 0; x < 8192; x++)    // 8192 uints = 32768 bytes
        outBuffer[x] = 0;

// snip

if(lid == 0)
    for(x = 0; x < 8192; x++)
        outBuffer2[gid * 8192 + x] = outBuffer[x];

In my main program, I run 32768 threads as 2048 workgroups of 16 work items each. Unfortunately, the number of work items is so low because of the size of the data set produced; if I tried to do more within a work unit then I'd have to use ulongs, halving the number of work items.

Maybe some extra insight into how the local memory for workgroups >= 32 is handled would help me. As I'm allocating 32KB of local memory per workgroup and there are 2048 workgroups, this would amount to 64MB of local data. Obviously I don't have that amount of local memory on board, so the kernel must be swapping local memory out to global (or host?) memory. The question is: is there a way to determine how this is actually processed? I.e. would it process the first 32 workgroups, then the next 32, or would it process everything in one go, paging local memory in and out all over the place? The latter would not be good, as each work item iterates about 1000 times; that would potentially result in over 4 million page swaps per kernel run - ouch!

Note that while I'm using powers of two to manipulate the number of work items, I'm actually working with primes. While possible, it wouldn't be a simple case of reducing the workload by breaking the kernel down into smaller parts. I've recently discovered the topic of overlapping kernels - maybe this could be a way to effectively reduce the size while hiding the cost of the additional, expensive kernel calls.


32k per 16-item group is going to *seriously* underutilise the device. On a GCN GPU the best you can expect from that is 1/8th of peak (assuming that 2x32k can fit in LDS, which it may not if any is used by the compiler), because most of the time the ALUs would be idle, even with perfect memory fetching.


Local memory is allocated per workgroup: it is allocated as the workgroup is issued to the device, and freed as the workgroup completes. So the local memory for workgroups beyond the first 32 doesn't exist until earlier groups complete (and when they do, their allocations no longer exist).
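
If you want to check the compiler-usage caveat above, a host-side sketch along these lines would show the actual per-group LDS footprint against the device limit. It assumes your kernel and device handles already exist, and note that CL_KERNEL_LOCAL_MEM_SIZE only accounts for a __local kernel argument after clSetKernelArg has been called for it:

cl_ulong kernelLds = 0, deviceLds = 0;

// How much local memory the compiled kernel consumes per workgroup,
// including anything the compiler itself uses.
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                         sizeof(kernelLds), &kernelLds, NULL);

// How much local memory the device offers per compute unit.
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(deviceLds), &deviceLds, NULL);

printf("Kernel uses %lu of %lu bytes of local memory\n",
       (unsigned long)kernelLds, (unsigned long)deviceLds);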


Thank you for your contribution.

I appreciate that each workgroup is going to be severely impeded in reaching its full potential; my design obviously needs a serious overhaul to get the most out of the GPU. However, my main hurdle was gaining a better understanding of how the workgroups participate throughout the execution of the kernel. Knowing that each workgroup starts and completes depending on availability simplifies things significantly. Thank you for that.

As it currently stands in my example, I don't really need the atomic_add, as each work item has its own sub-bank in local memory and each group has its own memory space that is allocated and freed on issue and completion. This will be my first change, if I can reliably add more work items to better utilize local memory without causing unnecessary overflows in my results.

I currently have 80, 81 or 97 (local) memory locations that I want each work item to update, depending on the model that I use. Would there be a significant performance impact from this many atomic_adds / atomic_incs per work item if I were to raise the number of work items to 256 per group? Maybe I can keep my current model of sub-bank mapping (2KB per sub-bank) and assign 16 work items per sub-bank with sbank = (lid & 0xF0) << 5, assuming I'm still using uints? This, I hope, would allow more work items to access local memory simultaneously, rather than reducing the local memory allocation to 2KB and having all 256 work items fighting for access.
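
To be concrete, here is a minimal sketch of the mapping I have in mind, assuming 256 work items per group and the full 32KB __local buffer; the kernel name, the output write and the fixed count of 80 are purely illustrative:

__kernel void subbank_sketch(__global uint *out, __local uint *lds)
{
    uint lid   = get_local_id(0);      // 0..255
    uint sbank = (lid & 0xF0) << 5;    // (lid / 16) * 512: sub-bank base in uints

    // Zero the group's 32KB (8192 uints) cooperatively, 32 uints each.
    for (uint i = lid; i < 8192; i += 256)
        lds[i] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    // 16 work items share each sub-bank, so the 80-97 updates per
    // item must be atomic.
    for (uint i = 0; i < 80; i++)
        atomic_inc(&lds[sbank + i]);

    barrier(CLK_LOCAL_MEM_FENCE);
    if (lid == 0)
        out[get_group_id(0)] = lds[0];   // keep the result observable
}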


When we were working on sorting code on Evergreen chips, we found that atomics in LDS were faster than allocating multiple bins and reducing later. LDS atomics on the AMD chips can run at or near full rate (and if you don't need the return value they are also asynchronous with the wavefront, I think, so you get some extra overlapping of work for free). We ended up reducing the number of banks to a pretty low number. To be sure, though, you'd have to do some experiments, I think. If your collision rate is high then you may find that multiple banks work better. If your collision rate is higher than expected from random collisions, though, you can probably find a way to skew the bank layout to reduce it.
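
For comparison, here is a minimal sketch of the multiple-bins-and-reduce alternative, assuming a 256-counter histogram replicated four times in LDS; all names and sizes are illustrative, and the copy count is exactly the sort of thing you'd tune with the experiments mentioned above:

#define HIST_SIZE  256
#define NUM_COPIES 4   // replicated bins; tune experimentally

__kernel void hist_sketch(__global const uint *data,
                          __global uint *hist,
                          __local uint *bins)   // HIST_SIZE * NUM_COPIES uints
{
    uint lid = get_local_id(0);
    uint lsz = get_local_size(0);

    // Each work item updates one copy, chosen from the low bits of its
    // local id so that neighbouring items hit different copies.
    uint copy = (lid % NUM_COPIES) * HIST_SIZE;

    for (uint i = lid; i < HIST_SIZE * NUM_COPIES; i += lsz)
        bins[i] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    uint v = data[get_global_id(0)] & (HIST_SIZE - 1);
    atomic_inc(&bins[copy + v]);   // fewer collisions per copy
    barrier(CLK_LOCAL_MEM_FENCE);

    // Reduce the copies back into the single global histogram.
    for (uint i = lid; i < HIST_SIZE; i += lsz) {
        uint sum = 0;
        for (uint c = 0; c < NUM_COPIES; c++)
            sum += bins[c * HIST_SIZE + i];
        atomic_add(&hist[i], sum);
    }
}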

Thanks, that's very informative. I think your last two responses have both answered my question and provided additional insight that will help me get the most out of my optimizations, especially regarding how the workgroups are processed. If there are any sources of information relating to the topics discussed, I'm sure others as well as myself would find them very useful to reference.

Thanks to yurtesen too for helping me bring clarity and understanding to my request for assistance.


I think I covered it in the book; I did try to give a reasonable description of how OpenCL maps to AMD's GPUs:

http://www.amazon.com/Heterogeneous-Computing-OpenCL-Second-Edition/dp/0124058949/ref=sr_1_4?ie=UTF8...

Beyond that I'm not sure. Most content I know of is in PowerPoint form, which isn't so useful if you don't hear the associated talk (particularly my talks, as I keep my slides text-light, so the diagrams may not mean much on their own).


Thanks. I looked at that book but went with another that I thought would bring me up to speed faster. It appears that while I got up to speed quickly in some areas, I outgrew the book faster than expected and perhaps missed out on some of the fundamentals that I'm asking about now. Typical. I'll look at ordering that book later today.
