cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

nibal
Challenger

LDS dynamic stack allocation

Let's assume we have a 1D kernel with 64 global & local work size (1 workgroup). It goes like this:

__kernel void test

(

args...

)

{

     __local float avg[4096];

avg could be allocated by each item in the group. That would result in 64 different allocations, easily overrunning the 32 K LDS limit. Furthermore wouldn't be common to the group.

So it's not allocated by the group. Who allocates it? If only first access, item[0] does, how do the rest of items know where in LDS avg is?

In C, this scoping is local and avg would be available only to this item and only within the test scope. That is not the case in ocl, though, if it is shared by the group.

Unfortunately CodeXL won't display __local variables, and printf is very problematic form within the kernel:-(

Does each item have its own stack? How is it implemented, SGPRS?

0 Likes
1 Solution

Oh indeed 'avg', your declaration.

So localSize=64, globalSize=64 -> Only on allocation (4K*4 bytes), workitem0..workitem63 accesses the same avg[] values.

localSize=128, globalSize=512 -> 4 allocations of avg[4k]:

wi0..wi127 -> first allocation

...

wi384..wi511 -> fourth allocation

total LDS allocation = 4k*4*4 = 64KB lds memory in total. A side note: 64K is the maximum that can be allocated on a single GCN ComputeUnit, so this amount is work fit well on 1 CU.

LDS is mainly for communicating between neighboring workitems. And it is also good for small lookup tables (as well as the L1 cache).

View solution in original post

0 Likes
7 Replies
realhet
Miniboss

LDS belongs to a specific workgroup (1..3 wavefronts).

It is allocated right before the kernel is started. And deallocated after the kernel finishes.

The allocation's properties (memory offset inside that 64K memory, and size) is stored in hardware registers, those you can't access from opencl.

The hardware ensures that you can't write into unallocated areas.

Scoping: While using LDS this is only a C language feature to make your code prettier. The largest required LDS memory is allocated outside the kernel. It will be large enough to ensure all the scope's LDS needs.

LDS is a small amount of memory, refressing it from L2 cache is not a big deal. Unless your kernel consists only of a few instructions, that is comparable to few kilobytes of LDS initialization.

Thanks for your fast reply and clarifications.

I still have some questions, though.

I imagine that the __local declaration reserves avg in the initially allocated LDS. 64 items in my example use it. Do they get 64 different avg?

If there is only 1 common avg, who allocates it and how rest know where to find it?

Who can access the avg[10] in my example?

0 Likes

What do you mean 'avg'? (I cannot think anything else than 'average' which is obviously wrong at the given context.)

0 Likes

By avg[10] I mean the __local array in my initial post:

__local avg[4096]

No special meaning, could have named it anything.

0 Likes

Oh indeed 'avg', your declaration.

So localSize=64, globalSize=64 -> Only on allocation (4K*4 bytes), workitem0..workitem63 accesses the same avg[] values.

localSize=128, globalSize=512 -> 4 allocations of avg[4k]:

wi0..wi127 -> first allocation

...

wi384..wi511 -> fourth allocation

total LDS allocation = 4k*4*4 = 64KB lds memory in total. A side note: 64K is the maximum that can be allocated on a single GCN ComputeUnit, so this amount is work fit well on 1 CU.

LDS is mainly for communicating between neighboring workitems. And it is also good for small lookup tables (as well as the L1 cache).

0 Likes

So,  on first request, LDS mechanism allocates array and returns pointer to it. On subsequent requests of the same workgroup, it just returns the pointer to it, since it was already allocated.

Tyvm;-)

0 Likes

To be precise, there are no subsequent requests.

Whenever a workgroup is assigned to a compute unit, the hardware will wait until enough LDS memori is present on that unit, and then allocates it. Launches the program on that particular workgroup and after it finishes, it releases the LDS memory.

So when your kernel starts, you have to assume that there is memory garbage in the local variables. No persistency at all. GDS is persistent, LDS is only for the life time of a workgroup.

0 Likes