cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

nibal
Challenger

LDS dynamic stack allocation

Jump to solution

Let's assume we have a 1D kernel with 64 global & local work size (1 workgroup). It goes like this:

__kernel void test

(

args...

)

{

     __local float avg[4096];

avg could be allocated by each item in the group. That would result in 64 different allocations, easily overrunning the 32 K LDS limit. Furthermore wouldn't be common to the group.

So it's not allocated by the group. Who allocates it? If only first access, item[0] does, how do the rest of items know where in LDS avg is?

In C, this scoping is local and avg would be available only to this item and only within the test scope. That is not the case in ocl, though, if it is shared by the group.

Unfortunately CodeXL won't display __local variables, and printf is very problematic form within the kernel:-(

Does each item have its own stack? How is it implemented, SGPRS?

Tags (2)
0 Likes
1 Solution

Accepted Solutions
realhet
Miniboss

Re: LDS dynamic stack allocation

Jump to solution

Oh indeed 'avg', your declaration.

So localSize=64, globalSize=64 -> Only on allocation (4K*4 bytes), workitem0..workitem63 accesses the same avg[] values.

localSize=128, globalSize=512 -> 4 allocations of avg[4k]:

wi0..wi127 -> first allocation

...

wi384..wi511 -> fourth allocation

total LDS allocation = 4k*4*4 = 64KB lds memory in total. A side note: 64K is the maximum that can be allocated on a single GCN ComputeUnit, so this amount is work fit well on 1 CU.

LDS is mainly for communicating between neighboring workitems. And it is also good for small lookup tables (as well as the L1 cache).

View solution in original post

0 Likes
7 Replies
realhet
Miniboss

Re: LDS dynamic stack allocation

Jump to solution

LDS belongs to a specific workgroup (1..3 wavefronts).

It is allocated right before the kernel is started. And deallocated after the kernel finishes.

The allocation's properties (memory offset inside that 64K memory, and size) is stored in hardware registers, those you can't access from opencl.

The hardware ensures that you can't write into unallocated areas.

Scoping: While using LDS this is only a C language feature to make your code prettier. The largest required LDS memory is allocated outside the kernel. It will be large enough to ensure all the scope's LDS needs.

LDS is a small amount of memory, refressing it from L2 cache is not a big deal. Unless your kernel consists only of a few instructions, that is comparable to few kilobytes of LDS initialization.

nibal
Challenger

Re: LDS dynamic stack allocation

Jump to solution

Thanks for your fast reply and clarifications.

I still have some questions, though.

I imagine that the __local declaration reserves avg in the initially allocated LDS. 64 items in my example use it. Do they get 64 different avg?

If there is only 1 common avg, who allocates it and how rest know where to find it?

Who can access the avg[10] in my example?

0 Likes
realhet
Miniboss

Re: LDS dynamic stack allocation

Jump to solution

What do you mean 'avg'? (I cannot think anything else than 'average' which is obviously wrong at the given context.)

0 Likes
nibal
Challenger

Re: LDS dynamic stack allocation

Jump to solution

By avg[10] I mean the __local array in my initial post:

__local avg[4096]

No special meaning, could have named it anything.

0 Likes
realhet
Miniboss

Re: LDS dynamic stack allocation

Jump to solution

Oh indeed 'avg', your declaration.

So localSize=64, globalSize=64 -> Only on allocation (4K*4 bytes), workitem0..workitem63 accesses the same avg[] values.

localSize=128, globalSize=512 -> 4 allocations of avg[4k]:

wi0..wi127 -> first allocation

...

wi384..wi511 -> fourth allocation

total LDS allocation = 4k*4*4 = 64KB lds memory in total. A side note: 64K is the maximum that can be allocated on a single GCN ComputeUnit, so this amount is work fit well on 1 CU.

LDS is mainly for communicating between neighboring workitems. And it is also good for small lookup tables (as well as the L1 cache).

0 Likes
nibal
Challenger

Re: LDS dynamic stack allocation

Jump to solution

So,  on first request, LDS mechanism allocates array and returns pointer to it. On subsequent requests of the same workgroup, it just returns the pointer to it, since it was already allocated.

Tyvm;-)

0 Likes
realhet
Miniboss

Re: LDS dynamic stack allocation

Jump to solution

To be precise, there are no subsequent requests.

Whenever a workgroup is assigned to a compute unit, the hardware will wait until enough LDS memori is present on that unit, and then allocates it. Launches the program on that particular workgroup and after it finishes, it releases the LDS memory.

So when your kernel starts, you have to assume that there is memory garbage in the local variables. No persistency at all. GDS is persistent, LDS is only for the life time of a workgroup.

0 Likes