OpenCL

lolliedieb · ‎05-23-2020

Well the question in the title already hits it.

I got a rather simple kernel, which uses 20 vgpr and the complete 32 kByte of shared memory. So all fine for running 2x 1024 threads per work group. So fine so far.

But: in the moment I later on read stuff from the same addresses again for a 2nd pass over my data structure I can observe vgpr usage going up - and the reason is not data is stored there, but the addresses to be load from are cached.

Further I tried to extend my shared memory by parking some of the data in registers - but also then when swaping memory multiple times between vreg and shared memory the addresses used get cached in vgpr too. Both this effects cause my used vgpr count to explode far behind the 32 registers I can use without loosing occupancy.

Now my question is: is there a way to prevent the compiler to store this addresses? Because the calculation for them is a very easy formula done with just a few integer ops depending on get_group_id(0) and get_local_id(0) and should be way cheap enough to not hurt too compared with the lower occupancy. Sadly I can not use inline assembly since I need to be able to compile this code with the just in time compiler on PAL platform and Windows for Polaris and Vega architecture - and that drivers still fail to provide this feature ROCm has since year now

Thanks

dipak · ‎05-26-2020

Thank you for the above query. Could you please share the kernel code?

Thanks.

lolliedieb · ‎05-26-2020

Unfortunately I can not share the full code, but to outline it:

My kernel needs to process two global arrays and one of them twice. Each thread is known to load 4 uint4 from each array.

So the pattern is

uint4 load0, load1, load2, load3;
load0 = buffer1[grpOfs + lId];
load1 = buffer1[grpOfs + 1024 + lId];
load2 = buffer1[grpOfs + 2048 + lId];
load3 = buffer1[grpOfs + 3072 + lId];
...
(do something with this data)
load0 = buffer2[grpOfs + lId];
load1 = buffer2[grpOfs + 1024 + lId];
load2 = buffer2[grpOfs + 2048 + lId];
load3 = buffer2[grpOfs + 3072 + lId];
...
(do something else with this data)
load0 = buffer1[grpOfs + lId];
load1 = buffer1[grpOfs + 1024 + lId];
load2 = buffer1[grpOfs + 2048 + lId];
load3 = buffer1[grpOfs + 3072 + lId];
...
(2nd pass over first data set)

In this situation the compiler sores all the addresses involved in the buffer1 loads - so 4x ulong since I am using 64 bit addressing. While now the buffer2 elements get processed the reg count gets increased so high that it hurts the occupancy.

Now if I replace the "buffer1" by some new pointer - pointing to same address but not obvious enough for the compiler - or if I do a modification to "lId" that does nothing (but what compiler can not know since I add some zero that was stored in a buffer), then the register pressure goes down by 7 registers - since the addresses get recomputed. The performance impact of this is very low. Its rather the other way around: the kernel gets much quicker, because the occupancy increases.

So my question is: can I force the compiler somehow more elegant to not store addresses when getting under register pressure / exceeding a set register limit.

dipak · ‎05-27-2020

Thank you for sharing the above information and providing the code snippet. I'll check with the compiler team if they have any suggestion in this regard.

Thanks.

dipak · ‎06-01-2020

As I've been informed, there is no direct way to force the compiler to restrict the above register usage.

Thanks.

OpenCL

Is there an elegant way to force recalculation (of values or addresses)