Hey there,
I have code that needs to share data among threads in blocks of 4, i.e. thread i needs to access values from threads (i & 0xFC) + 0 ... (i & 0xFC) + 3.
When writing such code in GCN / RDNA assembler I would normally use cross-lane operations (see the sketch below for what I mean), but sadly this code has to run on the amdgpu-pro drivers. Their compiler still does not allow inline asm, nor are these operations exposed as extensions... Nor do ROCm kernels (ROCm does allow inline asm) run under amdgpu-pro, nor is there a conversion tool to turn ROCm kernels into ones the other drivers can load... (Nor does ROCm support the Navi architecture a full year after launch!!!)
I know this is a bit of a rant, but I want to point out how painful it is to write kernels for AMD cards, and I wonder why something as simple as allowing inline asm has not found its way into the OpenCL compilers that ship with the mainstream drivers. You have great hardware, but the software support is horrible.
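Just to illustrate what I mean by cross-lane operations: under ROCm, where clang exposes the AMDGPU cross-lane instructions as builtins (e.g. __builtin_amdgcn_ds_bpermute), the exchange within a block of 4 could look roughly like the sketch below. This is only a sketch of what I would like to write - it does not compile on the amdgpu-pro OpenCL compiler, which is exactly my point.

// ROCm-only sketch: ds_bpermute_b32 lets every lane read a value from another
// lane of the same wavefront. The builtin takes a BYTE address, i.e. lane * 4.
inline ulong read_lane_u64(ulong v, uint src_lane)
{
    uint2 w = as_uint2(v);              // split the 64-bit value into two dwords
    int addr = (int)(src_lane * 4);     // byte address of the source lane
    w.x = as_uint(__builtin_amdgcn_ds_bpermute(addr, as_int(w.x)));
    w.y = as_uint(__builtin_amdgcn_ds_bpermute(addr, as_int(w.y)));
    return as_ulong(w);
}

// usage (lane = lane index within the wavefront, base = lane & ~3u):
// sum = dataIn + read_lane_u64(dataIn, base + 1)
//              + read_lane_u64(dataIn, base + 2)
//              + read_lane_u64(dataIn, base + 3);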
Anyways... back to my topic:
You see, the only way for me to work around this is to go through the LDS. My kernel has a local array:
__local ulong share[wgSize];
Since the threads that share data form a block of 4, they are always within the same wavefront, independent of wave64 or wave32 mode. So I assumed they run in lock-step and expected the following pattern to give correct results:
Code A:
if (threads_active) {
    // three of the four threads write their value to LDS
    if ((lId & 0x3) > 0) share[lId] = dataIn;
    // thread 0 of each block of 4 reads the other three values
    if ((lId & 0x3) == 0) sum = dataIn + share[lId+1] + share[lId+2] + share[lId+3];
}
Well, I can see that compiler optimizations might break this, because the compiler may reorder the LDS accesses.
So I also tried a more conservative variant:
Code B:
if (threads_active) {
    // three of the four threads write their value to LDS
    if ((lId & 0x3) > 0) share[lId] = dataIn;
    mem_fence(CLK_LOCAL_MEM_FENCE);
    // thread 0 of each block of 4 reads the other three values
    if ((lId & 0x3) == 0) sum = dataIn + share[lId+1] + share[lId+2] + share[lId+3];
}
And there is a third version, which is slower of course, but should be overly cautious:
Code C:
if (threads_active) {
    // three of the four threads write their value to LDS
    if ((lId & 0x3) > 0) share[lId] = dataIn;
}
barrier(CLK_LOCAL_MEM_FENCE);
if (threads_active) {
    // thread 0 of each block of 4 reads the other three values
    if ((lId & 0x3) == 0) sum = dataIn + share[lId+1] + share[lId+2] + share[lId+3];
}
Now my observations:
On Polaris (580) and Vega (including VII) GPUs all three code variants A, B & C produce valid results as expected.
On my RX 5700 only variant C gives the correct calculation results, while A & B show diverging results, which I did not expect since, as mentioned, lock-step execution should avoid these problems. The tested driver is amdgpu-pro 20.10 with its included compiler.
My questions:
1) Why does code B not work on Navi?
2) When will you finally allow cross-lane ops or inline assembly outside of ROCm? Sorry to say so, but that is more than overdue!!!