Hey there,
I have code that needs to share data among threads in blocks of 4, i.e. thread i needs to access values from threads (i & 0xFC) + 0 ... (i & 0xFC) + 3.
When writing such code in GCN / RDNA assembler I would normally use cross-lane operations (see the sketch below for what I mean), but sadly this code has to run on the amdgpu-pro drivers. Their compiler still does not allow inline asm, nor are these operations exposed as extensions... Nor do ROCm kernels (ROCm does allow inline asm) run under amdgpu-pro, nor is there a conversion tool to turn ROCm kernels into ones the other drivers can load... (Nor does ROCm support the Navi architecture a full year after launch!!!)
I know this is a bit of a rant, but I want to point out how painful it is to write kernels for AMD cards, and I wonder why something as simple as allowing inline asm has not found its way into the OpenCL compilers that ship with the mainstream drivers. You have great hardware, but the software support is horrible.
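Just to illustrate what I mean by cross-lane operations: under ROCm, where clang exposes the AMDGPU cross-lane instructions as builtins (e.g. __builtin_amdgcn_ds_bpermute), the exchange within a block of 4 could look roughly like the sketch below. This is only a sketch of what I would like to write - it does not compile on the amdgpu-pro OpenCL compiler, which is exactly my point.

// ROCm-only sketch: ds_bpermute_b32 lets every lane read a value from another
// lane of the same wavefront. The builtin takes a BYTE address, i.e. lane * 4.
inline ulong read_lane_u64(ulong v, uint src_lane)
{
    uint2 w = as_uint2(v);              // split the 64-bit value into two dwords
    int addr = (int)(src_lane * 4);     // byte address of the source lane
    w.x = as_uint(__builtin_amdgcn_ds_bpermute(addr, as_int(w.x)));
    w.y = as_uint(__builtin_amdgcn_ds_bpermute(addr, as_int(w.y)));
    return as_ulong(w);
}

// usage (lane = lane index within the wavefront, base = lane & ~3u):
// sum = dataIn + read_lane_u64(dataIn, base + 1)
//              + read_lane_u64(dataIn, base + 2)
//              + read_lane_u64(dataIn, base + 3);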
Anyways... back to my topic:
You see, the only way for me to work around this is to go through the LDS. My kernel has a local array:
__local ulong share[wgSize];
Since the threads that share data form a block of 4, they are always within the same wavefront, independent of wave64 or wave32 mode. So I assumed they run in lock-step and expected the following pattern to give correct results:
Code A:
if (threads_active) {
    // three of the four threads write their value to LDS
    if ((lId & 0x3) > 0) share[lId] = dataIn;
    // thread 0 of each block of 4 reads the other three values
    if ((lId & 0x3) == 0) sum = dataIn + share[lId+1] + share[lId+2] + share[lId+3];
}
Well, I can see that compiler optimizations might break this, because the compiler may reorder the LDS accesses.
So I also tried a more conservative variant:
Code B:
if (threads_active) {
    // three of the four threads write their value to LDS
    if ((lId & 0x3) > 0) share[lId] = dataIn;
    mem_fence(CLK_LOCAL_MEM_FENCE);
    // thread 0 of each block of 4 reads the other three values
    if ((lId & 0x3) == 0) sum = dataIn + share[lId+1] + share[lId+2] + share[lId+3];
}
And there is a third version, which is slower of course, but should be overly cautious:
Code C:
if (threads_active) {
    // three of the four threads write their value to LDS
    if ((lId & 0x3) > 0) share[lId] = dataIn;
}
barrier(CLK_LOCAL_MEM_FENCE);
if (threads_active) {
    // thread 0 of each block of 4 reads the other three values
    if ((lId & 0x3) == 0) sum = dataIn + share[lId+1] + share[lId+2] + share[lId+3];
}
Now my observations:
On Polaris (580) and Vega (including VII) GPUs all three code variants A, B & C produce valid results as expected.
On my RX 5700 only variant C gives the correct calculation results, while A & B show diverging results, which I did not expect since, as mentioned, lock-step execution should avoid these problems. The tested driver is amdgpu-pro 20.10 with its included compiler.
My questions:
1) Why does code B not work on Navi?
2) When will you finally allow cross-lane ops or inline assembly outside of ROCm? Sorry to say so, but that is more than overdue!!!