OpenCL

Adept II

Missing lock step behaviour of Navi GPUs?

Hey there,

I have code that needs to share data among threads in blocks of 4, so thread i needs to access the values of threads (i & 0xFC) + 0 ... (i & 0xFC) + 3.
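As a small illustration of that indexing, here is a sketch in plain C (the helper name `quad_base` is made up for this example and is not part of the kernel). It assumes local IDs are unsigned. Note that `0xFC` keeps only bits 2 to 7, so it agrees with the more general mask `~3u` only for IDs below 256.

```c
/* Index of the first thread in the block of 4 that contains thread i.
   Hypothetical helper, for illustration only. */
static unsigned int quad_base(unsigned int i)
{
    /* Clears the two low bits; equals (i & 0xFC) as long as i < 256. */
    return i & ~3u;
}
```

Thread i then reads from `quad_base(i) + 0` up to `quad_base(i) + 3`.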

When writing such code in GCN / RDNA assembler I would usually use cross-lane operations, but sadly my code needs to run on the amdgpu-pro drivers. Their compiler still does not allow inline asm, nor are these operations exposed as extensions. Nor do ROCm kernels (ROCm does allow inline asm) run under amdgpu-pro, nor is there a conversion tool to turn ROCm kernels into ones that the other drivers can load. (Nor does ROCm support the Navi architecture 1 year after launch!)

I know this is a bit of a rant, but I want to point out how painful it is to write kernels for AMD cards, and I wonder why something as simple as inline asm does not find its way into the OpenCL compilers included in the mainstream drivers. You have great hardware, but the software support is horrible.

Anyways... back to my topic:
You see, the only way for me to get around this is to use the LDS. My kernel has a local array

__local ulong share[wgSize];

Since the threads that share data are in a block of 4, they are always within the same wavefront, regardless of wave64 or wave32 mode. So I assumed they run in lockstep and expected the following pattern to give correct results:
Code A:


if (threads_active) {
    // Threads 1..3 of each block of 4 write to LDS
    if ((lId & 0x3) > 0) share[lId] = dataIn;

    // Thread 0 of each block of 4 reads from LDS
    if ((lId & 0x3) == 0) sum = dataIn + share[lId+1] + share[lId+2] + share[lId+3];
}


Well, I can see that optimizations might break this, because the compiler may reorder the accesses.
So I also wrote a more conservative variant:

Code B:


if (threads_active) {
    // Threads 1..3 of each block of 4 write to LDS
    if ((lId & 0x3) > 0) share[lId] = dataIn;

    mem_fence(CLK_LOCAL_MEM_FENCE);

    // Thread 0 of each block of 4 reads from LDS
    if ((lId & 0x3) == 0) sum = dataIn + share[lId+1] + share[lId+2] + share[lId+3];
}

And then there is a third version, which is slower of course and should be overly careful:

Code C:


if (threads_active) {
    // Threads 1..3 of each block of 4 write to LDS
    if ((lId & 0x3) > 0) share[lId] = dataIn;
}

barrier(CLK_LOCAL_MEM_FENCE);

if (threads_active) {
    // Thread 0 of each block of 4 reads from LDS
    if ((lId & 0x3) == 0) sum = dataIn + share[lId+1] + share[lId+2] + share[lId+3];
}

Now my observation:

On Polaris (RX 580) and Vega (including Radeon VII) GPUs all three code variants A, B and C produce valid results, as expected.
On my RX 5700 only variant C gives correct results, while A and B produce diverging results. I did not expect that since, as mentioned, lockstep execution should avoid the problem. The tested driver is amdgpu-pro 20.10 with its included compiler.
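For reference, the expected values for the simplified example can be checked on the host with something like the following plain C sketch. It assumes all threads are active and that the number of inputs is a multiple of 4; the function name and the output layout (one sum per block of 4) are made up for this illustration and are not taken from my real kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-side reference: for each block of 4 inputs, compute the sum that
   thread 0 of that block is expected to produce. out[b] receives the sum
   of block b. uint64_t mirrors the 64-bit OpenCL ulong type. */
static void reference_quad_sums(const uint64_t *data_in, size_t n,
                                uint64_t *out)
{
    for (size_t base = 0; base + 3 < n; base += 4) {
        out[base / 4] = data_in[base] + data_in[base + 1]
                      + data_in[base + 2] + data_in[base + 3];
    }
}
```

Comparing the kernel output against such a reference is one way to tell which variants diverge.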

My questions:
1) Why does code B not work on Navi?
2) When will you finally allow cross-lane ops or inline assembly outside of ROCm? Sorry to say so, but that is more than overdue!

2 Replies
Adept II

Re: Missing lock step behaviour of Navi GPUs?

Just to add, for the sake of completeness: of course my kernel is more complex than this overall, and the addition was just an example. It also happens that all threads in the group of 4 push data to their slot in the shared array and then immediately start calculating a common value from the values just pushed. But the pattern is always the same as in the example.

Staff

Re: Missing lock step behaviour of Navi GPUs?

Why does code B not work on Navi?

We use a new compiler (LC) for Navi. The compiler team suspects that LC does a lot more optimization of the IR than the HSAIL path does, which might be the reason behind this difference.

Below is their feedback regarding the above code snippet. 

"In code A, there is no reason for the compiler to not reorder the two if statements since the compiler can probably prove they are disjoint.

In code B, the OpenCL 1.2 mem_fence() orders the loads & stores of a single work-item. And again, since the loads and stores are provably disjoint, there is no way that the thread can observe that the fence did not do its job correctly.

The only correct OpenCL 1.2 code here is code C."
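To make the reordering argument concrete, here is a toy model in plain C. It is only a sketch, not OpenCL and not a description of how the hardware or LC actually schedules anything. It assumes one block of 4 lanes running in perfect lockstep and compares the source order of the two if statements with the swapped order. The names `quad_sum_model` and `swapped` are hypothetical.

```c
#include <stdint.h>

#define QUAD 4

/* Toy model of one block of 4 lanes in perfect lockstep: every lane finishes
   one statement before any lane starts the next. swapped = 0 runs the two
   if statements in source order, swapped = 1 runs them in reordered form. */
static uint64_t quad_sum_model(const uint64_t data_in[QUAD],
                               uint64_t share[QUAD], int swapped)
{
    uint64_t sum = 0;
    for (int step = 0; step < 2; ++step) {
        int store_step = swapped ? (step == 1) : (step == 0);
        for (int lane = 0; lane < QUAD; ++lane) {
            if (store_step) {
                /* lanes 1..3 write to the shared array */
                if ((lane & 0x3) > 0) share[lane] = data_in[lane];
            } else {
                /* lane 0 reads what the other lanes wrote */
                if ((lane & 0x3) == 0)
                    sum = data_in[lane] + share[lane + 1]
                        + share[lane + 2] + share[lane + 3];
            }
        }
    }
    return sum;
}
```

With inputs {1, 2, 3, 4} and a zeroed shared array, the source order yields 10, while the swapped order yields 1 because lane 0 reads before the other lanes have written. No single lane executes both a store and a load, so from the point of view of one work-item both orders look the same, which is why only a barrier (code C) pins the order down.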

When will you finally allow cross lane ops or inline assembly outside of ROCm?

Sorry, we can't provide any timeline at this moment.

One point to note, though: we do not have Navi support on the HSAIL path, and there is no plan to add the DPP feature to that stack.

Thanks.
