Archives Discussions

sh2 · ‎05-08-2012

The following code seems to be miscompiled:

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable
int f(int a) {
    return as_int(as_float(a)+1.0f);
}
__kernel
void test(__global int * ptr) {
    int value;
    int new_value;
    do {
        mem_fence(CLK_GLOBAL_MEM_FENCE);
        value = *ptr;
        new_value=f(value);
    } while(value != atom_cmpxchg(ptr, value, new_value));
}

shader main

asic(SI_ASIC)

type(CS)

s_buffer_load_dword s0, s[8:11], 0x00 // 00000000: C2000900

label_0001:

s_waitcnt expcnt(0) // 00000004: BF8C1F0F

s_waitcnt lgkmcnt(0) // 00000008: BF8C007F

v_mov_b32 v0, s0 // 0000000C: 7E000200

tbuffer_load_format_x v5, v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT] // 00000010: EBA01000 80010500

s_waitcnt vmcnt(0) // 00000018: BF8C1F70

v_add_f32 v4, 1.0, v5 // 0000001C: 06080AF2

v_mov_b32 v2, v4 // 00000020: 7E040304

v_mov_b32 v3, v5 // 00000024: 7E060305

buffer_atomic_cmpswap v[2:3], v0, s[4:7], 0 offen glc // 00000028: E0C45000 80010200

s_waitcnt vmcnt(0) // 00000030: BF8C1F70

v_cmp_eq_i32 vcc, v5, v2 // 00000034: 7D040505

s_cbranch_vccz label_0001 // 00000038: BF86FFF2

s_endpgm // 0000003C: BF810000

end

The compiler doesn't update execution mask and atom_cmpxchg executing more times than it should.

LeeHowes · ‎06-04-2012

Sorry about the slow responses. I'm just too busy to actually verify this myself (I don't work with OpenCL day to day). However I have entered it into the system as a bug report against the shader compiler. They will hopefully verify against the latest SC and if there is a bug fix it.

Lee

View solution in original post

LeeHowes · ‎05-08-2012

Is it actually compiling wrongly or is it just caching the read of value and hence always giving you the wrong behaviour? Both of the reads are not globally coherent, so they may read from L1. I don't know what the behaviour of the atomic operation is with respect to L1, but at the very least if one lane updates it atomically that is no guarantee that other lanes will pick that value up until after their atomic read.

Does it fix the behaviour to make value = *ptr an atomic read? Replace that line with "value = atomic_add(ptr, 0)".

It does look to not be updating the masks but it may be that that's a side effect of assumptions it knows it can make about the read operation.

sh2 · ‎05-08-2012

>Does it fix the behaviour to make value = *ptr an atomic read? Replace that line with "value = atomic_add(ptr, 0)".

It doesn't change anything. So I assume it isn't coherence issue.

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable  
#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable  
int f(int a) {  
    return as_int(as_float(a)+1.0f);  
}  
__kernel  
void test(__global int * ptr) {  
   int value;  
   int new_value;  
   do {  
       mem_fence(CLK_GLOBAL_MEM_FENCE);  //Not needed actually
       value = atom_add(ptr, 0);              
       new_value=f(value);  
   } while(value != atom_cmpxchg(ptr, value, new_value));  
}

shader main

asic(SI_ASIC)

type(CS)

s_buffer_load_dword s0, s[8:11], 0x00 // 00000000: C2000900

label_0001:

s_waitcnt expcnt(0) // 00000004: BF8C1F0F

s_waitcnt lgkmcnt(0) // 00000008: BF8C007F

v_mov_b32 v0, s0 // 0000000C: 7E000200

v_mov_b32 v1, 0 // 00000010: 7E020280

buffer_atomic_add v1, v0, s[4:7], 0 offen glc // 00000014: E0C85000 80010100

s_waitcnt vmcnt(0) // 0000001C: BF8C1F70

v_add_f32 v2, 1.0, v1 // 00000020: 060402F2

v_mov_b32 v3, v1 // 00000024: 7E060301

buffer_atomic_cmpswap v[2:3], v0, s[4:7], 0 offen glc // 00000028: E0C45000 80010200

s_waitcnt vmcnt(0) // 00000030: BF8C1F70

v_cmp_eq_i32 vcc, v1, v2 // 00000034: 7D040501

s_cbranch_vccz label_0001 // 00000038: BF86FFF2

s_endpgm // 0000003C: BF810000

end

sh2 · ‎05-10-2012

Atomic_add doesn't change anything.

Could you confirm(or reject) the bug?

drallan · ‎05-21-2012

sh,

I agree it does look like a bug where the Tahiti compiler should be using the exec mask but doesn't. This is easy to test because for Cayman the compiler does use the exec mask and the loop works correctly, looping 64 times (for a single wave), while on Tahiti code fails to loop. The code should execute the same on both platforms. For simplicity, I replaced the f(a) function and the code becomes

__kernel void test(__global int * ptr) {

int value; int new_value;

do{

value = atomic_add(ptr, 0);

new_value=value+1;

} while(value != atom_cmpxchg(ptr, value, new_value));

}

During each loop, the first thread to execute atomic_cmpxchg() changes the value of *ptr causing subsequent threads to fail atomic_cmpxchg() and to loop. The Cayman code uses the exec mask to track which threads are active/looping, and loops until all 64 bits of exec are set to 0. The Tahiti code can only branch if vcc == 0 , which doesn't happen so the code never loops.

Another test. You can make the code work correctly on Tahiti by forcing the compiler to use the exec mask, by placing this bogus line just before the do loop,

ptr=ptr+(get_local_id(0)&0xff000000);

This prevents the compiler from knowing the value of ptr so it will use the exec mask. In reality, ptr is not changed.

drallan

//Disassembly without ptr=ptr+(get_local_id(0)&0xff000000);

s_buffer_load_dword s0, s[8:11], 0x00

label_0001:

s_waitcnt 0x0000

v_mov_b32 v0, s0

v_mov_b32 v1, 0

buffer_atomic_add v1, v0, s[4:7], 0 offen glc

s_waitcnt vmcnt(0)

v_add_i32 v2, vcc, 1, v1

v_mov_b32 v3, v1

buffer_atomic_cmpswap v[2:3], v0, s[4:7], 0 offen glc

s_waitcnt vmcnt(0)

v_cmp_eq_i32 vcc, v1, v2

s_cbranch_vccz label_0001

s_branch label_0010

s_branch label_0001

label_0010:

s_endpgm

//Disassembly with ptr=ptr+(get_local_id(0)&0xff000000);

label_000E:

s_waitcnt 0x0000

v_mov_b32 v1, 0

buffer_atomic_add v1, v0, s[4:7], 0 offen glc

s_waitcnt vmcnt(0)

v_add_i32 v2, vcc, 1, v1

v_mov_b32 v3, v1

buffer_atomic_cmpswap v[2:3], v0, s[4:7], 0 offen glc

s_waitcnt vmcnt(0)

v_cmp_eq_i32 s[8:9], v1, v2

s_mov_b64 s[10:11], exec

s_and_b64 exec, s[10:11], s[8:9]

s_andn2_b64 s[2:3], s[2:3], exec

s_cbranch_scc0 label_0021

s_mov_b64 exec, s[10:11]

s_and_b64 exec, exec, s[2:3]

s_branch label_000E

label_0021:

s_mov_b64 exec, s[0:1]

s_endpgm

sh2 · ‎05-21-2012

Thanks for the workaround. Let's hope this will be fixed soon.

LeeHowes · ‎06-04-2012

Sorry about the slow responses. I'm just too busy to actually verify this myself (I don't work with OpenCL day to day). However I have entered it into the system as a bug report against the shader compiler. They will hopefully verify against the latest SC and if there is a bug fix it.

Lee

Archives Discussions

Possible bug with atom_cmpxchg