
Deadlock on 7970. Global memory consistency problem?

Question asked by arsenm on Apr 3, 2012
Latest reply on Apr 6, 2012 by notzed

I'm having a problem where, if I have more than one workgroup active and multiple work-items attempt to lock the same (global int) address, I get a deadlock. I can let it make millions of attempts, and none of them ever succeed. The same code works fine on Cypress and Cayman, as well as on Nvidia GT200 and Fermi. I'm wondering what might have changed, especially since there doesn't seem to be hardware documentation available yet.

 

I'm wondering if somehow a global memory consistency issue is showing up. I know that technically the OpenCL specification only guarantees global memory consistency between work-items in the same workgroup, but I am already relying on several AMD and Nvidia GPU hardware details for maximum performance.

 

The code looks something like this (inside a loop that retries on lock failures):

 

if (ch != lockvalue)
{
    if (ch == atom_cmpxchg(&something[locked], ch, lockvalue))
    {
        // something to determine value
        something[locked] = value;
        mem_fence(CLK_GLOBAL_MEM_FENCE);
    }
}

 
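For reference, here is a self-contained, hypothetical reconstruction of roughly how that snippet is used (the kernel name, the -1 sentinel and the value = ch + 1 computation are just stand-ins for illustration; the real kernel is larger):

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

__kernel void locked_update(volatile __global int *something, int locked)
{
    const int lockvalue = -1;   // stand-in sentinel meaning "slot is currently locked"
    bool done = false;

    while (!done)
    {
        int ch = something[locked];   // plain read of the lock word

        if (ch != lockvalue)
        {
            // Swap in the sentinel; success means this work-item saw the
            // unlocked value and now owns the slot.
            if (ch == atom_cmpxchg(&something[locked], ch, lockvalue))
            {
                int value = ch + 1;          // stand-in for the real computation
                something[locked] = value;   // storing a non-sentinel value also releases the lock
                mem_fence(CLK_GLOBAL_MEM_FENCE);
                done = true;
            }
        }
    }
}
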

In the IL, the mem_fence compiles to a fence_memory instruction, which the IL documentation describes as follows:

_memory - global/scatter memory fence. It ensures that:

- no memory import/export instructions can be re-ordered or moved across this fence instruction.

- all memory export instructions are complete (the data has been written to physical memory, not in the cache) and is visible to other work-items.

 

I was wondering if there is some new global memory caching behaviour across compute units, but my reading of the fence_memory description makes me think that would not be the problem, since the fence is supposed to guarantee the data is not sitting in a cache. It would be nice if this documentation snippet explicitly said whether "visible to other work-items" means only work-items in the same workgroup or work-items across the device.

 

The fence looks like it compiles in the Tahiti ISA to one of:

s_waitcnt     vmcnt(0) & expcnt(0)
s_waitcnt     vmcnt(0)
s_waitcnt     expcnt(0)

 

But since there isn't hardware documentation yet, I'm only guessing that these mean "wait until the outstanding memory accesses complete", with expcnt being for writes and vmcnt for reads.
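
In case it really is a non-coherent cache in front of global memory on each compute unit, one thing I was thinking of trying (purely a guess on my part, using the same names as the snippet above) is doing the read of the lock word, and the releasing store, through atomics so they can't be satisfied from a stale cache:

// Guesswork: force the read of the lock word to go to memory by doing it
// as an atomic add of 0, and release via atom_xchg instead of a plain store.
int ch = atom_add(&something[locked], 0);

if (ch != lockvalue)
{
    if (ch == atom_cmpxchg(&something[locked], ch, lockvalue))
    {
        // something to determine value
        atom_xchg(&something[locked], value);
        mem_fence(CLK_GLOBAL_MEM_FENCE);
    }
}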
