23 Replies Latest reply on Apr 6, 2012 4:05 PM by notzed

    Deadlock on 7970. Global memory consistency problem?

    arsenm

      I'm having a problem where if I have more than 1 workgroup active, and multiple workitems attempt to lock the same (global int) address, I experience deadlock. I can let it make millions of attempts, and none of them ever succeed. The same code works fine on Cypress, Cayman, as well as Nvidia GT200 and Fermi. I'm wondering what might have changed, especially since there doesn't seem to be hardware documentation available yet.

       

      I'm wondering if somehow a global memory consistency issue is showing up. I know that technically the OpenCL specification only guarantees global memory consistency between work-items within the same workgroup, but I'm already relying on several AMD and Nvidia GPU hardware details for maximum performance.

       

      The code looks something like this (inside a loop that retries on lock failures):

       

      if (ch != lockvalue)
      {
          if (ch == atom_cmpxchg(&something[locked], ch, lockvalue))
          {
              // something to determine value
              something[locked] = value;
              mem_fence(CLK_GLOBAL_MEM_FENCE);
          }
      }
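
      Fleshed out, the whole thing is roughly like this (a stripped-down, self-contained sketch; the kernel name, the slot calculation, and the computation of value are placeholders for my real code):

      #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

      __kernel void lock_sketch(__global int *something, int lockvalue)
      {
          // Placeholder: several work-items contend for the same slot.
          int locked = get_global_id(0) % 16;
          bool done = false;

          while (!done)
          {
              int ch = something[locked];    // plain (non-atomic) read of the lock word

              if (ch != lockvalue)
              {
                  // Try to take the lock by swapping the sentinel in.
                  if (ch == atom_cmpxchg(&something[locked], ch, lockvalue))
                  {
                      // Stand-in for "something to determine value" (assumed != lockvalue).
                      int value = ch + 1;
                      something[locked] = value;   // a non-sentinel value also releases the lock
                      mem_fence(CLK_GLOBAL_MEM_FENCE);
                      done = true;
                  }
              }
          }
      }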
      

       

      In the IL, the mem_fence compiles to a fence_memory instruction, which the IL documentation describes as:

      _memory - global/scatter memory fence. It ensures that:

      - no memory import/export instructions can be re-ordered or moved across this fence instruction.

      - all memory export instructions are complete (the data has been written to physical memory, not in the cache) and is visible to other work-items.

       

      I was wondering if there is some new global memory caching behaviour across compute units, but my reading of the fence_memory description makes me think that shouldn't be a problem, since it is supposed to guarantee the data is not left in a cache. It would be nice if this documentation snippet explicitly said whether "visible to other work-items" is restricted to work-items in the same workgroup or applies across the whole device.

       

      The fence looks like it compiles in the Tahiti ISA to one of:

      s_waitcnt     vmcnt(0) & expcnt(0)
      s_waitcnt     vmcnt(0)
      s_waitcnt     expcnt(0)

       

      But since there isn't hardware documentation yet, I'm only guessing that these mean "wait until the outstanding memory accesses complete", with expcnt counting writes and vmcnt counting reads.

        • Re: Deadlock on 7970. Global memory consistency problem?
          MicahVillmow
                  // something to determine value
                  something[locked] = value;

          This seems to be your problem.

           

           

          On the new architecture, global memory is cached by default; on previous architectures it was uncached by default. Does it work if you declare something as a volatile pointer?
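
          i.e. something along these lines (a sketch; the kernel name and signature are only placeholders for yours):

          // Declaring the buffer volatile forces the compiler to re-issue every
          // load and store of the lock word instead of reusing a previously read value.
          __kernel void build_tree(volatile __global int *something)
          {
              // ... same locking code as before, operating on 'something' ...
          }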

          • Re: Deadlock on 7970. Global memory consistency problem?
            notzed

            I'm curious as to what algorithms or data structures you want this for?

            • Re: Deadlock on 7970. Global memory consistency problem?
              sh

              It is not a memory consistency problem, nor a hardware/compiler problem.

              You implicitly assume that neighbouring threads are executed independently.

              See this paper for more information (chapter "Effects of Serialization due to SIMT"): http://www.stuffedcow.net/files/gpuarch-ispass2010.pdf

                • Re: Deadlock on 7970. Global memory consistency problem?
                  LeeHowes

                  This is why using the word "thread" for work items was not one of the industry's greatest nomenclature choices.

                   

                  However, in this case I think that issue is understood, given that the barrier is outside the divergent flow and the code works on previous architectures.

                    • Re: Deadlock on 7970. Global memory consistency problem?
                      sh

                      Code that causes deadlock:

                      while(...)
                      {
                           ...
                           if(old == cmp_exch(ptr, old, -1))   //take a lock
                           {
                                ....
                                *ptr = -1; //release lock
                           }
                      }
                      

                      There is thread divergence at the if on line 4. Some compilers or hardware may decide to execute the threads that did not take the lock first, so the loop deadlocks.

                        • Re: Deadlock on 7970. Global memory consistency problem?
                          arsenm

                          The order in which items acquire the lock doesn't matter, nor does whether the items with or without the lock run first. Each item acquires the lock once and then releases it. They all keep retrying until each one has acquired the lock in turn, and then they stop. It should continue until every item has taken the lock once. I also only care specifically about what this hardware does, and in this case there is only a problem when running multiple workgroups.

                           

                          In your deadlock example, the release writes back the same value used for the lock, so it doesn't actually release the lock. It also has nothing to prevent items from trying to take the lock again after they give it up, so your example isn't really similar to this case.

                            • Re: Deadlock on 7970. Global memory consistency problem?
                              LeeHowes

                              What happens if you spread the child array out more? Make it 16 times the size and multiply all your accesses by 16. Maybe you're getting cache line interference between L1 caches on the non-atomic reads and writes and so you're losing them completely. I would have thought your atomic change would fix that but it's worth a check.
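
                              i.e. something like this (a sketch; child stands for your global int buffer):

                              // Pad the array 16x and stride the index by 16 so that contended
                              // lock words land on different cache lines.
                              int ch = child[locked * 16];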

                               

                              Also, just in case you didn't try this, make the read atomic too. Make sure every access to child is atomic to force it to push every operation out to L2.
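
                              Something like this for the read (a sketch; child and locked are the names from your code, and child is assumed to be a __global int buffer):

                              // A no-op atomic OR returns the current value but forces the access out
                              // to L2, so it can't be satisfied from a stale line in the CU-local L1.
                              int ch = atomic_or(&child[locked], 0);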

                               

                              There's probably something really obvious here and someone will point out how stupid we all are

                               

                              Looking at the ISA, you're loading the lock value here:

                              tbuffer_load_format_x  v5, v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT] // 00000110: EBA01000 80010500

                               

                              Which, if I read it correctly, is not a globally coherent load: it's pulling from L1, not L2. So there may be a problem if the atomic doesn't guarantee to clear the line from L1. Your write to clear the lock is the same; it's only going to L1, and the fences do not appear to be changing that. If that write never hits L2, you may be getting into a cycle where no other work-item is able to correctly read it.

                                • Re: Deadlock on 7970. Global memory consistency problem?
                                  arsenm

                                  LeeHowes wrote:

                                   

                                  What happens if you spread the child array out more? Make it 16 times the size and multiply all your accesses by 16. Maybe you're getting cache line interference between L1 caches on the non-atomic reads and writes and so you're losing them completely. I would have thought your atomic change would fix that but it's worth a check.

                                  That didn't change anything but I didn't really fully explore that.

                                   

                                  Also, just in case you didn't try this, make the read atomic too. Make sure every access to child is atomic to force it to push every operation out to L2.

                                  The example works if I turn the read into atomic_or(&child[locked], 0). It works whether or not I make the write to release the lock atomic.

                                   

                                  Looking at the ISA, you're loading the lock value here:

                                  tbuffer_load_format_x  v5, v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT] // 00000110: EBA01000 80010500

                                   

                                  Which, if I read it correctly, is not a globally coherent load: it's pulling from L1, not L2. So there may be a problem if the atomic doesn't guarantee to clear the line from L1. Your write to clear the lock is the same; it's only going to L1, and the fences do not appear to be changing that. If that write never hits L2, you may be getting into a cycle where no other work-item is able to correctly read it.

                                  Which part shows that? The s[4:7]? As far as I can tell there isn't documentation for the new ISA yet. Is there a better way to avoid that which doesn't use an atomic?

                                    • Re: Deadlock on 7970. Global memory consistency problem?
                                      LeeHowes

                                       There will be a doc. I have it in front of me; it's just not public yet.

                                       

                                      If I understand correctly it will say glc if it's globally coherent (an acquire or release). The atomics show this:

                                      buffer_atomic_cmpswap  v[2:3], v1, s[4:7], 0 offen glc

                                       

                                      The waits you were looking at wait for the operation to have committed *somewhere*, but quite where depends on the instruction. This is the side effect of having a relaxed memory coherency system. What we have in GCN is fairly standard and predictable. The problem is that if you reread the mem_fence spec for OpenCL it is a *local* ordering. It does not guarantee that anything commits to global visibility. That is only guaranteed by atomics, or by the end of a kernel's execution.
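
                                       So if you need the release to become globally visible before the end of the kernel, the safe route is to make it an atomic too, along these lines (a sketch reusing the something/locked/value names from your first snippet):

                                       // An atomic exchange is one of the globally coherent (glc) accesses,
                                       // so the released value should be visible to work-items on other CUs.
                                       atomic_xchg(&something[locked], value);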

                                        • Re: Deadlock on 7970. Global memory consistency problem?
                                          arsenm

                                          LeeHowes wrote:

                                           

                                          The waits you were looking at wait for the operation to have committed *somewhere*, but quite where depends on the instruction. This is the side effect of having a relaxed memory coherency system. What we have in GCN is fairly standard and predictable. The problem is that if you reread the mem_fence spec for OpenCL it is a *local* ordering. It does not guarantee that anything commits to global visibility. That is only guaranteed by atomics, or by the end of a kernel's execution.

                                           I'm still confused by a few things. First, the documentation for fence_memory in the IL specifically says:

                                           "all memory export instructions are complete (the data has been written to physical memory, not in the cache) and is visible to other work-items." Is this documentation somewhat inaccurate, or is the IL compiler not obeying it? Alternatively, is the data actually committed to physical memory, but the other compute units reading the same address see a stale copy from their private L1 cache?

                                           

                                           At least for the sample program I put here, only the atomic reads seem necessary to get it to work. I'm not sure about my real problem yet; just replacing the reads with atomic_or doesn't seem to work there, but I'll have to check a few more things later.

                                            • Re: Deadlock on 7970. Global memory consistency problem?
                                              notzed

                                              You're talking about the isa, but aren't you using opencl?

                                               

                                              opencl's memory model is pretty clear: global memory is only consistent amongst work items in the SAME work group.  Even going on the isa info you quoted, just because data has been written to physical memory, it doesn't mean it's been expunged from the caches of all the other units and they will see updates immediately.

                                              3.3.1 Memory Consistency
                                               OpenCL uses a relaxed consistency memory model; i.e. the state of memory visible to a work-item is not guaranteed to be consistent across the collection of work-items at all times.
                                              Within a work-item memory has load / store consistency. Local memory is consistent across work-items in a single work-group at a work-group barrier. Global memory is consistent across work-items in a single work-group at a work-group barrier, but there are no guarantees of memory consistency between different work-groups executing a kernel.

                                               

                                              There are good and 'obvious' reasons why this is so:

                                               a) allows an implementation to run jobs in batches if they won't all fit concurrently on a given piece of hardware.  i.e. you cannot communicate with inactive threads.

                                              b) allows localised memory optimisations such as CELL's LS or local L1 not needing to be globally consistent (potentially a huge performance bottleneck, for very little practical benefit).

                                               

                                              i.e. in general, you can't expect a global read/write data structure to work as a communication mechanism between work items.  atomics work because they can be implemented in specialised hardware so they're not too inefficient to use, and they have very limited defined functionality.

                                               

                                               The programming model and hardware are what they are, and trying to fit a round peg in a square hole won't make them any different ...

                                    • Re: Deadlock on 7970. Global memory consistency problem?
                                      notzed

                                      sh, he's only talking about a specific hardware/compiler combination, not the general case.  Given we know(?) that the 7970 will execute 64 workgroups concurrently without such local-thread decisions, it 'should' work.  Personally I don't think it's a good approach, but it's up to him/her.

                                       

                                      I would try to find another algorithm; this per-work-item locking is by definition going to kill performance even if it does work.  Even just using global atomics can be a killer on some hardware (that's why amd have an extension for atomic counters, which go into specialised functional units).

                                       

                                      e.g. a queue can be implemented as an array with two atomic counters and no spin locks; you just have to make the memory big enough to hold all the items it might ever receive (a rough sketch follows below).  Although a read/write queue is probably not a great idea, as there's no way to communicate data written to other CUs, since global barriers are only per-work-group/(CU?).
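
                                      Rough sketch of the two-counter queue (names and layout are only illustrative; it assumes the items array is preallocated large enough that the counters never need to wrap):

                                      // Producer: reserve a slot with an atomic counter, then fill it in.
                                      void queue_push(volatile __global int *tail, __global int *items, int item)
                                      {
                                          int slot = atomic_inc(tail);   // old tail value = this work-item's slot
                                          items[slot] = item;
                                      }

                                      // Consumer: claim the index of the next slot to read.  The caller still
                                      // has to know that slot was actually written (e.g. by consuming it in a
                                      // later kernel launch), since you can't wait on it across workgroups.
                                      int queue_pop(volatile __global int *head)
                                      {
                                          return atomic_inc(head);
                                      }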

                                       

                                      I'm not familiar with octree-building algorithms, but it seems you could also do a per-detail-level kernel iteration and each one takes a read-only queue of regions to process, and outputs a write-only queue of regions left to process at the next detail level.

                                       

                                      Each kernel is invoked using a 'persistent thread' algorithm: i.e. max out the parallelism to suit the hardware independent of the work size, and it has a while (index < total) { index += global work size; } loop to consume the remaining work.  You could do a few iterations without having to perform any cpu synchronisation (double-buffering), and just check once in a while to see if it's run out of work.
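
                                      i.e. roughly this shape (a sketch; the kernel name and the work buffer are placeholders):

                                      // 'Persistent threads': launch only enough work-items to fill the device,
                                      // and let each one stride over whatever work remains.
                                      __kernel void process_level(__global const int *work, const uint total)
                                      {
                                          for (uint index = get_global_id(0); index < total; index += get_global_size(0))
                                          {
                                              // ... process work[index] and emit regions for the next detail level ...
                                          }
                                      }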

                                        • Re: Deadlock on 7970. Global memory consistency problem?
                                          arsenm

                                           notzed wrote:

                                           I'm not familiar with octree-building algorithms, but it seems you could also do a per-detail-level kernel iteration and each one takes a read-only queue of regions to process, and outputs a write-only queue of regions left to process at the next detail level.

                                           That sounds like what it does when it traverses the tree in the primary kernel (which takes about 90% of the time). Not sure how to make that work with the construction here though.

                                           

                                          notzed wrote:

                                          Each kernel is invoked using a 'persistent thread' algorithm: i.e. max out the parallelism to suit the hardware independent of the work size, and it has a while (index < total) { index += global work size; } loop to consume the remaining work.  You could do a few iterations without having to perform any cpu synchronisation (double-buffering), and just check once in a while to see if it's run out of work.

                                           This is what it does. It locks once a position in the tree is found for an item, and conflicts shouldn't be particularly common. This is only for the 2nd most important kernel, which usually takes about 10% of the time. If it runs into a position where there's already a particle, there's some additional work to move to the next level.