
arsenm
Adept III

Re: Deadlock on 7970. Global memory consistency problem?

I'm not familiar with octree-building algorithms, but it seems you could also do a per-detail-level kernel iteration and each one takes a read-only queue of regions to process, and outputs a write-only queue of regions left to process at the next detail level.
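The suggested per-level structure can be sketched on the CPU side. This is a minimal illustrative sketch, not the actual kernel: `needs_split` is a hypothetical stand-in for the real octree subdivision test, and the buffers stand in for the read-only/write-only queues that would be swapped between kernel launches.

```c
#include <assert.h>
#include <string.h>

#define MAX_REGIONS 64

/* Hypothetical predicate standing in for the real octree test:
 * a region keeps subdividing while it exceeds the level's budget. */
static int needs_split(int region, int level) {
    return region > (1 << level);
}

/* One refinement pass: read regions from the input queue, and append
 * any that need further work to the output queue for the next level. */
static int process_level(const int *in, int n_in, int *out, int level) {
    int n_out = 0;
    for (int i = 0; i < n_in; ++i)
        if (needs_split(in[i], level))
            out[n_out++] = in[i] / 2;   /* stand-in for emitting a child */
    return n_out;
}

/* Drive passes until no regions remain, double-buffering the queues
 * the way successive kernel invocations would; returns passes run. */
static int build(const int *regions, int n) {
    int a[MAX_REGIONS], b[MAX_REGIONS];
    memcpy(a, regions, (size_t)n * sizeof *a);
    int *in = a, *out = b, level = 0;
    while (n > 0) {
        n = process_level(in, n, out, level++);
        int *t = in; in = out; out = t;   /* swap read and write queues */
    }
    return level;
}
```

Each pass only reads one queue and writes the other, so no cross-work-group communication is needed within a level.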

That sounds like what it does when it traverses the tree in the primary kernel (which takes about 90% of the time). I'm not sure how to make that work with the construction here, though.

notzed wrote:

Each kernel is invoked using a 'persistent thread' algorithm: i.e. max out the parallelism to suit the hardware independent of the work size, and it has a while (index < total) { index += global work size; } loop to consume the remaining work.  You could do a few iterations without having to perform any cpu synchronisation (double-buffering), and just check once in a while to see if it's run out of work.
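The persistent-thread pattern described above can be sketched serially. In this sketch, `GLOBAL_SIZE` stands in for the hardware-sized global work size, and the outer loop simulates the work-items that would actually run in parallel:

```c
#include <assert.h>

#define GLOBAL_SIZE 4   /* stand-in for the hardware's max concurrency */

/* Serial simulation of a persistent-thread kernel: a fixed number of
 * work-items (sized to the hardware, not the problem) each start at
 * their global id and stride by the global size until the work runs
 * out, so every index in [0, total) is consumed exactly once. */
static void persistent_consume(int total, int *visits) {
    for (int gid = 0; gid < GLOBAL_SIZE; ++gid) {           /* each work-item */
        for (int index = gid; index < total; index += GLOBAL_SIZE)
            visits[index] += 1;                             /* do_work(index) */
    }
}
```

The point of the pattern is that the launch size never changes with the problem size, so several iterations of work can be consumed without returning to the CPU.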

This is what it does. It locks once a position in the tree is found for an item; conflicts shouldn't be particularly common. This applies only to the second most important kernel, which usually takes about 10% of the time. If it runs into a position where there's already a particle, there's some additional work to move to the next level.
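The "claim the position, or descend on conflict" logic can be sketched with C11 atomics. This is a toy model under stated assumptions: the flat `slots` array stands in for the real octree nodes, one slot per level, and the compare-and-swap plays the role of the OpenCL atomic lock.

```c
#include <assert.h>
#include <stdatomic.h>

#define LEVELS 8
#define EMPTY  0

/* One slot per tree level; EMPTY means unclaimed (items must be nonzero). */
static _Atomic int slots[LEVELS];

/* Try to claim the first free slot with a compare-and-swap, descending a
 * level whenever the slot is already taken. Returns the level at which
 * `item` was inserted, or -1 if every level is occupied. */
static int insert_item(int item) {
    for (int level = 0; level < LEVELS; ++level) {
        int expected = EMPTY;
        /* The CAS succeeds only if the slot is still empty; an atomic
         * RMW like this is the one cross-work-group communication that
         * OpenCL actually guarantees to be coherent. */
        if (atomic_compare_exchange_strong(&slots[level], &expected, item))
            return level;
        /* Slot occupied: additional work, move to the next level. */
    }
    return -1;
}
```

Because the claim is a single atomic operation, a conflicting work-item simply observes the CAS failure and moves on, rather than spinning on a lock held by a possibly inactive work-group.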

LeeHowes
Staff

Re: Deadlock on 7970. Global memory consistency problem?

There will be a doc. I have it in front of me; it's just not public yet.

If I understand correctly, it will say glc if it's globally coherent (an acquire or release). The atomics show this:

buffer_atomic_cmpswap  v[2:3], v1, s[4:7], 0 offen glc

The waits you were looking at wait for the operation to have committed *somewhere*, but quite where depends on the instruction. This is the side effect of having a relaxed memory coherency system. What we have in GCN is fairly standard and predictable. The problem is that if you reread the mem_fence spec for OpenCL it is a *local* ordering. It does not guarantee that anything commits to global visibility. That is only guaranteed by atomics, or by the end of a kernel's execution.
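The same distinction exists in C11 atomics, which can illustrate the point: a plain store plus a local-only fence does not publish data to another agent, but a release/acquire pair on an atomic flag does. This is a minimal sketch of that idea, not OpenCL code; the names are illustrative.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;                 /* ordinary, non-atomic memory     */
static _Atomic int ready;           /* atomic flag used to publish it  */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;                                            /* plain write */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* publish     */
    return NULL;
}

/* Spin until the flag is visible, then read the payload; the acquire
 * load pairs with the release store, so the plain write to `payload`
 * is guaranteed visible too. Without the atomic pair there would be
 * no such guarantee, which is exactly the mem_fence trap above. */
static int consume(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                                    /* wait        */
    int seen = payload;
    pthread_join(t, NULL);
    return seen;
}
```

The atomic operation is what carries the inter-agent ordering; a fence by itself only orders the issuing thread's own accesses.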

arsenm
Adept III

Re: Deadlock on 7970. Global memory consistency problem?

LeeHowes wrote:

The waits you were looking at wait for the operation to have committed *somewhere*, but quite where depends on the instruction. This is the side effect of having a relaxed memory coherency system. What we have in GCN is fairly standard and predictable. The problem is that if you reread the mem_fence spec for OpenCL it is a *local* ordering. It does not guarantee that anything commits to global visibility. That is only guaranteed by atomics, or by the end of a kernel's execution.

I'm still confused by a few things. First, the documentation for fence_memory in the IL specifically says:

"all memory export instructions are complete (the data has been written to physical memory, not in the cache) and is visible to other work-items."

Is this documentation inaccurate, or is the IL compiler not obeying it? Alternatively, is the data actually committed to physical memory, while the other compute units reading the same address get a stale copy from their private L1 cache?

At least for the sample program I posted here, only the atomic reads seem necessary to get it to work. I'm not sure about my real problem yet; just replacing the reads with atomic_or doesn't seem to be working, but I'll have to check a few more things later.
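The "atomic read" trick mentioned here relies on OR with 0 being an identity: `atomic_or(p, 0)` leaves the value unchanged but returns the old value, and as an atomic it takes the globally coherent path instead of a possibly stale per-CU L1. A C11 sketch of the same identity, with illustrative names:

```c
#include <assert.h>
#include <stdatomic.h>

static _Atomic unsigned flags;   /* stand-in for a shared global word */

/* x | 0 == x, so this is a plain read performed with read-modify-write
 * coherence: the returned value is the globally visible one, not a
 * cached copy. This is the C11 analogue of reading via atomic_or. */
static unsigned coherent_read(_Atomic unsigned *p) {
    return atomic_fetch_or(p, 0u);
}
```

The cost is that every such read is a full atomic operation, so it only makes sense where the coherence is actually needed.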

notzed
Challenger

Re: Deadlock on 7970. Global memory consistency problem?

You're talking about the ISA, but aren't you using OpenCL?

OpenCL's memory model is pretty clear: global memory is only consistent amongst work-items in the SAME work-group. Even going by the ISA info you quoted, just because data has been written to physical memory doesn't mean it's been expunged from the caches of all the other units, or that they will see updates immediately.

3.3.1 Memory Consistency
OpenCL uses a relaxed consistency memory model; i.e. the state of memory visible to a work-item is not guaranteed to be consistent across the collection of work-items at all times.
Within a work-item memory has load / store consistency. Local memory is consistent across work-items in a single work-group at a work-group barrier. Global memory is consistent across work-items in a single work-group at a work-group barrier, but there are no guarantees of memory consistency between different work-groups executing a kernel.

There are good and 'obvious' reasons why this is so:

a) It allows an implementation to run jobs in batches if they won't all fit concurrently on a given piece of hardware, i.e. you cannot communicate with inactive threads.

b) It allows localised memory optimisations, such as CELL's LS or a local L1, without them needing to be globally consistent (global consistency would be a huge performance bottleneck, for very little practical benefit).

i.e. in general, you can't expect a global read/write data structure to work as a communication mechanism between work-items. Atomics work because they can be implemented in specialised hardware, so they're not too inefficient to use, and they have very limited, well-defined functionality.

The programming model and hardware are what they are, and trying to fit a round peg into a square hole won't make them any different ...
