23 Replies Latest reply on Apr 6, 2012 4:05 PM by notzed

    Deadlock on 7970. Global memory consistency problem?

    arsenm

      I'm having a problem where if I have more than 1 workgroup active, and multiple workitems attempt to lock the same (global int) address, I experience deadlock. I can let it make millions of attempts, and none of them ever succeed. The same code works fine on Cypress, Cayman, as well as Nvidia GT200 and Fermi. I'm wondering what might have changed, especially since there doesn't seem to be hardware documentation available yet.

       

      I'm wondering if somehow a global memory consistency issue is showing up. I know that technically the OpenCL specification only guarantees global memory consistency between work-items within the same workgroup, but I'm already relying on several AMD and Nvidia GPU hardware details for maximum performance.

       

      The code looks something like this (inside a loop that retries on lock failures):

       

      if (ch != lockvalue)
      {
          if (ch == atom_cmpxchg(&something[locked], ch, lockvalue))
          {
              // something to determine value
              something[locked] = value;
              mem_fence(CLK_GLOBAL_MEM_FENCE);
          }
      }
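
      Fleshed out, the whole thing is roughly like this (a stripped-down, self-contained sketch; the kernel name, the slot calculation, and the computation of value are placeholders for my real code):

      #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

      __kernel void lock_sketch(__global int *something, int lockvalue)
      {
          // Placeholder: several work-items contend for the same slot.
          int locked = get_global_id(0) % 16;
          bool done = false;

          while (!done)
          {
              int ch = something[locked];    // plain (non-atomic) read of the lock word

              if (ch != lockvalue)
              {
                  // Try to take the lock by swapping the sentinel in.
                  if (ch == atom_cmpxchg(&something[locked], ch, lockvalue))
                  {
                      // Stand-in for "something to determine value" (assumed != lockvalue).
                      int value = ch + 1;
                      something[locked] = value;   // a non-sentinel value also releases the lock
                      mem_fence(CLK_GLOBAL_MEM_FENCE);
                      done = true;
                  }
              }
          }
      }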
      

       

      In the IL, the mem_fence compiles to a fence_memory instruction, which the IL documentation describes as:

      _memory - global/scatter memory fence. It ensures that:

      - no memory import/export instructions can be re-ordered or moved across this fence instruction.

      - all memory export instructions are complete (the data has been written to physical memory, not in the cache) and is visible to other work-items.

       

      I was wondering if there is some new global memory caching behaviour across compute units, but my reading of the fence_memory description makes me think that shouldn't be a problem, since it is supposed to guarantee the data is not left in a cache. It would be nice if this documentation snippet explicitly said whether "visible to other work-items" is restricted to work-items in the same workgroup or applies across the whole device.

       

      The fence looks like it compiles in the Tahiti ISA to one of:

      s_waitcnt     vmcnt(0) & expcnt(0)
      s_waitcnt     vmcnt(0)
      s_waitcnt     expcnt(0)

       

      But since there isn't hardware documentation yet, I'm only guessing that these mean "wait until the outstanding memory accesses complete", with expcnt counting writes and vmcnt counting reads.

        • Re: Deadlock on 7970. Global memory consistency problem?
          MicahVillmow
                  // something to determine value
                  something[locked] = value;

          This seems to be your problem.

           

           

          On the new architecture, global memory is cached by default; on previous architectures it was uncached by default. Does it work if you declare something as a volatile pointer?
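
          i.e. something along these lines (a sketch; the kernel name and signature are only placeholders for yours):

          // Declaring the buffer volatile forces the compiler to re-issue every
          // load and store of the lock word instead of reusing a previously read value.
          __kernel void build_tree(volatile __global int *something)
          {
              // ... same locking code as before, operating on 'something' ...
          }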

          • Re: Deadlock on 7970. Global memory consistency problem?
            notzed

            I'm curious as to what algorithms or data structures you want this for?

            • Re: Deadlock on 7970. Global memory consistency problem?
              sh

              It is not a memory consistency problem, nor a hardware/compiler problem.

              You implicitly assume that neighbouring threads are executed independently.

              See this paper for more information (chapter "Effects of Serialization due to SIMT"): http://www.stuffedcow.net/files/gpuarch-ispass2010.pdf

                • Re: Deadlock on 7970. Global memory consistency problem?
                  LeeHowes

                  This is why using the word "thread" for work items was not one of the industry's greatest nomenclature choices.

                   

                  However, in this case I think that issue is understood, given that the barrier is outside the divergent flow and the code works on previous architectures.

                    • Re: Deadlock on 7970. Global memory consistency problem?
                      sh

                      Code that causes deadlock:

                      while(...)
                      {
                           ...
                           if(old == cmp_exch(ptr, old, -1))   //take a lock
                           {
                                ....
                                *ptr = -1; //release lock
                           }
                      }
                      

                      There is thread divergence at the if on line 4. Some compilers or hardware may decide to execute the threads that did not take the lock first, so the loop deadlocks.

                        • Re: Deadlock on 7970. Global memory consistency problem?
                          arsenm

                          The order in which items acquire the lock doesn't matter, nor does whether the items with or without the lock run first. Each item acquires the lock once and then releases it. They all keep retrying until each one has acquired the lock in turn, and then they stop. It should continue until every item has taken the lock once. I also only care specifically about what this hardware does, and in this case there is only a problem when running multiple workgroups.

                           

                          In your deadlock example, the release writes back the same value used for the lock, so it doesn't actually release the lock. It also has nothing to prevent items from trying to take the lock again after they give it up, so your example isn't really similar to this case.

                            • Re: Deadlock on 7970. Global memory consistency problem?
                              LeeHowes

                              What happens if you spread the child array out more? Make it 16 times the size and multiply all your accesses by 16. Maybe you're getting cache line interference between L1 caches on the non-atomic reads and writes and so you're losing them completely. I would have thought your atomic change would fix that but it's worth a check.
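
                              i.e. something like this (a sketch; child stands for your global int buffer):

                              // Pad the array 16x and stride the index by 16 so that contended
                              // lock words land on different cache lines.
                              int ch = child[locked * 16];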

                               

                              Also, just in case you didn't try this, make the read atomic too. Make sure every access to child is atomic to force it to push every operation out to L2.
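
                              Something like this for the read (a sketch; child and locked are the names from your code, and child is assumed to be a __global int buffer):

                              // A no-op atomic OR returns the current value but forces the access out
                              // to L2, so it can't be satisfied from a stale line in the CU-local L1.
                              int ch = atomic_or(&child[locked], 0);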

                               

                              There's probably something really obvious here and someone will point out how stupid we all are

                               

                              Looking at the ISA, you're loading the lock value here:

                              tbuffer_load_format_x  v5, v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT] // 00000110: EBA01000 80010500

                               

                              Which, if I read it correctly, is not a globally coherent load: it's pulling from L1, not L2. So there may be a problem if the atomic doesn't guarantee to clear the line from L1. Your write to clear the lock is the same; it's only going to L1, and the fences do not appear to be changing that. If that write never hits L2, you may be getting into a cycle where no other work-item is able to correctly read it.

                                • Re: Deadlock on 7970. Global memory consistency problem?
                                  arsenm

                                  LeeHowes wrote:

                                   

                                  What happens if you spread the child array out more? Make it 16 times the size and multiply all your accesses by 16. Maybe you're getting cache line interference between L1 caches on the non-atomic reads and writes and so you're losing them completely. I would have thought your atomic change would fix that but it's worth a check.

                                  That didn't change anything but I didn't really fully explore that.

                                   

                                  Also, just in case you didn't try this, make the read atomic too. Make sure every access to child is atomic to force it to push every operation out to L2.

                                  The example works if I turn the read into atomic_or(&child[locked], 0). It works whether or not I make the write to release the lock atomic.

                                   

                                  Looking at the ISA, you're loading the lock value here:

                                  tbuffer_load_format_x  v5, v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT] // 00000110: EBA01000 80010500

                                   

                                  Which, if I read it correctly, is not a globally coherent load: it's pulling from L1, not L2. So there may be a problem if the atomic doesn't guarantee to clear the line from L1. Your write to clear the lock is the same; it's only going to L1, and the fences do not appear to be changing that. If that write never hits L2, you may be getting into a cycle where no other work-item is able to correctly read it.

                                  Which part shows that? The s[4:7]? As far as I can tell there isn't documentation for the new ISA yet. Is there a better way to avoid that which doesn't use an atomic?

                                    • Re: Deadlock on 7970. Global memory consistency problem?
                                      LeeHowes

                                       There will be a doc. I have it in front of me; it's just not public yet.

                                       

                                      If I understand correctly it will say glc if it's globally coherent (an acquire or release). The atomics show this:

                                      buffer_atomic_cmpswap  v[2:3], v1, s[4:7], 0 offen glc

                                       

                                      The waits you were looking at wait for the operation to have committed *somewhere*, but quite where depends on the instruction. This is the side effect of having a relaxed memory coherency system. What we have in GCN is fairly standard and predictable. The problem is that if you reread the mem_fence spec for OpenCL it is a *local* ordering. It does not guarantee that anything commits to global visibility. That is only guaranteed by atomics, or by the end of a kernel's execution.
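
                                       So if you need the release to become globally visible before the end of the kernel, the safe route is to make it an atomic too, along these lines (a sketch reusing the something/locked/value names from your first snippet):

                                       // An atomic exchange is one of the globally coherent (glc) accesses,
                                       // so the released value should be visible to work-items on other CUs.
                                       atomic_xchg(&something[locked], value);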

                                        • Re: Deadlock on 7970. Global memory consistency problem?
                                          arsenm

                                          LeeHowes wrote:

                                           

                                          The waits you were looking at wait for the operation to have committed *somewhere*, but quite where depends on the instruction. This is the side effect of having a relaxed memory coherency system. What we have in GCN is fairly standard and predictable. The problem is that if you reread the mem_fence spec for OpenCL it is a *local* ordering. It does not guarantee that anything commits to global visibility. That is only guaranteed by atomics, or by the end of a kernel's execution.

                                           I'm still confused by a few things. First, the documentation for fence_memory in the IL specifically says:

                                           "all memory export instructions are complete (the data has been written to physical memory, not in the cache) and is visible to other work-items." Is this documentation somewhat inaccurate, or is the IL compiler not obeying it? Alternatively, is the data actually committed to physical memory, but the other compute units reading the same address see a stale copy from their private L1 cache?

                                           

                                           At least for the sample program I put here, only the atomic reads seem necessary to get it to work. I'm not sure about my real problem yet; just replacing the reads with atomic_or doesn't seem to work there, but I'll have to check a few more things later.

                                            • Re: Deadlock on 7970. Global memory consistency problem?
                                              notzed

                                              You're talking about the isa, but aren't you using opencl?

                                               

                                              opencl's memory model is pretty clear: global memory is only consistent amongst work items in the SAME work group.  Even going on the isa info you quoted, just because data has been written to physical memory, it doesn't mean it's been expunged from the caches of all the other units and they will see updates immediately.

                                              3.3.1 Memory Consistency
                                               OpenCL uses a relaxed consistency memory model; i.e. the state of memory visible to a work-item is not guaranteed to be consistent across the collection of work-items at all times.
                                              Within a work-item memory has load / store consistency. Local memory is consistent across work-items in a single work-group at a work-group barrier. Global memory is consistent across work-items in a single work-group at a work-group barrier, but there are no guarantees of memory consistency between different work-groups executing a kernel.

                                               

                                              There are good and 'obvious' reasons why this is so:

                                               a) allows an implementation to run jobs in batches if they won't all fit concurrently on a given piece of hardware.  i.e. you cannot communicate with inactive threads.

                                              b) allows localised memory optimisations such as CELL's LS or local L1 not needing to be globally consistent (potentially a huge performance bottleneck, for very little practical benefit).

                                               

                                              i.e. in general, you can't expect a global read/write data structure to work as a communication mechanism between work items.  atomics work because they can be implemented in specialised hardware so they're not too inefficient to use, and they have very limited defined functionality.

                                               

                                               The programming model and hardware are what they are, and trying to fit a round peg in a square hole won't make them any different ...

                                    • Re: Deadlock on 7970. Global memory consistency problem?
                                      notzed

                                      sh, he's only talking about a specific hardware/compiler combination, not the general case.  Given we know(?) that the 7970 will execute 64 workgroups concurrently without such local-thread decisions, it 'should' work.  Personally I don't think it's a good approach, but it's up to him/her.

                                       

                                      I would try to find another algorithm; this per-work-item locking is by definition going to kill performance even if it does work.  Even just using global atomics can be a killer on some hardware (that's why amd have an extension for atomic counters, which go into specialised functional units).

                                       

                                      e.g. a queue can be implemented as an array with two atomic counters and no spin locks; you just have to make the memory big enough to hold all the items it might ever receive (a rough sketch follows below).  Although a read/write queue is probably not a great idea, as there's no way to communicate data written to other CUs, since global barriers are only per-work-group/(CU?).
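
                                      Rough sketch of the two-counter queue (names and layout are only illustrative; it assumes the items array is preallocated large enough that the counters never need to wrap):

                                      // Producer: reserve a slot with an atomic counter, then fill it in.
                                      void queue_push(volatile __global int *tail, __global int *items, int item)
                                      {
                                          int slot = atomic_inc(tail);   // old tail value = this work-item's slot
                                          items[slot] = item;
                                      }

                                      // Consumer: claim the index of the next slot to read.  The caller still
                                      // has to know that slot was actually written (e.g. by consuming it in a
                                      // later kernel launch), since you can't wait on it across workgroups.
                                      int queue_pop(volatile __global int *head)
                                      {
                                          return atomic_inc(head);
                                      }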

                                       

                                      I'm not familiar with octree-building algorithms, but it seems you could also do a per-detail-level kernel iteration and each one takes a read-only queue of regions to process, and outputs a write-only queue of regions left to process at the next detail level.

                                       

                                      Each kernel is invoked using a 'persistent thread' algorithm: i.e. max out the parallelism to suit the hardware independent of the work size, and it has a while (index < total) { index += global work size; } loop to consume the remaining work.  You could do a few iterations without having to perform any cpu synchronisation (double-buffering), and just check once in a while to see if it's run out of work.
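
                                      i.e. roughly this shape (a sketch; the kernel name and the work buffer are placeholders):

                                      // 'Persistent threads': launch only enough work-items to fill the device,
                                      // and let each one stride over whatever work remains.
                                      __kernel void process_level(__global const int *work, const uint total)
                                      {
                                          for (uint index = get_global_id(0); index < total; index += get_global_size(0))
                                          {
                                              // ... process work[index] and emit regions for the next detail level ...
                                          }
                                      }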

                                        • Re: Deadlock on 7970. Global memory consistency problem?
                                          arsenm

                                           notzed wrote:

                                           I'm not familiar with octree-building algorithms, but it seems you could also do a per-detail-level kernel iteration and each one takes a read-only queue of regions to process, and outputs a write-only queue of regions left to process at the next detail level.

                                           That sounds like what it does when it traverses the tree in the primary kernel (which takes about 90% of the time). Not sure how to make that work with the construction here though.

                                           

                                          notzed wrote:

                                          Each kernel is invoked using a 'persistent thread' algorithm: i.e. max out the parallelism to suit the hardware independent of the work size, and it has a while (index < total) { index += global work size; } loop to consume the remaining work.  You could do a few iterations without having to perform any cpu synchronisation (double-buffering), and just check once in a while to see if it's run out of work.

                                           This is what it does. It locks once a position in the tree is found for an item, and conflicts shouldn't be particularly common. This is only for the 2nd most important kernel, which usually takes about 10% of the time. If it runs into a position where there's already a particle, there's some additional work to move to the next level.