8 Replies Latest reply on Oct 29, 2012 11:50 PM by notzed

    Implementing 64 bit atomics

    chersanya

      I know that no ATI cards and drivers support 64 bit atomics, but they are needed for my project. I've tried to implement them this way:

       

      long _atomic_xchg(volatile __global int3 *p, long val)
      {
          __global long *p_val = (__global long)p;
          __global int *p_lock = (__global int)p + 2;
          while(atomic_cmpxchg(p_lock, 0, 1));
          long old = *p_val;
          *p_val = val;
          atomic_xchg(p_lock, 0);
          *p_lock = 0;
          return old;
      }
      
      __kernel void build_hashtable(
          __global int3 *table)
      {    
          _atomic_xchg(table, 1);
      }
      

      This contains _atomic_xchg for long (actually it gets a pointer to int3, where the first two ints are the value, and the third is used for locking) and simple kernel to test this. However, it hangs up and the driver restarts. It happens even for small global work sizes, like 64, and I don't see the idea how to fix it; atomic_xchg is the only function needed.

        • Re: Implementing 64 bit atomics
          LeeHowes

          A work item is NOT a thread. There is no independent program counter. You are programming for a SIMD architecture, so you have to realise that if one work item is spinning in the while loop, 64 for of them are together. If the one that grabbed the lock is in the same wavefront as one that is spinning (which will always be the case unless only one work item enters that function) then it cannot progress and you have deadlock.

           

          Try moving the while loop out, something like this:

          bool done = false;

          while(!done) {

          a = atomic_cmpxchg(...);

          if( a ) {

          ... do the operation because you own the lock

          done = true;

          }

          }

           

          I think the logic there works... but maybe not. You should be able to do something along those lines, though. Once all lanes have performed the atomic, the wavefront can move on.

          1 of 1 people found this helpful
            • Re: Implementing 64 bit atomics
              chersanya

              Hm, I thought that when a wavefront executes instructions, which are not needed for current work item (because of a conditional expression), this work item just does nops. Actually they do the same operations, but their results are discarded?

               

              Yes, similar logic seems to work for me. After asking the question, I've found another similar discussion and modified the function:

               

              long _atomic_xchg(volatile __global int3 *p, long val)
              {
                  __global long *p_val = (__global long*)p;
                  __global int *p_lock = (__global int*)p + 2;
                  
                  long old;
                  int wait = 1;
                  while (wait)
                  {
                      if (atomic_xchg(p_lock, 1))
                      {
                          old = *p_val;
                          *p_val = val;
                          atomic_xchg(p_lock, 0);
                          wait = 0;
                      }
                  }
                  return old;
              }
              

              Now it works. However, just curious, is it possible to remove the need of having additional field to lock on?

                • Re: Implementing 64 bit atomics
                  LeeHowes

                  no-op or discarded output is the same, really. How masked operations are implemented depends on the architecture. If you think about how you'd do it on SSE where you can't mask arithmetic operations you'd have a masked copy (an and, say) into an output register at the end of the divergent block. The GPU may do that, it may switch lanes off for lower power consumption.

                   

                  I doubt you can get rid of the lock because without it you have no way to guarantee that the two sub-operations work atomically. Actually, in the code above you may still not get the right answer, I think. The compiler or hardware may not enforce visbility of operations on p_val in the way you expect with respect to the atomic operations. OpenCL's memory model is very weak, like C's, and offers little guarantee of ordering across work items.

                    • Re: Implementing 64 bit atomics
                      chersanya

                      And is there a way to implement such atomics so that they are guaranteed to work? May be slower, but work. If it's not in general case, what about atomics in a single work-group? Or even wavefront?

                      Honestly, I can't think of a case when the last code isn't atomic, I mean when two work items execute this and one of their changes is lost. However I'll test it to understand.

                        • Re: Implementing 64 bit atomics
                          notzed

                          Your updates to global p_val can't be forced to be seen outside of the work-group without using atomics.  And they would only be valid within the work-group if you use global barriers.  And barriers can't be conditional.  Which suddenly makes your loop a whole lot more complicated and slower.

                           

                          So although the code might guarantee atomicity of execution within and across multiple compute units, there can't be a guarantee of atomicity of data unless the data operations themselves are implemented using atomics.

                           

                          I've managed to avoid any global atomic usage outside of atomic counters - which can be and are implemented in hardware.  These let one implement lockless batch oriented queues without needing any serialisation at all.  It wont fit every problem but then again gpu's don't fit many problems.  The only other tool of global synchronisation I use is  kernel invocation - this is the only way to synchronise non-atomic updates to global data across compute units.

                  • Re: Implementing 64 bit atomics
                    MicahVillmow

                    This is incorrect, the GCN(Most HD7XXX devices) based devices have support for 64bit atomics. If they are not enabled by default(not sure what release they are enabled in), they can be enabled with the environment variable GPU_64BIT_ATOMICS=1.