4 Replies Latest reply on Dec 3, 2011 5:58 AM by Meteorhead

    implicit(?) sync

    Meteorhead
      threads conflict in __local

      Hi!

      I would like to ask something concerning __local storage. Is it possible for a thread to conflict with itself when writing to __local? I have encountered a problem where, according to the algorithm, threads should not be able to conflict when reading and writing each other's memory (I placed sync commands at every corner), yet the simulation data becomes corrupt. The corruption does not occur if I introduce atomic_xor() into the __local update (which, without atomics, would simply be ^=).

      I am quite certain that the algorithm does not allow threads to conflict, but let me ask: how are __local accesses ordered within a single thread, and between threads outside a warp but inside the same workgroup?

      The code is attached. The reason behind the three updates is that on the simulated 2D lattice, one update consists of updating self and two neighbours; but because of the bit coding, with one integer holding a 4-by-4 block of lattice sites, and because sites are picked at random, there is no telling at compile time whether self and the neighbours reside in the same integer or in different ones. Hence we calculate indices and shifts with a function. Vector elements are spaced far enough from one another that they definitely do not conflict with other vector elements. (So basically a conflict can only happen on lane x, or y, or z, or w.)

      inline void updateVector(const uint4 coordsX, const uint4 coordsY,
                               const uint4 flip, __local uint* cache)
      {
          uint4 index;
          uint4 shift;
          uint4 mask;
          uint4 notOnDeadBorder;

          notOnDeadBorder = (((coordsX + (uint4)(1u)) == (uint4)(LOCAL_WIDTH_IN_LATTICES)) ||
                             ((coordsY + 1u) == (uint4)(LOCAL_HEIGHT_IN_LATTICES)))
                                ? (uint4)(0u, 0u, 0u, 0u)
                                : (uint4)(3u, 3u, 3u, 3u);

          // Update self
          index = convert_uint4(floor(0.250001f * convert_float4(coordsY % LOCAL_HEIGHT_IN_LATTICES)) * LOCAL_WIDTH_IN_INTS) +
                  convert_uint4(floor(0.250001f * convert_float4(coordsX % LOCAL_WIDTH_IN_LATTICES)));
          shift = 30 - (((coordsY % 4u) * 8u) + ((coordsX % 4u) * 2u));
          mask  = ((flip && notOnDeadBorder) ? (uint4)(3u, 3u, 3u, 3u)
                                             : (uint4)(0u, 0u, 0u, 0u)) << shift;
          atomic_xor(&cache[index.s0], mask.s0);
          atomic_xor(&cache[index.s1], mask.s1);
          atomic_xor(&cache[index.s2], mask.s2);
          atomic_xor(&cache[index.s3], mask.s3);

          // Update right neighbour
          index = convert_uint4(floor(0.250001f * convert_float4(coordsY % LOCAL_HEIGHT_IN_LATTICES)) * LOCAL_WIDTH_IN_INTS) +
                  convert_uint4(floor(0.250001f * convert_float4((coordsX + 1u) % LOCAL_WIDTH_IN_LATTICES)));
          shift = 30 - (((coordsY % 4u) * 8u) + (((coordsX + 1u) % 4u) * 2u));
          mask  = ((flip && notOnDeadBorder) ? (uint4)(1u, 1u, 1u, 1u)
                                             : (uint4)(0u, 0u, 0u, 0u)) << shift;
          atomic_xor(&cache[index.s0], mask.s0);
          atomic_xor(&cache[index.s1], mask.s1);
          atomic_xor(&cache[index.s2], mask.s2);
          atomic_xor(&cache[index.s3], mask.s3);

          // Update bottom neighbour
          index = convert_uint4(floor(0.250001f * convert_float4((coordsY + 1u) % LOCAL_HEIGHT_IN_LATTICES)) * LOCAL_WIDTH_IN_INTS) +
                  convert_uint4(floor(0.250001f * convert_float4(coordsX % LOCAL_WIDTH_IN_LATTICES)));
          shift = 30 - ((((coordsY + 1u) % 4u) * 8u) + ((coordsX % 4u) * 2u));
          mask  = ((flip && notOnDeadBorder) ? (uint4)(2u, 2u, 2u, 2u)
                                             : (uint4)(0u, 0u, 0u, 0u)) << shift;
          atomic_xor(&cache[index.s0], mask.s0);
          atomic_xor(&cache[index.s1], mask.s1);
          atomic_xor(&cache[index.s2], mask.s2);
          atomic_xor(&cache[index.s3], mask.s3);
      }

        • implicit(?) sync
          himanshu.gautam

          When all the threads update self and their two neighbours, won't multiple threads end up updating the same location in __local memory?

          Anyway, does the generated ISA contain the xor operations with x/y/z/w labels?

          Also, it would be helpful if you pasted the host code, so that a reproduction is simpler in case it is indeed a bug.

            • implicit(?) sync
              Meteorhead

              No, collisions are avoided by decomposing the simulated system and updating it in the manner shown below (see code). This is the update scheme of four threads, separated by double lines. Each thread decides uniformly whether to update inside integers 1, 2, 3 or 4; x, y, z and w are the spacing of the vector. (Basically, one thread can do the work of four by using all four lanes of a vector.) Since the 1-2-3-4 decision is uniform across all threads inside a WG, if the indexers are right (which they most likely are), collisions should not happen even when neighbours reside in a different integer, because the update sites 1-2-3-4 are spaced far enough from each other.

              Anyhow, I would gladly look into the ISA on my Windows machine, but Kernel Analyzer is still broken (see the GPU Developers section for the topic), so I'll go over to Linux for that. Other than that, I'll also test it on NV to see whether it behaves strangely there as well.

               

              [1.x][2.x][1.y][2.y]||[1.x][2.x][1.y][2.y]
              [3.x][4.x][3.y][4.y]||[3.x][4.x][3.y][4.y]
              [1.z][2.z][1.w][2.w]||[1.z][2.z][1.w][2.w]
              [3.z][4.z][3.w][4.w]||[3.z][4.z][3.w][4.w]
              --------------------++--------------------
              --------------------++--------------------
              [1.x][2.x][1.y][2.y]||[1.x][2.x][1.y][2.y]
              [3.x][4.x][3.y][4.y]||[3.x][4.x][3.y][4.y]
              [1.z][2.z][1.w][2.w]||[1.z][2.z][1.w][2.w]
              [3.z][4.z][3.w][4.w]||[3.z][4.z][3.w][4.w]

                • implicit(?) sync
                  Meteorhead

                  I have packed source code into a pack at:

                  http://www.kfki.hu/~mnagy/2dKPZ_CL_v1-0.zip

                  It is a VS2010 project; the interesting part is inside the kernel code, which is heavily commented.

                  As for your question: no, according to the IL, data storage does not go down the four lanes of a vector; instead, LDS_STORE commands go down the x-lanes of the vectors one after another. (This is the atomic behaviour.) Issuing a barrier before going to the next iteration solves the problem, but there should really be no need for a barrier: mem_fence(CLK_LOCAL_MEM_FENCE) should be enough, as that should prevent the next iteration from reading before the previous stores have made it to LDS. Explicit loop unrolling does not solve the issue either.

                    • implicit(?) sync
                      Meteorhead

                      While we're at it, let me get this straight: is there any reason for an algorithm to use barriers instead of mem_fences? Whenever one comes across a synchronization issue, it really doesn't matter how exactly the threads execute; what matters is that the programmer wants to impose some memory coherency on the workgroup. AFAIK mem_fence prevents memory operations from being reordered across the fence, both reads and writes. A barrier, in contrast, synchronizes the threads AND issues one type of mem_fence. But are there really algorithms that require barriers?

                      This particular algorithm works fine with atomics and barrier, but not with mem_fence, although in my mind the latter should be sufficient. While the last wavefront is still performing its update based on data read into registers, the other wavefronts could safely start the next iteration with random-number generation and the like, and threads should only stall upon accessing __local for as long as the last wavefront's writes to LDS have not completed, since there is a mem_fence(). A simplified scheme of the kernel loop:

                      for() { GEN PRNG; READ LDS; CALC; WRITE LDS; FENCE; }

                      Atomics now work fine (inside WRITE LDS), but they are completely unnecessary. Disabling the first of the PRNG calls and hardcoding its output to the constant 1 OR 2 OR 3 OR 4 (which decides whether lattice group 1-2-3-4 should be updated), the sim data stays intact, even with the for loop having length > 1. Enabling this PRNG, the sim data becomes corrupt with FENCE but stays intact with BARRIER. Even if the algorithm were misconditioned (buggy), there should be no behavioral difference.

                      If all my assumptions seem reasonable, then I believe it is not my wrongdoing that this code does not work. (It works all right, but people in the research group wanted to run it at home on a GTS250, and the compiler crashes saying "unsupported operation", which is most likely caused by the atomics. Also, on Teslas, performance seems abysmal. On AMD HW there only seems to be a ~10% perf difference between using atomics and simply mem_fence (it seems the 5870 is adept at using atomics with few conflicts), but I suspect this is what hurts perf so much on NV, so I thought I'd go about avoiding atomics.)