4 Replies Latest reply on Aug 14, 2010 7:49 AM by philips

    unexpected barrier / mem_fence behavior

    philips
      barrier necessary even though mem_fence should be enough

       

      The following is a piece of code that I use in order to batch a group of operations. Since I am porting from CUDA, I have kept the warp structure. The CUDA code, however, did not use any synchronization barriers.

       

      At the start of the kernel I use fetchWorkFirst. Here it doesn't seem to matter whether I use a barrier or a mem_fence.

      Then there is a loop at the end of which I call fetchWorkNext.

       

       

      The workgroup size is 32 x 2. The work items with get_local_id(0) == 0 read from a global variable to find out which data their 32 work items should work on and subsequently write that value to a local variable.

       

      Now I would think that a mem_fence(CLK_LOCAL_MEM_FENCE) should be sufficient, so the other 31 work items would have to wait to read the value in the local variable that was just written by work-item 0.

      With mem_fence, however, the result of the kernel is wrong. It works with a barrier.

      Why does mem_fence not work here?


      void fetchWorkFirst(int* warp, int* batchCounter, __global int* g_warpCounter,
                          int batchSize, volatile __local S32* sharedTemp)
      {
          if (get_local_id(0) == 0)
              *sharedTemp = atom_add(g_warpCounter, batchSize);
          mem_fence(CLK_LOCAL_MEM_FENCE);
          *warp = *sharedTemp;
          *batchCounter = batchSize;
      }

      void fetchWorkNext(int* warp, int* batchCounter, __global int* g_warpCounter,
                         int batchSize, volatile __local S32* sharedTemp)
      {
          (*batchCounter)--;
          if (*batchCounter > 0)
              (*warp)++;
          else
          {
              if (get_local_id(0) == 0)
                  *sharedTemp = atom_add(g_warpCounter, batchSize);
              barrier(CLK_LOCAL_MEM_FENCE);
              *batchCounter = batchSize;
              *warp = *sharedTemp;
          }
      }

        • unexpected barrier / mem_fence behavior
          genaganna

           

          Originally posted by: philips

          Why does mem_fence not work here?

          Philips,

                    mem_fence works per work-item, not per work-group. There is a race condition in your code: work-item 0 writes to sharedTemp while the other work-items read it, and a mem_fence only orders the memory operations of the work-item that executes it; it does not make the other work-items wait for the write. (The value in sharedTemp itself is always consistent, since you are using atomics.) As per my understanding, only a barrier is appropriate here.

          Could you please also post your CUDA code here?
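          The difference can be sketched with an analogous POSIX-threads program in plain C (a hypothetical analogy, not the poster's code): a fence would only order thread 0's own memory operations, while a barrier makes every thread wait until thread 0's write has happened.

          ```c
          /* Hypothetical analogy in POSIX threads: shared_temp plays the role of
           * the __local sharedTemp, and pthread_barrier_wait plays the role of
           * barrier(CLK_LOCAL_MEM_FENCE). A fence alone would not make threads
           * 1..N-1 wait for thread 0's write; the barrier does. */
          #include <pthread.h>
          #include <stdatomic.h>
          #include <stdio.h>

          #define NTHREADS   4
          #define BATCH_SIZE 3

          static pthread_barrier_t bar;
          static atomic_int warp_counter;   /* role of g_warpCounter */
          static int shared_temp;           /* role of *sharedTemp   */
          static int results[NTHREADS];     /* what each thread read */

          static void *worker(void *arg)
          {
              int id = *(int *)arg;
              if (id == 0)   /* only "work-item" 0 fetches the next warp index */
                  shared_temp = atomic_fetch_add(&warp_counter, BATCH_SIZE);
              /* A fence here would only order thread 0's own stores; it would
               * not delay the other threads. The barrier makes everyone wait. */
              pthread_barrier_wait(&bar);
              results[id] = shared_temp;    /* guaranteed to see thread 0's write */
              return NULL;
          }

          int run_batch(void)
          {
              pthread_t t[NTHREADS];
              int ids[NTHREADS];
              atomic_init(&warp_counter, 0);
              pthread_barrier_init(&bar, NULL, NTHREADS);
              for (int i = 0; i < NTHREADS; i++) {
                  ids[i] = i;
                  pthread_create(&t[i], NULL, worker, &ids[i]);
              }
              for (int i = 0; i < NTHREADS; i++)
                  pthread_join(t[i], NULL);
              pthread_barrier_destroy(&bar);
              /* the first atomic_fetch_add returns 0, so every thread must
               * have read 0 if the synchronization worked */
              int ok = 1;
              for (int i = 0; i < NTHREADS; i++)
                  if (results[i] != 0)
                      ok = 0;
              return ok;
          }

          int main(void)
          {
              printf(run_batch() ? "all threads saw the fetched warp index\n"
                                 : "race detected\n");
              return 0;
          }
          ```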

            • unexpected barrier / mem_fence behavior
              philips

              Thank you.

               

              Here is the CUDA code.

               

               

               

              __device__ void fetchWorkFirst(int& warp, int& batchCounter, int* warpCounter,
                                             int batchSize, volatile S32& sharedTemp)
              {
                  if (threadIdx.x == 0)
                      sharedTemp = atomicAdd(warpCounter, batchSize);
                  warp = sharedTemp;
                  batchCounter = batchSize;
              }

              __device__ void fetchWorkNext(int& warp, int& batchCounter, int* warpCounter,
                                            int batchSize, volatile S32& sharedTemp)
              {
                  batchCounter--;
                  if (batchCounter > 0)
                      warp++;
                  else
                  {
                      if (threadIdx.x == 0)
                          sharedTemp = atomicAdd(warpCounter, batchSize);
                      batchCounter = batchSize;
                      warp = sharedTemp;
                  }
              }

                • unexpected barrier / mem_fence behavior
                  genaganna

                   

                  Originally posted by: philips

                  Here is the CUDA code.

                  Which device are you running on? Could you please send us the complete code? You can also send it to streamdeveloper@amd.com.

                    • unexpected barrier / mem_fence behavior
                      philips

                      I tried to cut down my OpenCL code as much as possible. Even with the barrier there is a slight problem that I can't explain.

                      ------------------------------------------------

                      It uses a work-group size of 32 x 2, so two warps.

                      fetchWorkTemp is a local memory array with one int for each warp; fetchWorkTemp0 points to the value for the current warp.

                      framePtr is a pointer to the pixel in the output image. The pixel is determined by the warp number and the local id. Then the color (a green tone) is written to the pixel.

                      The important part is the warp number. Each work-item stays active and fetches new work until all pixels are finished. The first work-item gets the warp number via the global warpCounter and increases the counter in batches of 3. Then all work-items in the warp save that number (*warp = *sharedTemp). When the batch is done, fetchWorkNext fetches the next batch.

                       

                      The exact behavior/problem:

                       

                      - The NVIDIA GPU runs the kernel correctly even without a barrier or mem_fence, probably because all threads in a warp run in lockstep (SIMD) and so automatically wait until the first work-item has fetched the warp number.

                       

                      - When, instead of a barrier, I just use a mem_fence on the CPU, only the first pixel of each warp gets colored and all the other pixels stay black.

                       

                      - When I use a barrier on the CPU, it works correctly for most of the picture, but there are errors in the first couple of warps. On my single-core CPU (Athlon XP 3000+) everything turns green except the first batch: for the first three warps only the first work-item works correctly and the other pixels stay black (just as with mem_fence, but not across the whole image). On a multi-core CPU (Core2Quad or Xeon) the effect extends beyond the first batch.

                       

                       

                       

                      ------------------------------------------------

                      The full CUDA program is available online: http://code.google.com/p/efficient-sparse-voxel-octrees/

                       

                       

                      void fetchWorkFirst(int* warp, int* batchCounter, __global int* g_warpCounter,
                                          int batchSize, volatile __local S32* sharedTemp)
                      {
                          if (get_local_id(0) == 0)
                              *sharedTemp = atomic_add(g_warpCounter, batchSize);
                          barrier(CLK_LOCAL_MEM_FENCE);
                          *batchCounter = batchSize;
                          *warp = *sharedTemp;
                      }

                      void fetchWorkNext(int* warp, int* batchCounter, __global int* g_warpCounter,
                                         int batchSize, volatile __local S32* sharedTemp)
                      {
                          (*batchCounter)--;
                          if (*batchCounter > 0)
                              (*warp)++;
                          else
                          {
                              if (get_local_id(0) == 0)
                                  *sharedTemp = atomic_add(g_warpCounter, batchSize);
                              barrier(CLK_LOCAL_MEM_FENCE);
                              *batchCounter = batchSize;
                              *warp = *sharedTemp;
                          }
                      }

                      __kernel void renderKernel(__constant StandardInput* input,
                                                 __global int* g_warpCounter,
                                                 __global unsigned int* g_frame)
                      {
                          __local int fetchWorkTemp[2];
                          volatile __local int* fetchWorkTemp0 = &fetchWorkTemp[get_local_id(1)];
                          __global int* framePtr;

                          // fetch first warp of work
                          int warp, batchCounter;
                          fetchWorkFirst(&warp, &batchCounter, g_warpCounter, (*input).batchSize, fetchWorkTemp0);
                          if (warp * 32 >= (*input).totalWork)
                              return; // terminate before starting at all

                          // main warp loop
                          for (;;)
                          {
                              // ray index
                              int pidx = warp * 32 + get_local_id(0);
                              if (pidx >= (*input).totalWork)
                                  return; // terminate individual rays

                              framePtr = (__global U32*)g_frame + pidx;
                              *framePtr = 10000000; // color

                              // fetch more work
                              fetchWorkNext(&warp, &batchCounter, g_warpCounter, (*input).batchSize, fetchWorkTemp0);
                          }
                      }
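                      One thing worth checking (an observation on the kernel above, not a confirmed diagnosis): the OpenCL spec requires barrier to be reached by every work-item in the work-group, but this kernel lets individual work-items return while the rest keep looping into the barrier inside fetchWorkNext, which is undefined behavior and could plausibly account for the residual errors on the CPU. A restructuring where every thread exits the loop together can be sketched, again using POSIX threads as a stand-in for the work-group (run_uniform and the other names are hypothetical):

                      ```c
                      /* Sketch of a "uniform exit" fetch loop: every thread executes the
                       * same number of barrier waits, and they all leave the loop together.
                       * POSIX threads stand in for an OpenCL work-group (hypothetical analogy). */
                      #include <pthread.h>
                      #include <stdatomic.h>
                      #include <stdio.h>

                      #define NTHREADS   4   /* "work-group" size */
                      #define TOTAL_WORK 10  /* number of batches */

                      static pthread_barrier_t bar;
                      static atomic_int warp_counter;   /* role of g_warpCounter */
                      static int shared_next;           /* role of *sharedTemp   */
                      static atomic_int processed;      /* items handled, for checking */

                      static void *worker(void *arg)
                      {
                          int id = *(int *)arg;
                          for (;;) {
                              if (id == 0)
                                  shared_next = atomic_fetch_add(&warp_counter, 1);
                              pthread_barrier_wait(&bar);      /* everyone reaches this together */
                              int work = shared_next;          /* all threads read the same value */
                              pthread_barrier_wait(&bar);      /* reads finish before the next write */
                              if (work >= TOTAL_WORK)
                                  return NULL;                 /* uniform exit: all threads agree */
                              atomic_fetch_add(&processed, 1); /* "process" one item per thread */
                          }
                      }

                      int run_uniform(void)
                      {
                          pthread_t t[NTHREADS];
                          int ids[NTHREADS];
                          atomic_init(&warp_counter, 0);
                          atomic_init(&processed, 0);
                          pthread_barrier_init(&bar, NULL, NTHREADS);
                          for (int i = 0; i < NTHREADS; i++) {
                              ids[i] = i;
                              pthread_create(&t[i], NULL, worker, &ids[i]);
                          }
                          for (int i = 0; i < NTHREADS; i++)
                              pthread_join(t[i], NULL);
                          pthread_barrier_destroy(&bar);
                          return atomic_load(&processed);  /* TOTAL_WORK * NTHREADS when correct */
                      }

                      int main(void)
                      {
                          printf("processed %d items\n", run_uniform());
                          return 0;
                      }
                      ```

                      Because no thread leaves the loop until the whole group has seen a work index past the end, the barrier is executed by all threads on every iteration, which is the condition the OpenCL spec imposes on barrier inside loops.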