1 Reply Latest reply on Aug 18, 2010 11:18 AM by genaganna

    Work group synchronization problem on the CPU

    philips
      first atomic_add fails?

      Hi.

       

      Here is a picture. Note the black pixels. They should be green, too. Two of the 64 pixels in a work group (1 per warp) are green however. Also note: in the top right corner there are two black pixels.

      http://www.image-share.com/ipng-326-66.html

       

      I have already mentioned my problem here http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=135153 but I think it deserves its own thread.

       

       

      The goal is to keep the same work group persistently on a core and the work groups fetch new jobs until everything is finished.

      To accomplish this, there is a global variable that is set to 0 before the kernel is launched. In a loop the work groups use atomic_add to get the current number and increment it.

      (on an NVIDIA GPU this problem does not occur)

       

      In the example code I just want to paint every pixel green. It works for all jobs except the first job on every work group.

       

      -----------------------------NOTES ON CODE-------------

       

      It uses 32 x 2 work-groups, so two warps.

       

      fetchWorkTemp is a local memory array. One int for each warp. fetchworktemp0 points to the value for the current warp.

       

      framePtr is a pointer to the pixel in the output image. The pixel is decided by the warp number and local_id. Then the color (it's a green tone) is written to the pixel.

       

      The important part is the warp number. The work item stays active and fetches new work until all pixels are finished. The first work-item gets the warm number via the global warpCounter and increases the counter in batches of 3. Then all work-items in the warp save that number (*warp = *sharedTemp). When the batch is done, fetchWorkNext fetches the next batch.

       

      The exact behavior/problem

      - The NVIDIA GPU runs the kernel correctly even without barrier. Probably because of how all threads run in lockstep/SIMD and automatically have to wait till the first work-item has fetched the warp number.

      - when I use a barrier on the CPU, it works correctly for most of the picture, but there are errors for the first couple of warps. On my single core CPU (Athlon XP 3000+), everything turns green, but the first batch. For the first three warps only the first work-item works correctly and the other pixels stay black (just as with mem_fence, but not the whole image) On a multi-core CPU (Core2Quad or Xeon) the effect happens for more than the first batch.

       

       

      Do you see any problems with the code? or have I stumbled upon a bug?

       

       

       

       

      ------------------------------------------------

      the full CUDA program I am porting is available online:

      http://code.google.com/p/efficient-sparse-voxel-octrees/

       

      void fetchWorkFirst(int* warp, int* batchCounter, __global int* g_warpCounter, int batchSize, volatile __local S32* sharedTemp) { if (get_local_id(0) == 0) *sharedTemp = atomic_add(g_warpCounter, batchSize); barrier(CLK_LOCAL_MEM_FENCE); *batchCounter = batchSize; *warp = *sharedTemp; } void fetchWorkNext(int* warp, int* batchCounter, __global int* g_warpCounter, int batchSize, volatile __local S32* sharedTemp) { (*batchCounter)--; if (*batchCounter > 0) (*warp)++; else { if (get_local_id(0) == 0) *sharedTemp = atomic_add(g_warpCounter, batchSize); barrier(CLK_LOCAL_MEM_FENCE); *batchCounter = batchSize; *warp = *sharedTemp; } } __kernel void renderKernel(__constant StandardInput* input, __global int* g_warpCounter, __global unsigned int* g_frame) { __local int fetchWorkTemp[2]; volatile __local int* fetchWorkTemp0 = &fetchWorkTemp[get_local_id(1)]; __global int* framePtr; // fetch first warp of work int warp, batchCounter; fetchWorkFirst(&warp, &batchCounter, g_warpCounter, (*input).batchSize, fetchWorkTemp0); if (warp * 32 >= (*input).totalWork) return; // terminate before starting at all // main warp loop for (;;) { // ray index int pidx = warp * 32 + get_local_id(0); if (pidx >= (*input).totalWork) return; // terminate individual rays framePtr = (__global U32*)g_frame + pidx; *framePtr = 10000000; //color // fetch more work fetchWorkNext(&warp, &batchCounter, g_warpCounter, (*input).batchSize, fetchWorkTemp0); } }