2 Replies Latest reply on Sep 25, 2013 1:49 AM by himanshu.gautam

    mem_fence() needed for one wavefront?

    pdsmith

      My understanding is that the following reduction does not require memory fences as long as the workgroup size equals one wavefront (64 work-items on a 7950).

      However, the following gives incorrect results unless I add a mem_fence(CLK_LOCAL_MEM_FENCE) after each write. I am using __attribute__((reqd_work_group_size(64,1,1))) in my kernel.

      I have successfully used this code before. Either something has changed, or I have found a bug in my own code.

       

        // Reduction min (tree reduction within a single wavefront)
        dt_min_local[loc_mempos] = fmin( dt_min_local[loc_mempos] , dt_min_local[loc_mempos+32] );
        dt_min_local[loc_mempos] = fmin( dt_min_local[loc_mempos] , dt_min_local[loc_mempos+16] );
        dt_min_local[loc_mempos] = fmin( dt_min_local[loc_mempos] , dt_min_local[loc_mempos+8 ] );
        dt_min_local[loc_mempos] = fmin( dt_min_local[loc_mempos] , dt_min_local[loc_mempos+4 ] );
        dt_min_local[loc_mempos] = fmin( dt_min_local[loc_mempos] , dt_min_local[loc_mempos+2 ] );

        // Write the final result to global memory
        if (loc_mempos == 0)
          {
            d_min_max_workgroup[get_group_id(0)] = fmin( dt_min_local[loc_mempos] , dt_min_local[loc_mempos+1 ] );
          }

       

      Info: Fedora 19, Catalyst 13.8, HD 7950.

        • Re: mem_fence() needed for one wavefront?
          himanshu.gautam

          As I understand it, even though you set the workgroup size to one wavefront (64), only a quarter-wavefront (16 work-items) actually executes at once, so a mem_fence is needed to synchronize between those quarter-wavefronts. That may be why you get incorrect answers without the mem_fence.

           

          Please share if you find any other explanation.
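For anyone landing here with the same problem, a sketch of the fenced variant being discussed (buffer name, size, and indexing follow the original post and are assumptions about the surrounding kernel; `volatile` on the local buffer is a separate, commonly used precaution that stops the compiler from caching partial results in registers during fence-free wavefront-synchronous code):

```c
// OpenCL kernel fragment -- illustration only, not a verified fix.
volatile __local float dt_min_local[64];   // size assumed from the 64-item workgroup

dt_min_local[loc_mempos] = fmin( dt_min_local[loc_mempos] , dt_min_local[loc_mempos+32] );
mem_fence(CLK_LOCAL_MEM_FENCE);
dt_min_local[loc_mempos] = fmin( dt_min_local[loc_mempos] , dt_min_local[loc_mempos+16] );
mem_fence(CLK_LOCAL_MEM_FENCE);
// ... same pattern for strides 8, 4, 2 ...

if (loc_mempos == 0)
  {
    d_min_max_workgroup[get_group_id(0)] = fmin( dt_min_local[0] , dt_min_local[1] );
  }
```

Note that mem_fence only orders memory operations within a single work-item; if the hardware really does split execution across quarter-wavefronts, barrier(CLK_LOCAL_MEM_FENCE) is the portable way to synchronize the whole workgroup.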