
pdsmith
Journeyman III

mem_fence() needed for one wavefront?

My understanding is that the following reduction does not require memory fences as long as the workgroup size equals one wavefront (64 work-items on the HD 7950).

However, the code below gives incorrect results unless I add a mem_fence(CLK_LOCAL_MEM_FENCE) after each write. I am using __attribute__((reqd_work_group_size(64,1,1))) on my kernel.

I have successfully used this code before. Either something has changed, or I have found a bug in my own code.

    // Reduction min
    dt_min_local[loc_mempos] = fmin( dt_min_local[loc_mempos], dt_min_local[loc_mempos+32] );
    dt_min_local[loc_mempos] = fmin( dt_min_local[loc_mempos], dt_min_local[loc_mempos+16] );
    dt_min_local[loc_mempos] = fmin( dt_min_local[loc_mempos], dt_min_local[loc_mempos+8] );
    dt_min_local[loc_mempos] = fmin( dt_min_local[loc_mempos], dt_min_local[loc_mempos+4] );
    dt_min_local[loc_mempos] = fmin( dt_min_local[loc_mempos], dt_min_local[loc_mempos+2] );

    // Write the work-group result to global memory
    if (loc_mempos == 0)
    {
        d_min_max_workgroup[get_group_id(0)] = fmin( dt_min_local[loc_mempos], dt_min_local[loc_mempos+1] );
    }
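
For reference, this is the fenced variant that does give me correct results (a sketch; it assumes dt_min_local is a __local float array sized so the loc_mempos+32 reads stay in bounds, and that loc_mempos is get_local_id(0)):

    // Reduction min, with a local-memory fence after each step
    dt_min_local[loc_mempos] = fmin( dt_min_local[loc_mempos], dt_min_local[loc_mempos+32] );
    mem_fence(CLK_LOCAL_MEM_FENCE);
    dt_min_local[loc_mempos] = fmin( dt_min_local[loc_mempos], dt_min_local[loc_mempos+16] );
    mem_fence(CLK_LOCAL_MEM_FENCE);
    dt_min_local[loc_mempos] = fmin( dt_min_local[loc_mempos], dt_min_local[loc_mempos+8] );
    mem_fence(CLK_LOCAL_MEM_FENCE);
    dt_min_local[loc_mempos] = fmin( dt_min_local[loc_mempos], dt_min_local[loc_mempos+4] );
    mem_fence(CLK_LOCAL_MEM_FENCE);
    dt_min_local[loc_mempos] = fmin( dt_min_local[loc_mempos], dt_min_local[loc_mempos+2] );
    mem_fence(CLK_LOCAL_MEM_FENCE);

    // Write the work-group result to global memory
    if (loc_mempos == 0)
    {
        d_min_max_workgroup[get_group_id(0)] = fmin( dt_min_local[loc_mempos], dt_min_local[loc_mempos+1] );
    }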

Info:

Fedora 19, Catalyst 13.8, HD 7950.

himanshu_gautam
Grandmaster

My understanding is that even though you set the workgroup size to one wavefront (64), the hardware only executes a quarter-wavefront (16 work-items) at a time. So a mem_fence is needed to synchronize between them, and that may be why you get incorrect results without it.

Please share if you find any other explanation.
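
For what it's worth, the portable way to write this is with barrier(CLK_LOCAL_MEM_FENCE), which synchronizes execution as well as memory, so correctness does not depend on how the wavefront is scheduled. A minimal sketch (the kernel signature and input buffer dt_in are illustrative; the reduction names mirror your code):

    __kernel __attribute__((reqd_work_group_size(64,1,1)))
    void reduce_min(__global const float *dt_in,
                    __global float *d_min_max_workgroup)
    {
        // One slot per work-item in the 64-wide work-group
        __local float dt_min_local[64];
        const uint loc_mempos = get_local_id(0);

        dt_min_local[loc_mempos] = dt_in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Halve the active range each step; the barrier sits outside the
        // guard so every work-item reaches it.
        for (uint stride = 32; stride > 0; stride >>= 1)
        {
            if (loc_mempos < stride)
                dt_min_local[loc_mempos] = fmin( dt_min_local[loc_mempos],
                                                 dt_min_local[loc_mempos + stride] );
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        // Work-item 0 writes the per-work-group minimum
        if (loc_mempos == 0)
            d_min_max_workgroup[get_group_id(0)] = dt_min_local[0];
    }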
