Warning:W000:Barrier caused limited groupsize on Mobility HD4670 with group size of 64

Discussion created by lantonov on Jun 10, 2011
Latest reply on Jun 10, 2011 by maximmoroz

I have a kernel that uses barriers, and I am trying to use a work-group size equal to the wavefront size. As I understand it, the wavefront size is 64 for all HD 4xxx GPUs.

Removing the __attribute__((reqd_work_group_size(...))) does not help; the only way to get it to work is to remove the barrier() calls. I have seen other posts about a similar problem, but they seem to indicate that a size of 64 should work. In my case, 32 works but 64 does not. Any idea what's going on? I guess the compiler is still not fixed; I see messages about this from many months ago.

Yes, I know this GPU does not really have local memory. Yes, I know that technically I don't have to use a barrier with a work-group size that fits in a wavefront.

kernel __attribute__((reqd_work_group_size(GROUP_SIZE, 1, 1)))
void sum_local_phase1(const global float* t_margin_Iq, // matrix of margin sums for each body, per q (aligned n(bodies), n(q))
                      global float* buf_sum)           // matrix to pass to phase 2 (n(q), aligned n(bodies) / GROUP_SIZE)
{
    int idx = get_global_id(0);            // body index
    int idy = get_global_id(1);            // q index
    int n_bodies_aln = get_global_size(0); // should be the aligned body count
    int n_q = get_global_size(1);          // should be the q count
    int gidx = get_group_id(0);            // x index of the workgroup
    int lid = get_local_id(0);             // body index within the workgroup

    local float scratch[GROUP_SIZE];
    scratch[lid] = t_margin_Iq[idy * n_bodies_aln + idx];
    barrier(CLK_LOCAL_MEM_FENCE);

    // do reduction for the workgroup in shared mem
    for (int s = GROUP_SIZE / 2; s > 0; s >>= 1)
    {
        if (lid < s)
        {
            scratch[lid] += scratch[lid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // write the result for this workgroup to the internal buffer
    if (lid == 0)
        buf_sum[gidx * n_q + idy] = scratch[0];
}