Archives Discussions

lantonov · ‎06-10-2011

I have kernel that uses barriers and I try to use a group size equal to the wavefront size. As I understand it, the wavefront size is 64 for all HD 4x000 GPUs.

It does not help if remove the __attribute__ - the only way to get it to work is to remove the barrier() calls. I have seen some other posts about a similar problem, but they seem to indicate that a size of 64 should work. In my case, 32 works but not 64. Any idea what's going on? I guess the compiler is still not fixed - I see messages about this from many months ago.

Yes, I know this GPU does not really have local memory. Yes, I know that technically I don;t have to use a barrier with a workgroup size that fits in a wavefront.

kernel __attribute__((reqd_work_group_size(GROUP_SIZE, 1, 1))) void sum_local_phase1(const global float* t_margin_Iq, // matrix of margin sums for each body, per q (aligned n(bodies), n(q)) global float* buf_sum) // matrix to pass to phase 2 (n(q), aligned n(bodies) / GROUP_SIZE) { int idx = get_global_id(0); // body index int idy = get_global_id(1); // q index int n_bodies_aln = get_global_size(0); // should be the aligned body count int n_q = get_global_size(1); // should be the q count int gidx = get_group_id(0); // x index of the workgroup int lid = get_local_id(0); // body index within the workgroup local float scratch[GROUP_SIZE]; scratch[lid] = t_margin_Iq[idy * n_bodies_aln + idx]; barrier(CLK_LOCAL_MEM_FENCE); // do reduction for the workgroup in shared mem for (int s = GROUP_SIZE / 2; s > 0; s >>= 1) { if(lid < s) { scratch[lid] += scratch[lid + s]; } barrier(CLK_LOCAL_MEM_FENCE); } // write the result for this workgroup to the internal buffer if (lid == 0) buf_sum[gidx * n_q + idy] = scratch[0]; }

maximmoroz · ‎06-10-2011

Well, most probably it means that your assumption (the wavefront size is 64 for all HD 4x000 GPUs) is wrong. And wavefront size for mobility HD4650 is actually 32.

lantonov · ‎06-10-2011

Are you guessing, or do you actually know?

Anyway, my assumption about the wavefront size is only relevant to this if the compiler somehow ties together the existance of barriers with both the wavefront size and group size, which is not normal IMO. There is nothing like that in the standard and it does not fail like this on nvidia gpus. The wavefront size is supposed to be opaque anyway, which is why we are supposed to use barriers.

So my question still stands - why is this failing.

maximmoroz · ‎06-10-2011

I am guessing.

I know that some low-end GPUs in 5xxx family have reduced (to 32) wavefront size. It is stated clearly in Programming Guide, section "4.11.4 Optimizing for Cedar". You have low-end GPU in 4xxx family and no direct link to the official AMD document which spcifies that all 4xxx devices have 64 wavefront size. Correct? I have good reasons to guess correctly.

My understanding is that AMD have chosen not the best way to resolve issue with barriers in 4xxx. I think it would be more correct just to set max_workgroup_size to the wavefront size (64 or 32) instead of current 256.

lantonov · ‎06-10-2011

You are right about the wavefront size - some googling finally yielded the info - for HD4670 it is 32. I wish there was a central place to look these things up, or even better, have OpenCL extensions for querying this.

maximmoroz · ‎06-10-2011

Originally posted by: lantonovI wish there was a central place to look these things up, or even better, have OpenCL extensions for querying this.

There is a standard way to do it in OpenCL 1.1: Use CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE. I mentioned this approach in this thread.

lantonov · ‎06-10-2011

That's a good tip, thanks.

maximmoroz · ‎06-10-2011

And it works for CPU devices in AMD and Intel OpenCL implementation (4 for AMD one and 16 for Intel one, If I remember correctly).

Archives Discussions

Warning:W000:Barrier caused limited groupsize on Mobility HD4670 with group size of 64