In my OpenCL kernel, I need to pass values between work items in the same workgroup many times, roughly equivalent to the following:
local float *local_array = ...; // a local memory array
float a, b;
int wid = get_local_id(0);
int wgsize = get_local_size(0);
for (int i = 0; i < 12; ++i)
{
    a = ...; // do some math
    local_array[wid] = a;
    barrier(CLK_LOCAL_MEM_FENCE); // How costly is this?
    b = local_array[(wid + 1) % wgsize]; // neighbor's array element
    // then work with a and b
}
Can I expect the compiler to eliminate this barrier if my workgroup size is smaller than or equal to the hardware SIMD width (warp/wavefront size)?
I would like this kernel to work well on both CPU and GPU devices. I've read elsewhere that barrier() on a CPU results in a lightweight context switch every time it is hit. How can I avoid that?
In my algorithm, there is a trade-off between the number of barriers needed and the size of the local memory array: fewer barriers require more local memory per work item. Does that help guide my decision?
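To make the trade-off concrete, here is a sketch of the batched variant I have in mind: several iterations write into a strided region of the local array, so one barrier covers a whole batch. The UNROLL factor and indexing scheme are my own assumptions, and this only works if iteration i+1's math does not depend on the neighbor value b from iteration i:

#define UNROLL 4
// local_array must now hold UNROLL * wgsize floats instead of wgsize
for (int i = 0; i < 12; i += UNROLL)
{
    for (int j = 0; j < UNROLL; ++j)
    {
        a = ...; // do some math for iteration i + j
        local_array[j * wgsize + wid] = a;
    }
    barrier(CLK_LOCAL_MEM_FENCE); // one barrier per UNROLL iterations
    for (int j = 0; j < UNROLL; ++j)
    {
        b = local_array[j * wgsize + (wid + 1) % wgsize];
        // then work with a and b for iteration i + j
    }
}

So the question is whether paying 4x the local memory for 1/4 the barriers is generally a win.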
By the way, I've spent too much time determining whether mem_fence() is an acceptable substitute for barrier() in this case, and I've concluded that it is not. Googling turns up pages and threads arguing both ways. Reading the OpenCL spec, I've decided mem_fence() has nothing to do with synchronization between work items; it only orders a single work item's own memory operations. But please comment if I'm wrong.
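To illustrate my reading of the spec with the kernel above (this is my interpretation, so please correct it if it's wrong): mem_fence() orders the calling work item's own loads and stores, but it makes nobody wait, so the neighbor read can still race ahead of the neighbor's write:

local_array[wid] = a;
mem_fence(CLK_LOCAL_MEM_FENCE);  // orders *my* memory ops only; no one waits
b = local_array[(wid + 1) % wgsize]; // neighbor may not have written yet

With barrier(CLK_LOCAL_MEM_FENCE) in that spot, every work item is guaranteed to have completed its store before any work item proceeds to the read, which is exactly what the neighbor exchange needs.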