
noah_r
Journeyman III

What is the cost of a barrier() operation?

In my OpenCL kernel, I need to pass values between work items in the same workgroup many times, equivalent to the following:

local float *local_array = /* a local memory array */;
float a, b;
int wid = get_local_id(0);
int wgsize = get_local_size(0);

for( int i = 0; i < 12; ++i )
{
    a = /* do some math */;
    local_array[wid] = a;

    barrier(CLK_LOCAL_MEM_FENCE);   // How costly is this?

    b = local_array[(wid+1) % wgsize];  // neighbor's array element

    // then work with a and b
}

Can I expect this barrier to be eliminated by the compiler if my workgroup size is smaller than or equal to the hardware thread size?

I would like this kernel to work well on both CPU and GPU devices.  I've read elsewhere that barrier() on CPU results in a lightweight context switch every time.  How can I avoid that?

In my algorithm, there is a trade-off between the number of barriers needed and the size of the local memory array: fewer barriers mean a larger local memory footprint per work item.  Does that help guide my decision?

By the way, I've spent too much time determining whether mem_fence() is an acceptable substitute for barrier() in this case, and have concluded NO, it is not.  Googling finds pages and threads arguing both ways.  Reading the OpenCL spec, I've decided mem_fence() has nothing to do with synchronization between work items.  But please comment if I'm wrong.
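To spell out my reading of the spec against my snippet above: mem_fence() only orders the memory operations of the work item that calls it; it does not make any other work item wait, so the neighbor read can still race with the neighbor's write.

local_array[wid] = a;
mem_fence(CLK_LOCAL_MEM_FENCE);      // orders *my* accesses only, no rendezvous
b = local_array[(wid+1) % wgsize];   // neighbor may not have stored its value yet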

10 Replies
Wenju
Elite

Hi noah_r,

I think the barrier will not be eliminated by the compiler. The barrier should also be used only where it is needed, because performance will suffer if you use it too much. mem_fence is not a synchronization function.


I think the barriers are necessary in the case shown above. The work-items must finish writing into the local buffer before reading from it.


I do need a method like barrier() to ensure local memory consistency between work items, as you suggest. I tried to make my example as simple as possible.  This sort of sharing between work items is required by my numerical methods in computational fluid dynamics: the flux (amount of fluid) flowing between cells in the domain must be communicated to the neighboring work items that are advancing the neighboring cells' solutions.


Thank you for confirming mem_fence is not a synchronization function.

vladant
Journeyman III

I think you should write a separate CPU-specific kernel that simply serializes these operations so that no barrier is needed. Forget about the same kernel working efficiently on both GPU and CPU.


The idea of a separate CPU and GPU kernel is certainly something to explore, despite the unpleasantness.

Even in a CPU-only version, though, I would want the workgroup size to be greater than 1.  I understand that I should specify the workgroup size as some multiple of the CPU vector width in order to achieve good numerical performance.  Any workgroup size greater than 1 requires local memory sharing in the numerical method.

I work in computational fluid dynamics, so translating your advice into a 'serial' CPU kernel would in practice mean writing a kernel that computes multiple fluid cells (in series) per work item.  There is a trade-off: a smaller workgroup size means more redundant global memory reads, and also a higher memory/register footprint per physical cell being calculated.
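Roughly, I picture the CPU kernel looking something like this (just a sketch with placeholder names; CELLS_PER_ITEM would be a compile-time constant and the flux expression is a stand-in for the real scheme):

// Each work item advances CELLS_PER_ITEM cells serially, so the fluxes
// between those cells stay in private variables and no barrier is needed.
__kernel void update_cells_serial(__global const float *u_old,
                                  __global float *u_new,
                                  const int num_cells)
{
    int first = get_global_id(0) * CELLS_PER_ITEM;

    for (int c = first; c < first + CELLS_PER_ITEM && c < num_cells; ++c)
    {
        // redundant global reads at the block edges are the price paid
        // for not exchanging data through local memory
        float left  = (c > 0)             ? u_old[c - 1] : u_old[c];
        float mid   = u_old[c];
        float right = (c < num_cells - 1) ? u_old[c + 1] : u_old[c];

        float flux_l = 0.5f * (mid - left);    // stand-in for the real flux
        float flux_r = 0.5f * (right - mid);

        u_new[c] = mid + flux_r - flux_l;
    }
}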

My original question still stands, though: I'm looking for others' experience and insight into the practical cost of the barrier() call on GPU and CPU in relation to workgroup size.


Well, I can only guess: on a GPU, each compute unit (CU) has one LDS and many processing elements (PEs); on a CPU, each core is a compute unit with one LDS and one processing element (maybe 2 if hyper-threading is enabled). On the GPU you have to synchronize access to local memory with the other PEs, but on the CPU you do not need to, because you have only one PE. So when OpenCL encounters a barrier on the CPU, it just saves the state of the current work-item and switches to another one.

wi1 - work,work,work - barrier-stall.........................................work - end

wi2 - stall...................................work,work,work - barrier-stall...............work - end

Switching can be done quickly, but the most efficient approach is not to use a barrier at all, especially if you have a lightweight kernel. Local memory on the CPU is just ordinary cached memory, and you can share data between CUs (on both CPU and GPU) only via global memory (I guess that is where memory fences become handy).

P.S. I am not sure whether a barrier on global memory is possible too, but if it is, on the CPU it must be a nasty thing like a spinlock. I do not know about the GPU.


Why would you want a workgroup size greater than 1? You should probably be working with more than one node per work item in both your CPU code and GPU code anyway, unless LDS is a limitation. If you do that and write your kernel sensibly you could just parameterise it. So, if you make WORKGROUP_SIZE a compile-time parameter that you change based on the device, you could do:

for( int i = 0; i < NUM_ELEMENTS_PER_WAVE; i += WORKGROUP_SIZE ) {
    lds[get_local_id(0)] = memory[i + get_local_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
    // exchange data across the workgroup
}

If WORKGROUP_SIZE is 1 you could make LDS private instead and just let the compiler run that through a register and let it drop the barrier. All your memory operations can then scale based on the workgroup size. The workgroup size should generally be the same as the number of work items that map onto the hardware thread. On the GPU we map a work item per vector lane, so 64 per thread, where the wavefront is a thread. On the CPU we don't currently map onto the SSE vector, so it's a single work item per thread. Any more than a single work item per thread on the CPU is likely to give poor performance. The only real reason for supporting it in the runtime is correctness portability.
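To make that concrete, here is a rough sketch of what I mean (names and loop bounds are placeholders; assume WORKGROUP_SIZE is supplied at build time, e.g. -D WORKGROUP_SIZE=64 for the GPU and -D WORKGROUP_SIZE=1 for the CPU):

__attribute__((reqd_work_group_size(WORKGROUP_SIZE, 1, 1)))
__kernel void exchange_sketch(__global const float *in,
                              __global float *out,
                              const int num_elements)
{
    int lid = get_local_id(0);

#if WORKGROUP_SIZE > 1
    __local float tile[WORKGROUP_SIZE];   // shared across the group
#else
    float tile[1];                        // private; lives in a register
#endif

    for (int i = get_group_id(0) * WORKGROUP_SIZE;
         i < num_elements;
         i += get_num_groups(0) * WORKGROUP_SIZE)
    {
        tile[lid] = in[i + lid];

#if WORKGROUP_SIZE > 1
        barrier(CLK_LOCAL_MEM_FENCE);     // only needed when the tile is really shared
#endif

        float neighbour = tile[(lid + 1) % WORKGROUP_SIZE];

        // stand-in for the real work on the pair of values
        out[i + lid] = tile[lid] + neighbour;

#if WORKGROUP_SIZE > 1
        barrier(CLK_LOCAL_MEM_FENCE);     // keep the tile safe until everyone has read it
#endif
    }
}

With WORKGROUP_SIZE defined as 1 the preprocessor strips both barriers and the tile collapses into a register, which is the behaviour you want on the CPU.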

I write no OpenCL code that uses a workgroup size that isn't either 1, 32 or 64 currently, and it switches depending on the target architecture. Most algorithms I've found are relatively easy to parameterise to match.


Thank you for the valuable insight.  I think this is as specific as we can be at this point without doing some performance testing with MY algorithm.

You seemed to imply that even if the compiler can readily see that a work item only accesses its own 'exclusive' local array element, the barrier() is not optimized away.  That's too bad.  To be more specific, your example had

lds[get_local_id(0)]

and you suggested that if the workgroup size is 1, then I should convert that to private memory.  Maybe some day the compiler will be able to catch this.  At the same time, the language is limiting in that barrier() provides not just memory synchronization but work-item synchronization as well.

Other CPU implementations do claim to map to the SSE vector, so your suggestion about using both NUM_ELEMENTS_PER_WAVE and WORKGROUP_SIZE compile-time constants is a good one.


notzed
Challenger

barrier() will be removed if you use the reqd_work_group_size() attribute and the size is such that the hardware can support it directly, i.e. a work-item count of <= 64.
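For reference, the attribute sits on the kernel definition itself; a trivial illustration (the kernel and buffer names are just examples):

__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void neighbour_exchange(__global float *data)
{
    __local float tile[64];
    int lid = get_local_id(0);

    tile[lid] = data[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);   // with the group size pinned to one wavefront this is a candidate for removal
    data[get_global_id(0)] = tile[(lid + 1) % 64];
}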

On a CPU this will be converted into groups of loops, so it may or may not be much of a performance hit (or even a win).  But in many cases trying to make the code the same for both classes of device will be sub-optimal, and you need different code for each.

Barriers are like any other slightly costly operation: the cost only becomes prohibitive if you're not doing much else (e.g. a parallel sum), but if you're doing a lot of other ops the cost is negligible.  They are, after all, a hardware feature.
