AnsweredAssumed Answered

Evaluating workgroup reductions in OpenCL 2.0

Question asked by ekondis on Oct 5, 2014
Latest reply on Dec 22, 2014 by dipak

I was comparing shared memory reductions with reductions based on the new workgroup reduction functions available in OpenCL 2.0.  Though I believed the workgroup function would lead to even more optimized code this does not seem to be the case. On a Bonaire GPU I get about 86.29 GB/sec device memory bandwidth with the shared memory usage and just 62.35 GB/sec with the workgroup reduction function. Both kernels are depicted bellow.

 

#define GRANULARITY 128
#define WAVEFRONT_SIZE 64

__kernel void ReductionShMem(__local volatile int *localBuffer, __global const int *data, __global int *result, const unsigned int n) {
  // Get our global thread ID
  int id = get_global_id(0);
  const int lid = get_local_id(0);
  const int group_size = get_local_size(0);

  int tmp = data[id];
  #pragma unroll
  for(int i=1; i<GRANULARITY; i++){
  id += get_global_size(0);
  tmp = tmp + data[id];
  }
  localBuffer[lid] = tmp;
  barrier(CLK_LOCAL_MEM_FENCE);

  // local memory reduction
  int i = group_size/2;
  for(; i>WAVEFRONT_SIZE; i >>= 1) {
  if(lid < i)
  localBuffer[lid] = tmp = tmp + localBuffer[lid + i];
  barrier(CLK_LOCAL_MEM_FENCE);
  }
  // wavefront reduction
  for(; i>0; i >>= 1) {
  if(lid < i)
  localBuffer[lid] = tmp = tmp + localBuffer[lid + i];
  }
  // atomic reduce in global memory
  if(lid==0){
  atomic_add((__global int*)result, tmp);
  }
}

__kernel void ReductionWrkGrp(__local volatile int *localBuffer, __global const int *data, __global int *result, const unsigned int n) {
  // Get our global thread ID
  int id = get_global_id(0);
  const int lid = get_local_id(0);
  const int group_size = get_local_size(0);

  int tmp = data[id];
  #pragma unroll
  for(int i=1; i<GRANULARITY; i++){
  id += get_global_size(0);
  tmp = tmp + data[id];
  }

  // workgroup reduction (introduced in OpenCL 2.0)
  int res = work_group_reduce_add(tmp);

  // atomic reduce in global memory
  if(lid==0){
  atomic_add((__global int*)result, res);
  }
}

 

I believe the OpenCL 2.0 driver had not been yet optimized on the new workgroup functions.

Outcomes