7 Replies Latest reply on Dec 22, 2014 4:15 AM by dipak

    Evaluating workgroup reductions in OpenCL 2.0

    ekondis

  I was comparing shared memory reductions with reductions based on the new workgroup reduction functions available in OpenCL 2.0. Although I expected the workgroup function to produce even more optimized code, this does not seem to be the case. On a Bonaire GPU I get about 86.29 GB/sec of device memory bandwidth with the shared memory version and just 62.35 GB/sec with the workgroup reduction function. Both kernels are shown below.

       

      #define GRANULARITY 128
      #define WAVEFRONT_SIZE 64
      
      __kernel void ReductionShMem(__local volatile int *localBuffer, __global const int *data, __global int *result, const unsigned int n) {
        // Get our global thread ID
        int id = get_global_id(0);
        const int lid = get_local_id(0);
        const int group_size = get_local_size(0);
      
        // each work-item first accumulates GRANULARITY strided elements privately
        int tmp = data[id];
        #pragma unroll
        for(int i=1; i<GRANULARITY; i++){
          id += get_global_size(0);
          tmp = tmp + data[id];
        }
        localBuffer[lid] = tmp;
        barrier(CLK_LOCAL_MEM_FENCE);
      
        // local memory tree reduction down to a single wavefront
        int i = group_size/2;
        for(; i>WAVEFRONT_SIZE; i >>= 1) {
          if(lid < i)
            localBuffer[lid] = tmp = tmp + localBuffer[lid + i];
          barrier(CLK_LOCAL_MEM_FENCE);
        }
        // wavefront reduction; no barriers needed thanks to lock-step
        // execution within a wavefront and the volatile local buffer
        for(; i>0; i >>= 1) {
          if(lid < i)
            localBuffer[lid] = tmp = tmp + localBuffer[lid + i];
        }
        // atomic reduction into global memory
        if(lid==0){
          atomic_add((__global int*)result, tmp);
        }
      }
      
      __kernel void ReductionWrkGrp(__local volatile int *localBuffer, __global const int *data, __global int *result, const unsigned int n) {
        // Get our global thread ID
        int id = get_global_id(0);
        const int lid = get_local_id(0);
        const int group_size = get_local_size(0);
      
        int tmp = data[id];
        #pragma unroll
        for(int i=1; i<GRANULARITY; i++){
          id += get_global_size(0);
          tmp = tmp + data[id];
        }
      
        // workgroup reduction (introduced in OpenCL 2.0)
        int res = work_group_reduce_add(tmp);
      
        // atomic reduction into global memory
        if(lid==0){
          atomic_add((__global int*)result, res);
        }
      }
      
      

       

      I believe the OpenCL 2.0 driver has not yet been optimized for the new workgroup functions.

        • Re: Evaluating workgroup reductions in OpenCL 2.0
          dipak

          AMD has released its OpenCL 2.0 compatible driver, but not the SDK, so the driver's performance may not be optimized yet. As per this thread, OpenCL™ 2.0 is here!:

           


          AMD has released its first fully-functional OpenCL 2.0 driver aimed at early adopter developers.  We’re not broadly promoting this driver because we have a performance-optimized driver coming out shortly and we suggest you wait for that version.  We plan to release a new SDK when the updated driver becomes available to help developers come up to speed.  That said, we welcome any feedback from the community who are interested in test-driving this version of the driver.

           

          As for the SVM capabilities, this driver fully supports the core OpenCL 2.0 features - including coarse-grained SVM.  AMD has not yet publicly disclosed plans for OpenCL 2.0 optional features - stay tuned for further details.

          The newly released SDK 2.9-1 doesn't support OpenCL 2.0, so I'm a little curious: which package did you use to build the OpenCL 2.0 program?

            • Re: Re: Evaluating workgroup reductions in OpenCL 2.0
              ekondis

              The newly released SDK 2.9-1 doesn't support OpenCL 2.0, so I'm a little curious: which package did you use to build the OpenCL 2.0 program?

               

              AFAIK, the majority of the OpenCL functionality is included in the GPU driver, so I didn't have to wait for the SDK. I didn't use any new host functions after all. All I had to do was add the "-cl-std=CL2.0" option to the options argument of clBuildProgram in order to instruct the compiler to build the kernel as OpenCL 2.0.