7 Replies Latest reply on Dec 22, 2014 4:15 AM by dipak

    Evaluating workgroup reductions in OpenCL 2.0

    ekondis

  I was comparing shared memory reductions with reductions based on the new workgroup reduction functions available in OpenCL 2.0. Although I expected the workgroup function to produce even more optimized code, this does not seem to be the case. On a Bonaire GPU I get about 86.29 GB/sec of device memory bandwidth with the shared memory version and just 62.35 GB/sec with the workgroup reduction function. Both kernels are shown below.

       

      #define GRANULARITY 128
      #define WAVEFRONT_SIZE 64
      
      __kernel void ReductionShMem(__local volatile int *localBuffer, __global const int *data, __global int *result, const unsigned int n) {
        // Get our global thread ID
        int id = get_global_id(0);
        const int lid = get_local_id(0);
        const int group_size = get_local_size(0);
      
        // each work-item first accumulates GRANULARITY strided elements privately
        int tmp = data[id];
        #pragma unroll
        for(int i=1; i<GRANULARITY; i++){
          id += get_global_size(0);
          tmp = tmp + data[id];
        }
        localBuffer[lid] = tmp;
        barrier(CLK_LOCAL_MEM_FENCE);
      
        // local memory tree reduction down to a single wavefront
        int i = group_size/2;
        for(; i>WAVEFRONT_SIZE; i >>= 1) {
          if(lid < i)
            localBuffer[lid] = tmp = tmp + localBuffer[lid + i];
          barrier(CLK_LOCAL_MEM_FENCE);
        }
        // wavefront reduction; no barriers needed thanks to lock-step
        // execution within a wavefront and the volatile local buffer
        for(; i>0; i >>= 1) {
          if(lid < i)
            localBuffer[lid] = tmp = tmp + localBuffer[lid + i];
        }
        // atomic reduction into global memory
        if(lid==0){
          atomic_add((__global int*)result, tmp);
        }
      }
      
      __kernel void ReductionWrkGrp(__local volatile int *localBuffer, __global const int *data, __global int *result, const unsigned int n) {
        // Get our global thread ID
        int id = get_global_id(0);
        const int lid = get_local_id(0);
        const int group_size = get_local_size(0);
      
        int tmp = data[id];
        #pragma unroll
        for(int i=1; i<GRANULARITY; i++){
          id += get_global_size(0);
          tmp = tmp + data[id];
        }
      
        // workgroup reduction (introduced in OpenCL 2.0)
        int res = work_group_reduce_add(tmp);
      
        // atomic reduction into global memory
        if(lid==0){
          atomic_add((__global int*)result, res);
        }
      }
      
      

       

      I believe the OpenCL 2.0 driver has not yet been optimized for the new workgroup functions.

        • Re: Evaluating workgroup reductions in OpenCL 2.0
          dipak

          AMD has released its OpenCL 2.0 compatible driver, but not the SDK, so the driver's performance may not be optimized yet. As per this thread, OpenCL™ 2.0 is here!:

           


          AMD has released its first fully-functional OpenCL 2.0 driver aimed at early adopter developers.  We’re not broadly promoting this driver because we have a performance-optimized driver coming out shortly and we suggest you wait for that version.  We plan to release a new SDK when the updated driver becomes available to help developers come up to speed.  That said, we welcome any feedback from the community who are interested in test-driving this version of the driver.

           

          As for the SVM capabilities, this driver fully supports the core OpenCL 2.0 features - including coarse-grained SVM.  AMD has not yet publicly disclosed plans for OpenCL 2.0 optional features - stay tuned for further details.

          The newly released SDK 2.9-1 doesn't support OpenCL 2.0, so I'm a little curious: which package did you use to build the OpenCL 2.0 program?

            • Re: Re: Evaluating workgroup reductions in OpenCL 2.0
              ekondis

              The newly released SDK 2.9-1 doesn't support OpenCL 2.0, so I'm a little curious: which package did you use to build the OpenCL 2.0 program?

               

              AFAIK, the majority of the OpenCL functionality is included in the GPU driver, so I didn't have to wait for the SDK. I didn't use any new host functions after all. All I had to do was add the "-cl-std=CL2.0" option to the options argument of clBuildProgram in order to instruct the compiler to build the kernel as OpenCL 2.0.