5 Replies Latest reply on Jun 14, 2011 6:52 AM by spectral

    Reduction code is slower on CPU (OpenCL) than serial CPU code

    erman_amd

      Hi,

       

      I tried the Reduction sample from the AMD SDK:

       

      __kernel
      void reduce(__global uint4* input, __global uint4* output, __local uint4* sdata)
      {
          // load shared mem
          unsigned int tid = get_local_id(0);
          unsigned int bid = get_group_id(0);
          unsigned int gid = get_global_id(0);

          unsigned int localSize = get_local_size(0);
          sdata[tid] = input[gid];
          barrier(CLK_LOCAL_MEM_FENCE);

          // do reduction in shared mem
          for(unsigned int s = localSize / 2; s > 0; s >>= 1)
          {
              if(tid < s)
              {
                  sdata[tid] += sdata[tid + s];
              }
              barrier(CLK_LOCAL_MEM_FENCE);
          }

          // write result for this block to global mem
          if(tid == 0) output[bid] = sdata[0];
      }
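
      To make clear what each work-group computes, here is a rough host-side sketch in plain C that emulates one work-group of the kernel above. The names uint4_host, MAX_LOCAL_SIZE and reduce_group_host are made up for this illustration and are not part of the SDK sample; the sketch also assumes the local size is a power of two, as the halving loop in the kernel implies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Four 32-bit lanes standing in for OpenCL's uint4 (hypothetical helper type). */
typedef struct { uint32_t s[4]; } uint4_host;

/* Assumed upper bound on the work-group size for this sketch. */
enum { MAX_LOCAL_SIZE = 1024 };

/* Emulates one work-group: copy local_size elements into scratch storage,
 * then halve the active range each step, adding element tid + s into
 * element tid. local_size must be a power of two. */
static uint4_host reduce_group_host(const uint4_host *input, size_t local_size)
{
    uint4_host sdata[MAX_LOCAL_SIZE];

    assert(input != NULL);
    assert(local_size > 0 && local_size <= MAX_LOCAL_SIZE);
    assert((local_size & (local_size - 1)) == 0);

    /* Corresponds to sdata[tid] = input[gid] for every work-item. */
    memcpy(sdata, input, local_size * sizeof sdata[0]);

    for (size_t s = local_size / 2; s > 0; s >>= 1) {
        /* Each pass of this inner loop plays the role of one work-item with tid < s. */
        for (size_t tid = 0; tid < s; ++tid) {
            for (int lane = 0; lane < 4; ++lane) {
                sdata[tid].s[lane] += sdata[tid + s].s[lane];
            }
        }
        /* The kernel has barrier(CLK_LOCAL_MEM_FENCE) at this point. */
    }

    /* Corresponds to output[bid] = sdata[0]. */
    return sdata[0];
}
```

      Each group only produces one partial uint4 sum, so the partial results in output (and their four lanes) still have to be added up afterwards to get the final value.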



       

      When I run it on the CPU device, it is slower than on the GPU (which is OK), but it is also slower than a serial CPU version that I wrote myself. The vector to reduce has 65536 elements.
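
      For reference, a serial baseline of this kind could look like the sketch below. This is only an assumption about its general shape, not the exact code used for the comparison; serial_reduce is a hypothetical name, and the input is treated as a flat array of 32-bit unsigned values whose sum wraps modulo 2^32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sums n unsigned 32-bit values in one pass over the array.
 * Unsigned overflow wraps modulo 2^32, as it does in the kernel. */
static uint32_t serial_reduce(const uint32_t *data, size_t n)
{
    uint32_t sum = 0;

    assert(data != NULL || n == 0);

    for (size_t i = 0; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}
```

      For 65536 elements this is a single loop of 65536 additions with no barriers and no kernel launch.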