5 Replies Latest reply on Jun 14, 2011 6:52 AM by spectral

    Reduction code is slower on CPU (OpenCL) than the serial CPU version




      I tried the Reduction sample from the AMD SDK.




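      __kernel void
      reduce(__global uint4* input, __global uint4* output, __local uint4* sdata)
      {
          // load one element per work-item into local (shared) memory
          unsigned int tid = get_local_id(0);
          unsigned int bid = get_group_id(0);
          unsigned int gid = get_global_id(0);
          unsigned int localSize = get_local_size(0);

          sdata[tid] = input[gid];
          // wait until every work-item in the group has stored its element
          barrier(CLK_LOCAL_MEM_FENCE);

          // do reduction in shared mem: halve the active range on each pass
          for(unsigned int s = localSize / 2; s > 0; s >>= 1)
          {
              if(tid < s)
              {
                  sdata[tid] += sdata[tid + s];
              }
              // all work-items must reach the barrier, so it stays outside the if
              barrier(CLK_LOCAL_MEM_FENCE);
          }

          // write result for this block to global mem
          if(tid == 0) output[bid] = sdata[0];
      }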



      When I run it on the CPU, it is slower than on the GPU (which is expected), but it is also slower than the serial CPU version, which I wrote myself. The length of the vector to reduce is 65536.
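
      To make the comparison concrete, here is a minimal C sketch of the two approaches. My serial version is not shown above, so serial_sum is only an assumed stand-in for it, and grouped_tree_sum is a hypothetical host-side emulation of what the kernel does per work-group (scalar uint instead of uint4, and no real barriers). Both names are made up for illustration. Note that the kernel only yields one partial sum per work-group, so a final pass over the partial sums is still needed.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain serial sum: an assumed stand-in for the hand-written CPU version. */
uint32_t serial_sum(const uint32_t *data, size_t n)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}

/* Host-side emulation of the kernel's per-group tree reduction.
 * Assumptions: group_size is a power of two and divides n,
 * scratch holds group_size elements (the role of sdata),
 * partial holds n / group_size elements (the role of output). */
void grouped_tree_sum(const uint32_t *data, size_t n, size_t group_size,
                      uint32_t *scratch, uint32_t *partial)
{
    for (size_t g = 0; g < n / group_size; ++g) {
        /* "load shared mem": copy this group's slice into scratch */
        for (size_t t = 0; t < group_size; ++t)
            scratch[t] = data[g * group_size + t];

        /* tree reduction: each pass halves the active range */
        for (size_t s = group_size / 2; s > 0; s >>= 1)
            for (size_t t = 0; t < s; ++t)
                scratch[t] += scratch[t + s];

        /* "write result for this block" */
        partial[g] = scratch[0];
    }
}
```

      Calling serial_sum on the partial array produced by grouped_tree_sum should give the same total as calling serial_sum on the whole input. The grouped version does the same number of additions plus an extra copy into scratch, and in the real kernel every pass of the loop also ends in a barrier, so it is plausible (though not measured here) that this overhead is what shows up on the CPU device.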