Reduction code is slower on CPU (OpenCL) than CPU serial code

Discussion created by erman_amd on Jun 12, 2011
Latest reply on Jun 14, 2011 by spectral



I tried the Reduction sample from the AMD SDK.




__kernel void reduce(__global uint4* input, __global uint4* output, __local uint4* sdata)
{
    // load this work-item's element into local (shared) memory
    unsigned int tid = get_local_id(0);
    unsigned int bid = get_group_id(0);
    unsigned int gid = get_global_id(0);
    unsigned int localSize = get_local_size(0);

    sdata[tid] = input[gid];
    barrier(CLK_LOCAL_MEM_FENCE);  // all loads must finish before reducing

    // do reduction in shared mem
    for(unsigned int s = localSize / 2; s > 0; s >>= 1)
    {
        if(tid < s)
        {
            sdata[tid] += sdata[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);  // finish this pass before the next one
    }

    // write result for this block to global mem
    if(tid == 0) output[bid] = sdata[0];
}
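
To make clear what the kernel above leaves for the host to do, here is a rough plain-C emulation of it. This is only a sketch under a few assumptions: it uses scalar `uint32_t` values instead of `uint4`, it requires a power-of-two work-group size, and the names `reduce_group`, `reduce_all` and `MAX_LOCAL_SIZE` are made up for illustration and are not part of the SDK sample. It shows that each work-group yields only one partial sum in `output[bid]`, so a further pass over the partial sums is still needed to get the final result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_LOCAL_SIZE 1024  /* hypothetical upper bound for this sketch */

/* Emulates one work-group: tree-reduces local_size values starting at
   input[group * local_size] and returns that group's partial sum. */
static uint32_t reduce_group(const uint32_t *input, size_t group, size_t local_size)
{
    uint32_t sdata[MAX_LOCAL_SIZE];  /* stands in for the __local buffer */

    assert(local_size > 0 && local_size <= MAX_LOCAL_SIZE);
    assert((local_size & (local_size - 1)) == 0);  /* power of two assumed */

    memcpy(sdata, input + group * local_size, local_size * sizeof sdata[0]);

    for (size_t s = local_size / 2; s > 0; s >>= 1) {
        /* On the device the work-items of a group do these additions in
           parallel; here they simply run one after another. */
        for (size_t tid = 0; tid < s; ++tid)
            sdata[tid] += sdata[tid + s];
    }
    return sdata[0];  /* what the kernel stores in output[bid] */
}

/* Final pass: adds up the partial sums of all groups.
   n is assumed to be a multiple of local_size. */
static uint32_t reduce_all(const uint32_t *input, size_t n, size_t local_size)
{
    uint32_t total = 0;

    assert(local_size > 0 && n % local_size == 0);
    for (size_t g = 0; g < n / local_size; ++g)
        total += reduce_group(input, g, local_size);
    return total;
}
```

With 65536 elements and, say, a work-group size of 256, `reduce_all` would loop over 256 partial sums.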



When I run it on the CPU, it is slower than on the GPU (that's OK), but it is also slower than a serial CPU version that I wrote myself. The vector to reduce has 65536 elements.
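
For reference, a serial baseline for this kind of comparison could look roughly like the sketch below. It is not the actual serial code from this post (which is not shown); the function name `serial_reduce` and the choice of scalar `uint32_t` elements are assumptions made only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical serial baseline: a single pass over the input.
   The sum wraps modulo 2^32, like unsigned addition in the kernel. */
static uint32_t serial_reduce(const uint32_t *data, size_t n)
{
    uint32_t sum = 0;

    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}
```

For the size in question this would be called as `serial_reduce(v, 65536)`, which is a single loop over 256 KB of data.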