erman_amd

Reduction code is slower on CPU (OpenCL) than serial CPU code

Discussion created by erman_amd on Jun 12, 2011
Latest reply on Jun 14, 2011 by spectral

Hi,

I tried the Reduction sample from the AMD SDK:

__kernel
void reduce(__global uint4* input, __global uint4* output, __local uint4* sdata)
{
    // load shared mem
    unsigned int tid = get_local_id(0);
    unsigned int bid = get_group_id(0);
    unsigned int gid = get_global_id(0);

    unsigned int localSize = get_local_size(0);
    sdata[tid] = input[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    // do reduction in shared mem
    for(unsigned int s = localSize / 2; s > 0; s >>= 1)
    {
        if(tid < s)
        {
            sdata[tid] += sdata[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // write result for this block to global mem
    if(tid == 0) output[bid] = sdata[0];
}
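To make clear what each work-group computes, the tree reduction in the kernel can be emulated on the host in plain C. This is only a sketch under stated assumptions: it uses scalar `unsigned int` elements instead of `uint4`, it requires a power-of-two local size, and `reduce_group_emulated` and `MAX_LOCAL_SIZE` are hypothetical names that are not part of the AMD SDK sample.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical upper bound on the emulated work-group size. */
#define MAX_LOCAL_SIZE 1024

/* Hypothetical host-side emulation of ONE work-group of the reduce kernel.
 * Assumptions: scalar unsigned int elements (the kernel uses uint4) and a
 * power-of-two local_size no larger than MAX_LOCAL_SIZE. */
unsigned int reduce_group_emulated(const unsigned int *input, size_t local_size)
{
    unsigned int sdata[MAX_LOCAL_SIZE];

    assert(local_size > 0 && local_size <= MAX_LOCAL_SIZE);
    assert((local_size & (local_size - 1)) == 0); /* power of two */

    /* Corresponds to: sdata[tid] = input[gid]; barrier(...); */
    memcpy(sdata, input, local_size * sizeof sdata[0]);

    /* Each pass of the outer loop is one step between two barriers.
     * The inner loop plays the role of the work-items with tid < s. */
    for (size_t s = local_size / 2; s > 0; s >>= 1) {
        for (size_t tid = 0; tid < s; ++tid) {
            sdata[tid] += sdata[tid + s];
        }
    }

    /* Corresponds to: if (tid == 0) output[bid] = sdata[0]; */
    return sdata[0];
}
```

For a local size of 256, the outer loop runs 8 times, so together with the barrier after the initial load each work-group in the kernel passes through 9 barriers.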



 

When I run it on the CPU device, it is slower than on the GPU (which is fine), but it is also slower than a serial CPU version that I wrote myself. The vector to reduce has 65536 elements.
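The serial version is not shown in the post. For comparison, a serial baseline is assumed here to look roughly like the sketch below: a single pass with one accumulator and no synchronization. The name `reduce_serial` and the scalar `unsigned int` element type are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical serial baseline (not the poster's actual code):
 * sums n unsigned integers in one pass with a single accumulator. */
unsigned int reduce_serial(const unsigned int *data, size_t n)
{
    unsigned int sum = 0;

    assert(data != NULL || n == 0);

    for (size_t i = 0; i < n; ++i) {
        sum += data[i]; /* unsigned overflow wraps, as in the kernel */
    }
    return sum;
}
```

With the vector length mentioned above, this would be called as `reduce_serial(data, 65536)`.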
