I tried the Reduction sample from the AMD SDK:
__kernel void reduce(__global uint4* input, __global uint4* output, __local uint4* sdata)
{
    // load shared mem
    unsigned int tid = get_local_id(0);
    unsigned int bid = get_group_id(0);
    unsigned int gid = get_global_id(0);
    unsigned int localSize = get_local_size(0);
    sdata[tid] = input[gid];
    // wait until every work-item in the group has stored its element
    barrier(CLK_LOCAL_MEM_FENCE);

    // do reduction in shared mem
    for(unsigned int s = localSize / 2; s > 0; s >>= 1)
    {
        if(tid < s)
        {
            sdata[tid] += sdata[tid + s];
        }
        // all partial sums of this step must be visible before the next step
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // write result for this block to global mem
    if(tid == 0) output[bid] = sdata[0];
}
When I run it on the CPU, it is slower than on the GPU, which is expected, but it is also slower than a serial CPU version that I wrote myself. The vector being reduced has 65536 elements.
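To make the comparison concrete, here is a minimal host-side sketch of the two approaches in plain C. It assumes the serial version is a single accumulation loop over scalar `uint32_t` values (the actual serial code is not shown in this post), and it emulates the kernel's per-work-group tree reduction on one thread. The names `reduce_serial` and `reduce_tree` are hypothetical and exist only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed serial baseline: one pass over the data with one accumulator. */
uint32_t reduce_serial(const uint32_t *data, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}

/* Single-threaded emulation of the kernel's tree reduction for one
 * work-group. `scratch` plays the role of sdata and is overwritten.
 * local_size is assumed to be a power of two and at least 1. */
uint32_t reduce_tree(uint32_t *scratch, size_t local_size)
{
    for (size_t s = local_size / 2; s > 0; s >>= 1) {
        /* Each iteration of this inner loop is one work-item with tid < s. */
        for (size_t tid = 0; tid < s; ++tid)
            scratch[tid] += scratch[tid + s];
        /* The end of the inner loop corresponds to the barrier in the kernel. */
    }
    return scratch[0];
}
```

Both functions perform the same number of additions for one group (`local_size - 1`), but the tree version spreads them over log2(`local_size`) steps, each of which ends in a synchronization point in the kernel. The serial loop has no such synchronization points.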