Archives Discussions

erman_amd · ‎06-12-2011

Hi,

I tried the Reduction sample from the AMD SDK.

__kernel
void reduce(__global uint4* input, __global uint4* output, __local uint4* sdata)
{
    // load shared mem
    unsigned int tid = get_local_id(0);
    unsigned int bid = get_group_id(0);
    unsigned int gid = get_global_id(0);
    unsigned int localSize = get_local_size(0);

    sdata[tid] = input[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    // do reduction in shared mem
    for(unsigned int s = localSize / 2; s > 0; s >>= 1)
    {
        if(tid < s)
        {
            sdata[tid] += sdata[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // write result for this block to global mem
    if(tid == 0) output[bid] = sdata[0];
}

When I run it on the CPU, it is slower than on the GPU (which is OK), but it is also slower than a serial CPU version that I wrote myself. The vector to reduce has 65536 elements.
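The serial version is not shown in the post. For comparison, a minimal sketch of what such a baseline might look like is below. The function name serial_reduce is hypothetical, and it assumes scalar 32-bit unsigned elements rather than the uint4 vectors the kernel uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical serial baseline (not the poster's actual code):
   sums n unsigned 32-bit values in a single loop.
   Unsigned overflow wraps modulo 2^32, like uint addition in OpenCL C. */
uint32_t serial_reduce(const uint32_t *input, size_t n)
{
    assert(input != NULL || n == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += input[i];
    return sum;
}
```

A loop like this has no synchronisation at all, and a compiler is free to unroll and vectorise it.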

himanshu_gautam · ‎06-12-2011

I kind of expected the same.

I think kernel dispatch plus thread scheduling overhead is the main reason for this. Thread scheduling is expensive on CPUs, and when a kernel has so many barriers, each step drastically limiting the number of active threads, the results are not expected to be impressive. So it's better to run a serialized reduction on CPUs unless you have a very large number of cores.
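One way to picture a "serialized reduction" that still uses several cores is to give each core one contiguous chunk, sum it in a plain loop, and combine the few partial sums at the end. The sketch below is only an illustration in host-side C, not code from the SDK or from this thread. The names chunk_sum and chunked_reduce are made up, and the chunks are processed one after another here rather than on separate threads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of one contiguous chunk: the work a single hypothetical
   "work-item" (ideally one per CPU core) would do, with no barriers. */
static uint32_t chunk_sum(const uint32_t *input, size_t begin, size_t end)
{
    uint32_t sum = 0;
    for (size_t i = begin; i < end; ++i)
        sum += input[i];
    return sum;
}

/* Splits input into num_chunks contiguous ranges, sums each serially,
   then combines the partial sums. The chunks are independent, so in a
   real program each could run on its own core; here they run in turn. */
uint32_t chunked_reduce(const uint32_t *input, size_t n, size_t num_chunks)
{
    assert(num_chunks > 0);
    size_t chunk = (n + num_chunks - 1) / num_chunks; /* ceil(n / num_chunks) */
    uint32_t total = 0;
    for (size_t c = 0; c < num_chunks; ++c) {
        size_t begin = c * chunk;
        size_t end = begin + chunk;
        if (begin > n) begin = n;   /* chunks past the end are empty */
        if (end > n) end = n;       /* last chunk may be shorter */
        total += chunk_sum(input, begin, end);
    }
    return total;
}
```

With num_chunks equal to the number of cores, synchronisation happens once, when the partial sums are combined, instead of after every step of a tree.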

LeeHowes · ‎06-13-2011

It's a horrendously inefficient way to do a reduction on a CPU - most of the execution time will be spent synchronising "parallel" executions when in reality a serial loop would be much more efficient. The number of work items running on each step isn't really an issue for the CPU, but the fact that it's doing one comparison, one addition, a state save, a branch, a state load, one comparison, one addition, a state save etc. etc. is (and that's without even including the loop logic that you'd end up with in a non-unrolled and vectorised serial implementation).
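To put rough numbers on this, the sketch below walks the same loop bounds as the kernel for one work-group and counts barriers, "tid < s" comparisons and useful additions. It is a counting model only, written in host-side C. The struct and function names are hypothetical, the local size is assumed to be a power of two, and it says nothing about the actual cost of a state save or load on any particular runtime.

```c
#include <assert.h>

/* Counts for one work-group running the tree-reduction loop from the kernel. */
typedef struct {
    unsigned barriers;     /* barrier() calls each work-item executes */
    unsigned comparisons;  /* "tid < s" tests summed over all work-items */
    unsigned additions;    /* useful additions summed over all work-items */
} tree_stats;

tree_stats tree_reduce_stats(unsigned local_size)
{
    /* Assumption: local_size is a non-zero power of two. */
    assert(local_size > 0 && (local_size & (local_size - 1)) == 0);
    tree_stats st = { 1, 0, 0 };           /* one barrier after the load */
    for (unsigned s = local_size / 2; s > 0; s >>= 1) {
        st.comparisons += local_size;      /* every work-item tests tid < s */
        st.additions += s;                 /* only work-items 0..s-1 add */
        st.barriers += 1;                  /* barrier at the end of each step */
    }
    return st;
}
```

For a local size of 256 this gives 9 barriers and 2048 comparisons for 255 additions, whereas a serial loop over the same 256 values needs the same 255 additions and no barriers.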

It's not a particularly efficient reduction on the GPU, either, for that matter. Most of the execution time would be spent executing control flow instructions rather than useful work. The ISA for this would be a large number of very short clauses with poor ALU utilisation.

laobrasuca · ‎06-13-2011

Have a look here:

http://developer.amd.com/documentation/articles/Pages/OpenCL-Optimization-Case-Study-Simple-Reductions.aspx

erman_amd · ‎06-14-2011

Himanshu and Lee,

Thank you for the answers. I'm in the process of trying Lee's suggestion from a previous thread (http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=150864&enterthread=y).

I will come back once I have a result. Thanks again for the help.

spectral · ‎06-14-2011

Hi Erman,

We are also working on this because scan/reduction is important on CPUs too.

You can check our project and maybe help us improve the library:

http://code.google.com/p/clpp/

This way you will also benefit from each improvement we make.

Krys

Reduction code is slower in CPU (OpenCL) compared to CPU serial