
erman_amd
Journeyman III

Reduction code is slower in CPU (OpenCL) compared to CPU serial

Hi,

 

I tried the Reduction sample from the AMD SDK.

__kernel void
reduce(__global uint4* input, __global uint4* output, __local uint4* sdata)
{
    // load shared mem
    unsigned int tid = get_local_id(0);
    unsigned int bid = get_group_id(0);
    unsigned int gid = get_global_id(0);

    unsigned int localSize = get_local_size(0);
    sdata[tid] = input[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    // do reduction in shared mem
    for(unsigned int s = localSize / 2; s > 0; s >>= 1)
    {
        if(tid < s)
        {
            sdata[tid] += sdata[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // write result for this block to global mem
    if(tid == 0) output[bid] = sdata[0];
}



 

When I run it on the CPU, it is slower than on the GPU (which is expected), but it is also slower than a serial CPU version that I wrote myself. The length of the vector to reduce is 65536.
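The serial version is not shown in the post. As a rough point of comparison, a minimal serial baseline might look like the sketch below. It is an assumption, not the poster's code: the name reduce_serial is made up, and it sums scalar 32-bit values rather than uint4 vectors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical serial baseline: sum n unsigned 32-bit values in one pass.
   Unsigned overflow wraps modulo 2^32, as it does for uint in the kernel. */
static uint32_t reduce_serial(const uint32_t *data, size_t n)
{
    assert(data != NULL || n == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}
```

For the 65536-element vector mentioned above, this is a single loop of 65536 additions with no synchronisation at all, which a compiler can usually unroll and vectorise.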

0 Likes
5 Replies
himanshu_gautam
Grandmaster

I kind of expected the same.

I think kernel dispatch plus thread scheduling overhead is the main reason for this. Thread scheduling is expensive on CPUs, and when a kernel contains so many barriers, with the number of active threads shrinking drastically at each step, the results are not expected to be impressive. So it's better to run a serialized reduction on CPUs unless you have a very large number of cores.
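One way to picture a "serialized" reduction that can still use a few cores is to split the input into a small number of contiguous chunks, reduce each chunk with a plain loop, and then add up the partial sums. The sketch below only illustrates that idea and is not taken from the SDK sample; reduce_chunked and its parameters are hypothetical names, and the chunks run one after another here instead of on separate threads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: reduce n values as `chunks` independent serial loops,
   then combine the partial sums. Each outer iteration reads a disjoint slice,
   so each could be handed to its own core or to one work-item. */
static uint32_t reduce_chunked(const uint32_t *data, size_t n, size_t chunks)
{
    assert(chunks > 0);
    size_t base = n / chunks;   /* minimum slice length */
    size_t extra = n % chunks;  /* the first `extra` slices get one more element */
    size_t begin = 0;
    uint32_t total = 0;
    for (size_t c = 0; c < chunks; ++c) {
        size_t len = base + (c < extra ? 1 : 0);
        uint32_t partial = 0;
        for (size_t i = begin; i < begin + len; ++i)
            partial += data[i];  /* serial loop, no barrier needed */
        total += partial;        /* final combine: one addition per chunk */
        begin += len;
    }
    return total;
}
```

With the chunk count set to roughly the number of cores, the only synchronisation left is the final combine, instead of one barrier per halving step.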

0 Likes

It's a horrendously inefficient way to do a reduction on a CPU: most of the execution time will be spent synchronising "parallel" executions when in reality a serial loop would be much more efficient. The number of work items running on each step isn't really an issue for the CPU, but the fact that it's doing one comparison, one addition, a state save, a branch, a state load, one comparison, one addition, a state save, and so on, is (and that's without even including the loop logic that you'd end up with in a non-unrolled and vectorised serial implementation).

It's not a particularly efficient reduction on the GPU, either, for that matter. Most of the execution time would be spent executing control flow instructions rather than useful work. The ISA for this would be a large number of very short clauses with poor ALU utilisation.
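To make the per-step overhead concrete, the kernel's loop can be replayed on the host for a single work-group. This is a sketch under assumptions: reduce_tree_group is a hypothetical helper, it uses scalar 32-bit values instead of uint4, and a plain counter stands in for each barrier inside the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: replays the kernel's halving loop for one work-group.
   sdata plays the role of local memory and is reduced in place.
   local_size must be a power of two, as the halving loop assumes.
   Returns the group's sum; *rounds receives the number of loop steps,
   each of which ends in a barrier in the real kernel. */
static uint32_t reduce_tree_group(uint32_t *sdata, size_t local_size,
                                  unsigned *rounds)
{
    assert(local_size > 0 && (local_size & (local_size - 1)) == 0);
    unsigned r = 0;
    for (size_t s = local_size / 2; s > 0; s >>= 1) {
        /* Every work-item evaluates tid < s, but only s of them add.
           Reads come from [s, 2s) and writes go to [0, s), so replaying
           the work-items in order gives the same result. */
        for (size_t tid = 0; tid < local_size; ++tid) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
        }
        ++r; /* stands in for barrier(CLK_LOCAL_MEM_FENCE) */
    }
    *rounds = r;
    return sdata[0];
}
```

Taking a work-group of 256 work-items as an example size, the loop runs 8 steps with a barrier after each, evaluates 8 × 256 = 2048 comparisons, and performs only 255 useful additions. A serial loop over the same 256 values does about the same number of additions with no barriers.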

0 Likes

Have a look here:

http://developer.amd.com/documentation/articles/Pages/OpenCL-Optimization-Case-Study-Simple-Reductions.aspx

0 Likes

Himanshu and Lee,

Thank you for the answers. I'm currently trying Lee's suggestion from the previous thread (http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=150864&enterthread=y).

I will come back once I have a result. Thanks again for the help.

0 Likes

Hi Erman,

We are also working on this, because scan/reduction is important on the CPU too.

You can check out our project and maybe help us improve the library:

http://code.google.com/p/clpp/

This way you will also benefit from each improvement we make.

Krys

0 Likes