Archives Discussions

erman_amd · ‎05-22-2011

Hi,

Is there a way to do parallel reduction without using local memory?

I studied the sample provided in AMD APP SDK 2.4. It uses local memory and only recieve one vector as its input

I want to do parallel reduction and without using local memory. For example, the kernel receives 3 input vectors and outputs three values (each is the reduction value of each vector). Is it possible write this kind of kernel? Any hints, helps?

Thank you

MicahVillmow · ‎05-23-2011

In a single pass algorithm, this can be done with atomics with global memory. However performance will not be nearly as good as using local memory because you will be serializing access to VRAM and access to globals is far slower than access to local.

The other option is to do it with the streaming model and do a log(N) passes and each pass reduces the data set in half.

erman_amd · ‎05-23-2011

Hi Micah,

I'm interesting in the first option (atomics with global memory). Can you explain more on this or point me to an example in the SDK?

I think the reduction sample in the SDK is an example of the second option (streaming model). It uses multi-stage reduction or tree reduction with local memory. Am I right? Can I implement this second option without using local memory?

settle · ‎05-23-2011

Originally posted by: MicahVillmow The other option is to do it with the streaming model and do a log(N) passes and each pass reduces the data set in half.

How does this option compare in speed? Compared to the single pass atomic reduction and the local memory reduction.

Since CPUs don't have local memory like GPUs to take advantage of (although they have cashing), which method is best on the CPU?

LeeHowes · ‎05-23-2011

On either architecture you want to create exactly the right number of threads/wavefronts to cover the cores, in each thread you loop until you add all the values then you do a sum across that thread's vector.

On the CPU that means maybe 8 threads, each one running an SSE vector. Do the loop as:

for( int i = 0; i < inputSize/8; ++i )

sum += (float4(input));

scalarsum = sum.x + sum.y + sum.z + sum.w

Then you need one of the 8 threads to add up those scalar sums, probably in a global buffer. That way you get the CPU to stream through memory efficiently letting the cache prefetch unit pick up the data continuously maximising bandwidth.

On the GPU you do the same thing, but remember that float4 is actually float256 (float4 in each work item, so you need to add the *64 stride across the vector of work items) and you need more than one wavefront per core because of the throughput design, so maybe 8. Because of the large vector you want to do a reduction across that vector using local memory - and probably extend this to do a reduction across multiple wavefronts in a work group still in local memory, which brings us back down to one scalar per core as a result.

That's the fastest way to do a reduction. We normally issue more than that number of wavefronts on the GPU because it's hard to work out the right number to fill the device and hence you need to do a multi-pass reduction over the resulting scalars from each wave. That's why we end up doing a tree in global memory in many cases (though this is suboptimal). The case Micah is talking about is taking that to the extreme and having each work item only take part in a final tree in global rather than either doing a local tree first to reduce global bandwidth (and dispatches) or doing a looping sum first.

Atomics in global memory will be very slow, atomics in local memory would be faster and might even be comparable with trees. Note in both cases that you can't do floating point atomics at all so you can't do float reductions that way.

MicahVillmow · ‎05-23-2011

erman_amd,
What you need to look into is the old brook+ reduction example. It uses the second option.

The pseudo-code for the first is:
For a single thread
Read N values from global memory
reduce N to a single value, M, using F()
atomically reduce some value M with a global value J using cmpxchg.

The bottleneck in this code will be the cmpxchg as depending on how you select J it could serialize every thread on the device.

Archives Discussions

Parallel reduction on GPU/CPU using OpenCL