Hi,
Is there a way to do parallel reduction without using local memory?
I studied the sample provided in the AMD APP SDK 2.4. It uses local memory and only receives one vector as its input.
I want to do a parallel reduction without using local memory. For example, the kernel receives 3 input vectors and outputs three values (each the reduction of one vector). Is it possible to write this kind of kernel? Any hints or help?
Thank you
Hi Micah,
I'm interested in the first option (atomics with global memory). Can you explain more about this or point me to an example in the SDK?
I think the reduction sample in the SDK is an example of the second option (streaming model). It uses multi-stage reduction or tree reduction with local memory. Am I right? Can I implement this second option without using local memory?
Originally posted by: MicahVillmow The other option is to do it with the streaming model and do a log(N) passes and each pass reduces the data set in half.
How does this option compare in speed to the single-pass atomic reduction and the local-memory reduction?
Since CPUs don't have local memory like GPUs do (although they have caching), which method is best on the CPU?
On either architecture you want to create exactly the right number of threads/wavefronts to cover the cores; in each thread you loop until you have added all of its values, then you do a sum across that thread's vector.
On the CPU that means maybe 8 threads, each one running an SSE vector. Do the loop as:
float4 sum = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
for( int i = 0; i < inputSize/8; ++i )
    sum += input[i];
scalarSum = sum.x + sum.y + sum.z + sum.w;
Then you need one of the 8 threads to add up those scalar sums, probably in a global buffer. That way you get the CPU to stream through memory efficiently letting the cache prefetch unit pick up the data continuously maximising bandwidth.
On the GPU you do the same thing, but remember that float4 is effectively float256 (a float4 in each of the 64 work items of a wavefront, so you need to add the *64 stride across the vector of work items), and you need more than one wavefront per core because of the throughput design, so maybe 8. Because of the large vector you want to do a reduction across that vector using local memory, and probably extend this to a reduction across the multiple wavefronts in a work group, still in local memory, which brings us back down to one scalar per core as a result.
That's the fastest way to do a reduction. We normally issue more than that number of wavefronts on the GPU because it's hard to work out the right number to fill the device, and hence you need to do a multi-pass reduction over the resulting scalars from each wave. That's why we end up doing a tree in global memory in many cases (though this is suboptimal). The case Micah is talking about takes that to the extreme: each work item only takes part in a final tree in global memory, rather than either doing a local tree first to reduce global bandwidth (and dispatches) or doing a looping sum first.
Atomics in global memory will be very slow; atomics in local memory would be faster and might even be comparable with trees. Note that in both cases you can't do floating-point atomics at all, so you can't do float reductions that way.