Because one thread runs really slowly. A GPU is only fast if you can have many hundreds of threads running concurrently to compensate for the slowness of the individual threads.
That, and the fact that one thread isn't really a thread at all. 64 threads is a thread (I'm losing my multi-year battle against the abuse of the word thread, I realise). So if you don't use 64 threads together you're only using 1/64 of a thread, which is silly.
In this case, anyway, the answer is "it depends". A global memory reduction in one work item will be horrible, because it will issue a sequence of reads in turn with horrible dependencies, and you'll be latency bound unless you write it cleverly. If you do a read across the wavefront from global memory and reduce in LDS, that might work.
For a min operation we have very fast interfaces on LDS to perform the reduction, and in general on AMD hardware integer reductions in LDS are best performed using atomics. For floating point you have no choice, because nobody puts floating point reduction units on memories: you have to do a full memory read/write cycle through the ALUs (which may be hidden inside an atomic instruction), which is slower.
For global memory, atomics may or may not benefit you. They change the paths we have to use through the memory system and can slow down other memory operations, depending on compiler analysis. They have to do their work out in the memory system, and it is probably faster to cache data in LDS, use atomics there, and then write out.
As I said, though, the answer is very much "it depends". Any strange combination *might* be faster, even if it seems like it should not be, because of the way it interacts with other operations. If you want an LDS reduction but are LDS constrained, then clearly that makes a difference. If you don't want an LDS reduction, maybe you'll find you are register bound, or you're still overusing the LDS interfaces, or you change the type of memory operations the compiler has to generate to get efficient caching. The only right answer is to test it.
I would like to understand why atomics were used in that particular example; surely there would be different approaches for different tasks...
Let's consider LDS-based reduction only for now.
In the sample, each work item in the group (to simplify, let's restrict the group size to a single wavefront) reads a value from the _same_ location in LDS, compares it with its own register, then writes back to that same location in LDS. And all of this must be done atomically, that is, all accesses to that LDS location are serialized.
In "my" approach:
all work items write their register values to _different_ locations in LDS. For a wavefront this should result in a burst/coalesced write, no?
Then a single work item sequentially reads all the written LDS locations, compares each with its own register, and updates global memory (once).
That is, in both cases the LDS reads are serialized, but "my" approach allows a better write pattern to LDS, no?
Please, could someone point out where the slowdown will be? Let's discuss this particular example, not reduction in general.
As I understand it, the example was placed in the manual for a better understanding of how the GPU works, and it's not clear to me why it should work faster.
Multiple SIMD lanes writing to LDS will work in parallel if they align sensibly, yes. It's a wide memory interface.
You then have a long loop reading from LDS into an accumulation register, and you write that out once. You do a wide, fast write to LDS followed by a long sequence of slow reads from LDS, with a full address-generation, wait, read turnaround on each. At at least a couple of instruction slots per read, 8 cycles per instruction slot, 64 items of data, plus an instruction slot to generate the address each time (assuming the loop is unrolled), you have 1024 cycles of latency to do that. More if the compiler doesn't parallelise the add and read correctly.
If you issue 64 atomics to LDS in parallel then the LDS interface will perform the reduction. It can do that faster without issuing instructions: no time wasted generating addresses, looping and so on, just a cycle-by-cycle add and return. I don't know how many cycles that takes, but let's say it is 8 cycles each time instead of the 16 that include the extra instruction slot. You've already halved it. The hardware can guarantee a low latency for this operation in a way that your instruction stream may not achieve.
There is a small integer ALU on each lane of the LDS interface, you can easily imagine that being faster than reading out into a register, taking the register into the main ALU, writing that through the ALU to perform the addition, copying that back out into LDS and so on. It's just a hardware optimisation in the same way that texture filtering is.
That is, by using atomics in this case we not only ensure proper ordering but also bring an additional piece of hardware to bear on the work. It's much clearer now, thanks for the explanation!
But this will not work for a floating point min implementation, right? What is the preferred way to do the same for float values?
Not proper ordering, merely *an* ordering. Atomics guarantee a total order of computations but not any particular order.
For floats we don't have little FPUs on the LDS unit because they're too big, so you have to use the main ALUs. Most likely a tree reduction would be your best bet for floats. Make sure you do a cycle-optimal one, not a work-optimal one: i.e. don't mask out lanes, use your entire SIMD unit together. You'll also want to drop the barriers if you want any performance out of it, but then you have to be careful about stepping outside the spec.