Because one thread runs really slowly. A GPU is only fast if you can have many hundreds of threads running concurrently, compensating for the slowness of the individual threads.
That and that one thread isn't a thread at all. 64 threads is a thread (I'm losing my multi-year battle against the abuse of the word thread, I realise). So if you don't use 64 threads together you're only using 1/64 of a thread, which is silly.
In this case, anyway, the answer is "it depends". A global memory reduction in a single work item will be horrible: it issues a sequence of dependent reads one after another, so you'd be latency bound unless you write it cleverly. If you do a read across the wavefront from global memory and reduce in LDS, that might work.
For a min operation we have very fast interfaces on LDS to perform the reduction, and in general on AMD hardware integer reductions in LDS are best performed using atomics. For floating point you have no choice, because nobody puts floating point reduction units on memories: you have to do a full memory read/write cycle through the ALUs (which may be hidden inside an atomic instruction), which is slower.
For global memory, atomics may or may not benefit you. They change the paths we have to use through the memory system and can slow down other memory operations, depending on compiler analysis. They have to do their work out in the memory system, so it is probably faster to cache data in LDS, use atomics there, and then write out.
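To make the "cache in LDS, use atomics there, then write out" idea concrete, here is a minimal CPU-side model in C++. It is not GPU code and says nothing about real timings: lanes are modelled as loop iterations, a plain `int` stands in for an LDS word, and the names `min_global_atomics`, `min_lds_then_global` and `MinResult` are made up for this sketch. All it shows is how many operations reach global memory under each strategy, assuming 64 work items per group.

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>
#include <vector>

// CPU-side model only. Assumption: 64 work items per group (one wavefront).
constexpr std::size_t kWavefront = 64;

struct MinResult {
    int value;                   // final minimum
    std::size_t global_atomics;  // operations that reached "global memory"
};

// Strategy A: every work item issues its own global atomic min.
MinResult min_global_atomics(const std::vector<int>& data) {
    MinResult r{INT_MAX, 0};
    for (int x : data) {
        r.value = std::min(r.value, x);  // stands in for a global atomic min
        ++r.global_atomics;
    }
    return r;
}

// Strategy B: each group reduces into one LDS word, then writes out once.
MinResult min_lds_then_global(const std::vector<int>& data) {
    MinResult r{INT_MAX, 0};
    for (std::size_t base = 0; base < data.size(); base += kWavefront) {
        int lds_min = INT_MAX;  // stands in for one word of LDS
        const std::size_t end = std::min(base + kWavefront, data.size());
        for (std::size_t i = base; i < end; ++i) {
            lds_min = std::min(lds_min, data[i]);  // stands in for an LDS atomic min
        }
        r.value = std::min(r.value, lds_min);  // one global atomic per group
        ++r.global_atomics;
    }
    return r;
}
```

For 256 input values the first strategy sends 256 operations out to the memory system and the second sends 4. Whether that actually wins on a given device is still something to measure.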
As I said, though, the answer is very much "it depends". Any strange combination *might* be faster, even if it seems like it should not, because of the way it interacts with other operations. If you want an LDS reduction but are LDS constrained, then clearly that makes a difference. If you don't want an LDS reduction maybe you'll find you are register bound, or you're still overusing the LDS interfaces, or you change the type of memory operations the compiler has to generate to get efficient caching. The only right answer is to test it.
Multiple SIMD lanes writing to LDS will work in parallel if they align sensibly, yes. It's a wide memory interface.
You then have a long loop reading from LDS into an accumulation register, and you write that out once. So you do one wide, fast write to LDS followed by a long sequence of slow reads from LDS, each with a full address generation, wait and read turnaround. Call it a couple of instruction slots per item, one for the read and one to generate the address (assuming the loop is unrolled): at 8 cycles per instruction slot and 64 items of data, that is 1024 cycles of latency. More if the compiler doesn't parallelise the add and read correctly.
If you issue 64 atomics to LDS in parallel then the LDS interface will perform the reduction. It can do that faster without issuing instructions, no time wasted generating addresses, looping and so on. Just cycle by cycle add and return. I don't know how many cycles that takes but let's say it is 8 cycles each time instead of the 16 to include the extra instruction slot. You've already halved it. The hardware can guarantee a low latency for this operation in a way that your instruction stream may not achieve.
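The back-of-envelope numbers above can be written down as a small C++ sketch. The 8 cycles per instruction slot and the 64 items come from the estimate in this thread; the 8 cycles per LDS atomic is the guess made above, not a measured or documented figure, and the function names are invented for illustration.

```cpp
#include <cstddef>

// Figures taken from the estimate in the text, not from hardware documentation.
constexpr std::size_t kItems = 64;         // one value per lane of the wavefront
constexpr std::size_t kCyclesPerSlot = 8;  // cycles per instruction slot

// Single-lane loop: each item costs `slots_per_item` instruction slots.
constexpr std::size_t loop_read_cycles(std::size_t slots_per_item) {
    return kItems * slots_per_item * kCyclesPerSlot;
}

// LDS atomics: each item costs `cycles_per_atomic` (a guess, see above).
constexpr std::size_t lds_atomic_cycles(std::size_t cycles_per_atomic) {
    return kItems * cycles_per_atomic;
}

// Read slot plus address-generation slot per item: 64 * 2 * 8 cycles.
static_assert(loop_read_cycles(2) == 1024, "loop estimate");
// Guessing 8 cycles per atomic: 64 * 8 cycles, half the loop estimate.
static_assert(lds_atomic_cycles(8) == 512, "atomic estimate");
```

Plugging in a different guess for the atomic cost shifts the ratio, which is another reason to measure rather than trust the arithmetic.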
There is a small integer ALU on each lane of the LDS interface. You can easily imagine that being faster than reading out into a register, taking the register into the main ALU, writing that through the ALU to perform the addition, copying that back out into LDS and so on. It's just a hardware optimisation in the same way that texture filtering is.
Not proper ordering, merely *an* ordering. Atomics guarantee a total order of computations but not any particular order.
For floats we don't have little FPUs on the LDS unit because they're too big, so you have to use the main ALUs. Most likely a tree reduction would be your best bet for floats. Make sure you don't do a work-optimal one but a cycle-optimal one, i.e. don't mask out lanes but use your entire SIMD unit together. You'll also want to drop the barriers if you want any performance out of it, but then you have to be careful about dropping out of spec.
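One possible reading of "don't mask out lanes" is a butterfly-style tree, where every lane does an add at every step instead of half the lanes dropping out each time. The sketch below is a CPU-side model in C++, not GPU code: the array stands in for the 64 lanes of a wavefront, the double buffer stands in for lockstep execution, and `butterfly_sum` is a name invented here. It does not model barriers or LDS traffic.

```cpp
#include <array>
#include <cstddef>

// CPU-side model only. Assumption: 64 lanes per wavefront.
constexpr std::size_t kLanes = 64;
using Wavefront = std::array<float, kLanes>;

// Butterfly reduction: at each step every lane adds the value held by the
// lane whose index differs in one bit. No lane is masked out, and after
// log2(64) = 6 steps every lane holds the full sum.
Wavefront butterfly_sum(Wavefront lanes) {
    static_assert((kLanes & (kLanes - 1)) == 0, "lane count must be a power of two");
    for (std::size_t stride = kLanes / 2; stride > 0; stride /= 2) {
        Wavefront next{};  // double buffer: models all lanes stepping together
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            next[lane] = lanes[lane] + lanes[lane ^ stride];
        }
        lanes = next;
    }
    return lanes;
}
```

The model does 64 adds per step where a masked tree would do 32, 16, 8 and so on, but the step count is the same and no lane needs a branch. On real hardware the masked-out lanes would be idle anyway, which is the point of preferring the cycle-optimal form.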