These days I tried to squeeze some performance from a memory-intensive OCL kernel and went for GCN assembly. Saved a few registers here, few instructions there, got a nice occupancy and thought to have a perfect kernel. However, the performance of manually tuned kernel was awful to say at least.
I tried everything but nothing helped. After that, with some limited knowledge about assembly, I went back to OCL.
Tried a few tricks and succeed to lower the register usage and raise occupancy. I was astonished when saw that performance dropped by more than 20%!
After that, I changed the code a bit few times, (added some dummy loops to fool the compiler and waste some registers), so that main loop stays unchanged (I confirmed this by watching the assembly listing). The funny thing was that when occupancy drops from 37% to 25%, performance goes up by 20%.
After that, I decided to go back to the assembly. Took the "optimized" code and added few dummy register just to brake the occupancy. After all funny things, I wasn't surprised when performance goes up and power draw goes down (not by much though)!
After all, I decided to take that OCL low-performance kernel and disassemble it. I added a few dummy registers, broke the occupancy and assemble it. You guess what happened. The performance went up.
I found some possible explanation here: https://bartwronski.com/2014/03/27/gcn-two-ways-of-latency-hiding-and-wave-occupancy.
"I think that in general (take it with a grain of salt and always check yourself) low wave occupancy and high unroll rate is good way of hiding latency for all those “simple” cases when you have lots of not-dependent texture reads and relatively moderate to high amount of simple ALU in your shaders.
....
Furthermore, too high occupancy could be counter-productive there, thrashing your caches. (if you are using very bandwidth-heavy resources)"
In the end, I don't know if all my writings is a question or not, but I would like to hear any opinions / ideas.