Recently I tried to squeeze some performance out of a memory-intensive OCL kernel and went for GCN assembly. I saved a few registers here, a few instructions there, got nice occupancy and thought I had a perfect kernel. However, the performance of the manually tuned kernel was awful, to say the least.
I tried everything, but nothing helped. After that, given my limited knowledge of assembly, I went back to OCL.
I tried a few tricks and managed to lower the register usage and raise occupancy. I was astonished to see that performance dropped by more than 20%!
After that, I changed the code a few times (adding some dummy loops to fool the compiler and waste some registers) so that the main loop stayed unchanged (I confirmed this by inspecting the assembly listing). The funny thing was that when occupancy dropped from 37% to 25%, performance went up by 20%.
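To give an idea of what I mean by a dummy loop, here is a rough OpenCL C sketch (kernel and parameter names are placeholders, not my actual code): the loop never runs because the host passes a zero trip count, but the compiler cannot prove that, so it still allocates registers for the extra accumulators.

```c
__kernel void tuned_kernel(__global const float *in,
                           __global float *out,
                           const int n,
                           const int dummy_iters)   /* pass 0 at run time */
{
    const int gid = get_global_id(0);

    /* Dummy loop: never executes when dummy_iters == 0, but the compiler
     * cannot prove that at compile time, so it keeps the extra accumulators
     * live and burns VGPRs on them. */
    float waste0 = 0.0f, waste1 = 0.0f, waste2 = 0.0f, waste3 = 0.0f;
    for (int i = 0; i < dummy_iters; ++i) {
        waste0 += in[i];
        waste1 += waste0 * 1.0001f;
        waste2 += waste1 * 1.0001f;
        waste3 += waste2 * 1.0001f;
    }

    /* ... the real main loop stays exactly as before ... */
    float acc = waste0 + waste1 + waste2 + waste3;  /* keeps the dummies live */
    for (int i = gid; i < n; i += get_global_size(0)) {
        acc += in[i];
    }
    out[gid] = acc;
}
```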
After that, I decided to go back to the assembly. I took the "optimized" code and added a few dummy registers just to break the occupancy. After all these funny results, I wasn't surprised when performance went up and power draw went down (not by much, though)!
Finally, I decided to take that low-performance OCL kernel and disassemble it. I added a few dummy registers, broke the occupancy and assembled it again. You can guess what happened: the performance went up.
I found a possible explanation here: https://bartwronski.com/2014/03/27/gcn-two-ways-of-latency-hiding-and-wave-occupancy.
"I think that in general (take it with a grain of salt and always check yourself) low wave occupancy and high unroll rate is good way of hiding latency for all those “simple” cases when you have lots of not-dependent texture reads and relatively moderate to high amount of simple ALU in your shaders.
[...]
Furthermore, too high occupancy could be counter-productive there, thrashing your caches. (if you are using very bandwidth-heavy resources)"
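To make that concrete, here is a rough OpenCL C sketch of what "lots of not-dependent reads plus unrolling" can look like (kernel and buffer names are just placeholders): the four loads in the unrolled body do not depend on each other, so even a single resident wavefront can keep several memory requests in flight before the ALU part needs the results.

```c
__kernel void unrolled_sum(__global const float *in,
                           __global float *out,
                           const int n)
{
    const int gid    = get_global_id(0);
    const int stride = get_global_size(0);
    float acc = 0.0f;

    int i = gid;
    /* Manual 4x unroll: all four loads are issued back to back, so the
     * latency of one load is hidden behind the others even at low occupancy. */
    for (; i + 3 * stride < n; i += 4 * stride) {
        const float a = in[i];
        const float b = in[i + stride];
        const float c = in[i + 2 * stride];
        const float d = in[i + 3 * stride];
        /* Simple ALU work that consumes the results. */
        acc += a + b + c + d;
    }
    /* Remainder loop for the leftover elements. */
    for (; i < n; i += stride) {
        acc += in[i];
    }
    out[gid] = acc;
}
```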
In the end, I don't know whether all of this amounts to a question or not, but I would like to hear any opinions / ideas.
Here's my idea:
Make 100% sure that the modified algorithm calculates the exact same thing.
I remember once fooling myself with a bug: it produced faster results simply because the wrong calculation generated far more cache hits.
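A minimal sketch of that check, assuming the results land in plain float buffers on the host (function and parameter names are made up): read back the output of the reference kernel and the tuned one, then diff them element by element with a small tolerance before trusting any timing numbers.

```c
#include <math.h>
#include <stdio.h>
#include <stddef.h>

/* Compare the reference kernel's output against the tuned one.
 * Returns the number of mismatches beyond a small relative tolerance. */
static size_t compare_results(const float *reference, const float *optimized,
                              size_t count, float rel_tol)
{
    size_t mismatches = 0;
    for (size_t i = 0; i < count; ++i) {
        const float diff  = fabsf(reference[i] - optimized[i]);
        const float scale = fmaxf(fabsf(reference[i]), 1.0f);
        if (diff > rel_tol * scale) {
            if (mismatches < 10)   /* print only the first few offenders */
                printf("mismatch at %zu: ref=%g opt=%g\n",
                       i, reference[i], optimized[i]);
            ++mismatches;
        }
    }
    return mismatches;
}
```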
I would like to add a few points from the "AMD OpenCL Optimization Guide" in case they help.
In general, a higher number of active wavefronts (i.e. higher occupancy) helps to hide memory latency and thus improves overall performance. However, in some scenarios increasing occupancy may not provide any performance benefit, or can even have a negative impact on performance. Here are a couple of such scenarios mentioned in the "AMD OpenCL Optimization Guide":
For more information, please refer to this section: OPENCL Optimization — ROCm Documentation 1.0.0 documentation
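As a side note on how register pressure turns into those occupancy numbers: on GCN each SIMD has 256 VGPRs per lane and can host at most 10 wavefronts, so the theoretical wave count is just the VGPR budget divided by the (granularity-rounded) per-work-item VGPR usage, capped at 10. A back-of-the-envelope sketch of that calculation (the granularity value of 4 is what I believe GCN uses; please double-check for your exact chip):

```c
/* Rough GCN occupancy estimate from VGPR usage alone.
 * Assumes 256 VGPRs per SIMD, at most 10 waves per SIMD, and a VGPR
 * allocation granularity of 4 -- check the exact values for your GPU. */
static int waves_per_simd(int vgprs_per_workitem)
{
    const int vgpr_budget = 256;
    const int max_waves   = 10;
    const int granularity = 4;

    /* Round the per-work-item VGPR count up to the allocation granularity. */
    int alloc = ((vgprs_per_workitem + granularity - 1) / granularity) * granularity;
    if (alloc <= 0)
        return max_waves;

    int waves = vgpr_budget / alloc;
    return waves > max_waves ? max_waves : waves;
}

/* Example: 84 VGPRs -> 256/84 = 3 waves per SIMD (30% of the 10-wave maximum),
 * while trimming the kernel to 64 VGPRs would allow 4 waves (40%). */
```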
Thanks.