Recently I tried to squeeze some performance out of a memory-intensive OCL kernel and went for GCN assembly. I saved a few registers here, a few instructions there, got nice occupancy and thought I had a perfect kernel. However, the performance of the manually tuned kernel was awful, to say the least.
I tried everything, but nothing helped. After that, given my limited knowledge of assembly, I went back to OCL.
I tried a few tricks and managed to lower the register usage and raise occupancy. I was astonished to see that performance dropped by more than 20%!
After that, I changed the code a few times (adding some dummy loops to fool the compiler and waste some registers) so that the main loop stayed unchanged (I confirmed this by inspecting the assembly listing). The funny thing was that when occupancy dropped from 37% to 25%, performance went up by 20%.
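To give an idea of what I mean by a dummy loop, here is a rough OpenCL C sketch (kernel and parameter names are placeholders, not my actual code): the loop never runs because the host passes a zero trip count, but the compiler cannot prove that, so it still allocates registers for the extra accumulators.

```c
__kernel void tuned_kernel(__global const float *in,
                           __global float *out,
                           const int n,
                           const int dummy_iters)   /* pass 0 at run time */
{
    const int gid = get_global_id(0);

    /* Dummy loop: never executes when dummy_iters == 0, but the compiler
     * cannot prove that at compile time, so it keeps the extra accumulators
     * live and burns VGPRs on them. */
    float waste0 = 0.0f, waste1 = 0.0f, waste2 = 0.0f, waste3 = 0.0f;
    for (int i = 0; i < dummy_iters; ++i) {
        waste0 += in[i];
        waste1 += waste0 * 1.0001f;
        waste2 += waste1 * 1.0001f;
        waste3 += waste2 * 1.0001f;
    }

    /* ... the real main loop stays exactly as before ... */
    float acc = waste0 + waste1 + waste2 + waste3;  /* keeps the dummies live */
    for (int i = gid; i < n; i += get_global_size(0)) {
        acc += in[i];
    }
    out[gid] = acc;
}
```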
After that, I decided to go back to the assembly. I took the "optimized" code and added a few dummy registers just to break the occupancy. After all these funny results, I wasn't surprised when performance went up and power draw went down (not by much, though)!
Finally, I decided to take that low-performance OCL kernel and disassemble it. I added a few dummy registers, broke the occupancy and assembled it again. You can guess what happened: the performance went up.
I found a possible explanation here: https://bartwronski.com/2014/03/27/gcn-two-ways-of-latency-hiding-and-wave-occupancy.
"I think that in general (take it with a grain of salt and always check yourself) low wave occupancy and high unroll rate is good way of hiding latency for all those “simple” cases when you have lots of not-dependent texture reads and relatively moderate to high amount of simple ALU in your shaders.
[...]
Furthermore, too high occupancy could be counter-productive there, thrashing your caches. (if you are using very bandwidth-heavy resources)"
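To make that concrete, here is a rough OpenCL C sketch of what "lots of not-dependent reads plus unrolling" can look like (kernel and buffer names are just placeholders): the four loads in the unrolled body do not depend on each other, so even a single resident wavefront can keep several memory requests in flight before the ALU part needs the results.

```c
__kernel void unrolled_sum(__global const float *in,
                           __global float *out,
                           const int n)
{
    const int gid    = get_global_id(0);
    const int stride = get_global_size(0);
    float acc = 0.0f;

    int i = gid;
    /* Manual 4x unroll: all four loads are issued back to back, so the
     * latency of one load is hidden behind the others even at low occupancy. */
    for (; i + 3 * stride < n; i += 4 * stride) {
        const float a = in[i];
        const float b = in[i + stride];
        const float c = in[i + 2 * stride];
        const float d = in[i + 3 * stride];
        /* Simple ALU work that consumes the results. */
        acc += a + b + c + d;
    }
    /* Remainder loop for the leftover elements. */
    for (; i < n; i += stride) {
        acc += in[i];
    }
    out[gid] = acc;
}
```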
In the end, I don't know whether all of this amounts to a question or not, but I would like to hear any opinions / ideas.
Here's my idea:
Make 100% sure that the modified algorithm calculates the exact same thing.
I remember once fooling myself with a bug: it produced faster results simply because the wrong calculation generated far more cache hits.
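A minimal sketch of that check, assuming the results land in plain float buffers on the host (function and parameter names are made up): read back the output of the reference kernel and the tuned one, then diff them element by element with a small tolerance before trusting any timing numbers.

```c
#include <math.h>
#include <stdio.h>
#include <stddef.h>

/* Compare the reference kernel's output against the tuned one.
 * Returns the number of mismatches beyond a small relative tolerance. */
static size_t compare_results(const float *reference, const float *optimized,
                              size_t count, float rel_tol)
{
    size_t mismatches = 0;
    for (size_t i = 0; i < count; ++i) {
        const float diff  = fabsf(reference[i] - optimized[i]);
        const float scale = fmaxf(fabsf(reference[i]), 1.0f);
        if (diff > rel_tol * scale) {
            if (mismatches < 10)   /* print only the first few offenders */
                printf("mismatch at %zu: ref=%g opt=%g\n",
                       i, reference[i], optimized[i]);
            ++mismatches;
        }
    }
    return mismatches;
}
```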
I would like to add a few points from the "AMD OpenCL Optimization Guide" in case they help.
In general, a higher number of active wavefronts (i.e. higher occupancy) helps to hide memory latency and thus improves overall performance. However, in some scenarios increasing occupancy may not provide any performance benefit, or can even have a negative impact on performance. Here are a couple of such scenarios mentioned in the "AMD OpenCL Optimization Guide":
For more information, please refer to this section: OPENCL Optimization — ROCm Documentation 1.0.0 documentation
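As a side note on how register pressure turns into those occupancy numbers: on GCN each SIMD has 256 VGPRs per lane and can host at most 10 wavefronts, so the theoretical wave count is just the VGPR budget divided by the (granularity-rounded) per-work-item VGPR usage, capped at 10. A back-of-the-envelope sketch of that calculation (the granularity value of 4 is what I believe GCN uses; please double-check for your exact chip):

```c
/* Rough GCN occupancy estimate from VGPR usage alone.
 * Assumes 256 VGPRs per SIMD, at most 10 waves per SIMD, and a VGPR
 * allocation granularity of 4 -- check the exact values for your GPU. */
static int waves_per_simd(int vgprs_per_workitem)
{
    const int vgpr_budget = 256;
    const int max_waves   = 10;
    const int granularity = 4;

    /* Round the per-work-item VGPR count up to the allocation granularity. */
    int alloc = ((vgprs_per_workitem + granularity - 1) / granularity) * granularity;
    if (alloc <= 0)
        return max_waves;

    int waves = vgpr_budget / alloc;
    return waves > max_waves ? max_waves : waves;
}

/* Example: 84 VGPRs -> 256/84 = 3 waves per SIMD (30% of the 10-wave maximum),
 * while trimming the kernel to 64 VGPRs would allow 4 waves (40%). */
```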
Thanks.