Hello. I'm looking for insight into what factors can affect the latency of export instructions in fragment shaders compiled for RDNA1, and whether the export instruction latencies reported in Radeon GPU Profiler are meaningful as I optimize my shaders or are just a red herring. TL;DR: two versions of a fragment shader report latencies on their export instructions that differ by a factor of *ten*.
I'm using RGP to compare the performance of two expensive fragment shaders (before and after a rewrite to reduce register pressure and optimize the algorithm) on an RX 5700 XT. The rewritten shader consumes different buffer and image objects but logically operates on the same geometry as before; the two shaders were profiled at the same camera position and ran over the same number of fragments. The inner loops differ, but the overall shape of the two shaders is the same:
- Read UBO and per-fragment SSBO data
- Perform tight inner loop with texture reads and some SSBO reads
- After the loop, output per-fragment depth and a single output vec4, or discard if a condition is met (which becomes more likely as the iteration count grows). The shaders only write to the framebuffer.
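For concreteness, the overall structure looks roughly like the sketch below. This is not the actual shader: the binding layout, buffer contents, loop body, and discard condition are all placeholders, and the real inner loop is considerably more involved.

```glsl
#version 450

layout(binding = 0) uniform Params {
    int maxSteps;          // placeholder UBO contents
} params;

layout(std430, binding = 1) readonly buffer PerFragment {
    vec4 data[];           // placeholder per-fragment SSBO data
} perFrag;

layout(binding = 2) uniform sampler3D volumeTex;

layout(location = 0) out vec4 outColor;

void main() {
    // Read UBO and per-fragment SSBO data
    vec4 d = perFrag.data[int(gl_FragCoord.x)];   // indexing is illustrative

    vec4 accum = vec4(0.0);
    // Tight inner loop with texture reads (and some SSBO reads, omitted here)
    for (int i = 0; i < params.maxSteps; ++i) {
        accum += texture(volumeTex, d.xyz * float(i));
    }

    // Discard becomes more likely after larger iteration counts
    if (accum.a < 0.001)
        discard;

    // Per-fragment depth plus a single color output; nothing else is written
    gl_FragDepth = d.w;
    outColor = accum;
}
```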
The rewritten shader uses fewer registers, as intended, and this has the expected effect on s_waitcnt instructions: the total number of clocks is similar before and after the rewrite, but a larger fraction of the latency is hidden by other waves. Between the higher occupancy and the other algorithm-level changes, its draw call completes in close to half the time. However, for some reason, the reported latency of the export instructions(*) themselves explodes from ~1200 clk (normalized by hit count) to ~17000 clk. I also noticed that the first s_waitcnt lgkmcnt(0) in each shader differs in latency by an even more extreme ratio (though, in this case, the new shader does perform a larger number of scalar reads at the start).
My question, then: is this increase in reported latency a sign that the rewritten shader does something suboptimal that I should fix, or do these latency numbers right before the end of the program merely reflect other aspects of GPU state that are out of my control and/or irrelevant to the shader's performance?
(*) The ISA contains *two* branches that export and end the program: one exports color and depth, the other exports nothing. I assume the latter is a compiler-generated fast path for waves in which every fragment is discarded. Both branches show high instruction latencies in the old and new shaders, and the roughly factor-of-ten growth from old to new appears in both.