Hello. I'm looking for insight into what factors can affect the latency of export instructions in fragment shaders compiled for RDNA1, and whether the export instruction latencies reported in Radeon GPU Profiler are meaningful as I optimize my shaders or are just a red herring. TL;DR: two versions of a fragment shader report latencies on their export instructions that differ by a factor of *ten*.
I'm using RGP to compare the performance of two expensive fragment shaders (before and after a rewrite to reduce register pressure and optimize the algorithm) on an RX 5700 XT. The rewritten shader consumes different buffer and image objects but logically operates on the same geometry as before; the two shaders were profiled at the same camera position and ran over the same number of fragments. The inner loops differ, but the overall shape of the two shaders is the same:
- Read UBO and per-fragment SSBO data
- Perform tight inner loop with texture reads and some SSBO reads
- After the loop, output per-fragment depth and a single output vec4, or discard if a condition is met (which becomes more likely as the iteration count grows). The shaders only write to the framebuffer.
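For concreteness, the overall structure looks roughly like the sketch below. This is not the actual shader: the binding layout, buffer contents, loop body, and discard condition are all placeholders, and the real inner loop is considerably more involved.

```glsl
#version 450

layout(binding = 0) uniform Params {
    int maxSteps;          // placeholder UBO contents
} params;

layout(std430, binding = 1) readonly buffer PerFragment {
    vec4 data[];           // placeholder per-fragment SSBO data
} perFrag;

layout(binding = 2) uniform sampler3D volumeTex;

layout(location = 0) out vec4 outColor;

void main() {
    // Read UBO and per-fragment SSBO data
    vec4 d = perFrag.data[int(gl_FragCoord.x)];   // indexing is illustrative

    vec4 accum = vec4(0.0);
    // Tight inner loop with texture reads (and some SSBO reads, omitted here)
    for (int i = 0; i < params.maxSteps; ++i) {
        accum += texture(volumeTex, d.xyz * float(i));
    }

    // Discard becomes more likely after larger iteration counts
    if (accum.a < 0.001)
        discard;

    // Per-fragment depth plus a single color output; nothing else is written
    gl_FragDepth = d.w;
    outColor = accum;
}
```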
The rewritten shader uses fewer registers, as intended, and this has the expected effect on s_waitcnt instructions: the total number of clocks is similar before and after the rewrite, but a larger fraction of the latency is hidden by other waves. Between the higher occupancy and the other algorithm-level changes, its draw call completes in close to half the time. However, for some reason, the reported latency of the export instructions(*) themselves explodes from ~1200 clk (normalized by hit count) to ~17000 clk. I also noticed that the first s_waitcnt lgkmcnt(0) in each shader differs in latency by an even more extreme ratio (though, in this case, the new shader does perform a larger number of scalar reads at the start).
My question, then: is this increase in reported latency a sign that the rewritten shader does something suboptimal that I should fix, or do these latency numbers right before the end of the program merely reflect other aspects of GPU state that are out of my control and/or irrelevant to the shader's performance?
(*) The ISA contains *two* branches that export and end the program: one exports color and depth, the other exports nothing. I assume the latter is a compiler-generated fast path for waves in which every fragment is discarded. Both branches show high instruction latencies in the old and new shaders, and the roughly factor-of-ten growth from old to new appears in both.