I have a function A that I call from two separate kernels.
I am analyzing the kernels with the CodeXL Analyzer. Kernel 1 uses 227 VGPRs, and I am trying to reduce that number.
When I comment out the call to function A in Kernel 2, Kernel 1's VGPR count jumps from 227 to 247. How can that possibly be?
I am changing one kernel and the other kernel's counters change. Kernel 2 is just a test and is never run.
I realized the same thing happens when I paste the function's code directly into the kernel.
It sounds like the function is inlined when it is used in only one place, and not inlined otherwise. For some reason, when the function is inlined, an extra 20 VGPRs are used.
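To make the setup concrete, here is a minimal sketch of the situation as I understand it. The names function_a, kernel1 and kernel2 and the function body are made up for illustration; the real code is not shown in this thread.

```c
// Hypothetical stand-in for "function A"; the real body is not posted.
float function_a(float x)
{
    float t = x * 0.5f;   // a temporary the compiler may choose to keep in a register
    return t * t + x;
}

// Kernel 1: the kernel whose VGPR count is being measured.
__kernel void kernel1(__global const float *in, __global float *out)
{
    size_t i = get_global_id(0);
    out[i] = function_a(in[i]);
}

// Kernel 2: test kernel that is never enqueued, but is still compiled
// as part of the same program.
__kernel void kernel2(__global const float *in, __global float *out)
{
    size_t i = get_global_id(0);
    // Commenting out this call leaves kernel1 as the only call site.
    out[i] = function_a(in[i]) + 1.0f;
}
```

If the compiler accepts the Clang/GCC-style `__attribute__((noinline))` or `__attribute__((always_inline))` on function_a (an assumption; nothing in this thread confirms the AMD compiler honors them), marking the function explicitly and re-running the analyzer would be one way to test whether inlining is what moves the count.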
I agree that doesn't sound right.
The analyzer, however, performs static analysis only: the assumption is that all kernels will run. Whether you run Kernel 2 in your app or not is irrelevant to it.
In my experience, I've seen function calls being inlined more than once; it likely depends on the driver. I couldn't figure out what was really going on, and I would appreciate some guidelines about that.
The AMD compiler is very lazy with evaluations. It will save temporaries in most cases, even when they are as cheap as x/2. I speculate that the compiler was NOT doing that for function calls, but does it for inlined code because the saved temporaries end up in the "main" private pool.