I realized the same thing happens when I copy the function code into the kernel directly.
It sounds like the function is inlined when it is used in only one place, and not inlined otherwise. For some reason, when the function is inlined, 20 extra VGPRs are used.
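
To make the comparison concrete, here is a minimal sketch in plain C of the three cases being discussed. All names (`helper`, `kernel1`, `kernel2`, `kernel1_manual_inline`) are hypothetical stand-ins, not the actual code from this thread, and the loops stand in for work-items; in real OpenCL C the kernels would carry `__kernel` and `__global` qualifiers. It only shows the source-level shapes, it does not reproduce the VGPR numbers.

```c
#include <assert.h>

/* Hypothetical helper standing in for the function discussed above. */
static float helper(float x)
{
    float half = x / 2.0f;      /* a temporary the compiler may keep live */
    return half * half + half;
}

/* Stand-in for the first kernel: uses the helper once per element. */
void kernel1(const float *in, float *out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = helper(in[i]);
}

/* Stand-in for a second kernel: a second use of the same helper. */
void kernel2(const float *in, float *out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = helper(in[i]) + 1.0f;
}

/* kernel1 with the helper body copied in by hand. */
void kernel1_manual_inline(const float *in, float *out, int n)
{
    for (int i = 0; i < n; ++i) {
        float half = in[i] / 2.0f;
        out[i] = half * half + half;
    }
}

/* Sanity check: the call and the hand-inlined version give the same result. */
void check_kernels_agree(void)
{
    const float in[3] = { 2.0f, 4.0f, 8.0f };
    float a[3], b[3];
    kernel1(in, a, 3);
    kernel1_manual_inline(in, b, 3);
    for (int i = 0; i < 3; ++i)
        assert(a[i] == b[i]);
}
```

Building the OpenCL equivalents with and without `kernel2` present, and comparing the reported VGPR counts, would be one way to check whether the count really depends on how many places use the helper.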
I agree that doesn't sound right.
The analyzer, however, performs static analysis only: the assumption is that all kernels will run. Whether you run kernel2 in your app or not is irrelevant to it.
In my experience I've seen function calls being inlined more than once; it likely depends on the driver. I couldn't figure out what was really going on and I would appreciate some guidance on that.
The AMD compiler is very lazy with evaluations. It will save temporaries on most occasions, even when they're just an x/2. I speculate the compiler was NOT doing that for function calls but does it for inlined code, because the saved temporaries end up in the "main" private pool.
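
As a rough illustration of what "saving a temporary" means at the source level, here is a small sketch in plain C. The function names are hypothetical, and a host C compiler is free to turn both versions into the same machine code, so this shows only the trade-off (one more live value versus recomputing a cheap expression), not what the AMD compiler actually emits.

```c
/* Keeps x/2 in a named temporary and reuses it: one more value stays live. */
float with_saved_temporary(float x, float y)
{
    float half = x / 2.0f;
    return half * y + half;
}

/* Recomputes x/2 at each use instead of keeping it live. */
float with_recompute(float x, float y)
{
    return (x / 2.0f) * y + (x / 2.0f);
}
```

If the speculation above is right, an inlined function's saved values would count against the calling kernel's register budget in the way `half` does here.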