Archives Discussions

boxerab · ‎08-26-2015

Are there guidelines for reducing VGPR usage on AMD hardware? CodeXL is telling me that VGPR usage is limiting concurrency for my kernels, but I am not sure what actions to take.
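
For context, the link between VGPR count and concurrency can be estimated with simple arithmetic. The sketch below is not from the thread. It assumes the commonly documented GCN figures: 256 VGPRs per lane on each SIMD, registers allocated in blocks of 4, and at most 10 wavefronts resident per SIMD. The function name `waves_per_simd` is made up for illustration, and other limits (SGPRs, LDS, work-group size) can lower the real number further.

```c
/* Assumed GCN figures (not stated in the thread):
   256 VGPRs per lane per SIMD, allocation granularity of 4,
   and at most 10 resident wavefronts per SIMD. */
enum {
    VGPRS_PER_SIMD_LANE = 256,
    VGPR_GRANULARITY    = 4,
    MAX_WAVES_PER_SIMD  = 10
};

/* Upper bound on resident wavefronts per SIMD from VGPR usage alone. */
static int waves_per_simd(int vgprs_per_work_item)
{
    if (vgprs_per_work_item <= 0)
        return MAX_WAVES_PER_SIMD;

    /* Round the request up to the allocation granularity. */
    int allocated = (vgprs_per_work_item + VGPR_GRANULARITY - 1)
                    / VGPR_GRANULARITY * VGPR_GRANULARITY;

    int waves = VGPRS_PER_SIMD_LANE / allocated;
    return waves > MAX_WAVES_PER_SIMD ? MAX_WAVES_PER_SIMD : waves;
}
```

Under these assumptions a kernel using 64 VGPRs can keep 4 wavefronts per SIMD in flight, while one more register (65, rounded up to 68) drops that to 3, which is why small changes in register count can matter.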

realhet · ‎08-27-2015

Yes, use black magic!

Otherwise it depends on the situation. Here's an example:

The kernel reads a lot of values from RAM, does calculations on them, and finally writes them back to memory at the same addresses.

In this case the compiler will likely reserve additional registers to remember those addresses, so that they need to be calculated only once. But during the long, time-consuming calculations they just use up register space for nothing.

You can solve this by adding an extra kernel parameter to the output addresses. The host always sets it to 0, but the compiler can't know that, so it will think that the input and output addresses are different and will not cache them in registers.
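
A minimal sketch of that trick, assuming a simple in-place kernel. The kernel name `scale_in_place`, the parameter `out_offset`, and the dummy arithmetic are all hypothetical, not taken from the post. The `#ifndef __OPENCL_VERSION__` part is only a host-side shim so the same text compiles as plain C for a quick check; an OpenCL compiler skips it. Whether this actually lowers the VGPR count depends on the compiler and driver version, so check the result in CodeXL.

```c
#ifndef __OPENCL_VERSION__
/* Host-side shims (testing only) so the kernel also builds as plain C. */
#include <stddef.h>
#define __kernel
#define __global
static size_t fake_gid;
static size_t get_global_id(unsigned dim) { (void)dim; return fake_gid; }
#endif

/* out_offset is always passed as 0 by the host. The compiler cannot
   prove that, so it has to treat the store address as different from
   the load address instead of keeping one address alive throughout. */
__kernel void scale_in_place(__global float *data, int out_offset)
{
    size_t gid = get_global_id(0);
    float v = data[gid];              /* load address: data + gid */

    for (int i = 0; i < 64; ++i)      /* stand-in for a long computation */
        v = v * 1.0001f + 0.5f;

    data[gid + out_offset] = v;       /* store address is formed here */
}
```

On the host, `out_offset` would be set to 0 with `clSetKernelArg` like any other scalar argument.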

Another way is to play with #pragma unroll on loops.

You can also wrap blocks in while loops that iterate only once.
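
The two loop tricks might look roughly like this. It is a sketch under assumptions: the kernel `two_stage` and its arguments are invented for illustration, and making the single-iteration loop depend on a kernel argument `one` (always 1 on the host) is a guess at how to keep the compiler from simply removing the loop, since the post does not say how it is written. The shim at the top only exists so the text also compiles as plain C, where a compiler may ignore the unroll pragma.

```c
#ifndef __OPENCL_VERSION__
/* Host-side shims (testing only) so the kernel also builds as plain C. */
#include <stddef.h>
#define __kernel
#define __global
static size_t fake_gid;
static size_t get_global_id(unsigned dim) { (void)dim; return fake_gid; }
#endif

__kernel void two_stage(__global const float *in, __global float *out,
                        int width, int one /* host always passes 1 */)
{
    size_t gid = get_global_id(0);
    float acc = 0.0f;

    /* Trick 1: ask the compiler to keep this loop rolled. */
    #pragma unroll 1
    for (int i = 0; i < width; ++i)
        acc += in[gid * width + i];

    /* Trick 2: wrap a block in a loop that runs exactly once. */
    int pass = 0;
    while (pass < one) {
        float t = acc * acc;          /* temporary used only in this block */
        acc = t + 1.0f;
        ++pass;
    }

    out[gid] = acc;
}
```

Neither change is guaranteed to help. The point is to try variations like these and compare the VGPR count the profiler reports for each.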

Sometimes the smallest change can make the biggest difference in GPR usage.

boxerab · ‎08-28-2015

Thanks for the advice. It sounds like a frustrating experience, trying to out-guess the compiler. Did you see improved performance when you made these hacks?

It would be nice to hear from someone at AMD about the state of their compiler: can we expect better usage of VGPRs in the future? For example, NVIDIA's CUDA compiler has a --maxrregcount flag that can force the compiler to use fewer registers; it would be useful to have something like this as an attribute on an OpenCL kernel.

realhet · ‎08-28-2015

It's a complicated optimizer that compiles amd_il to low-level ISA. And the higher-level compilers (opencl->llvmir->amd_il) have limited ability to control the amd_il->isa optimizer.

Also, different Catalyst versions can optimize the same code differently.

I don't know how HSA handles this, though. Maybe it's totally different and you can control it much more. (If anyone knows, I'd be interested too.)

There's a third way: you get complete, driver-independent freedom to optimize when you write everything in asm. But that's more work, with a lot more opportunities for stupid bugs.

Some other guys and I are making little assemblers. In case you're interested, just search for "gcn asm".

boxerab · ‎08-28-2015

Sounds cool. I will take a look.

Reducing VGPRS usage