8 Replies Latest reply on Apr 20, 2012 5:20 AM by Marix

    Register-saving optimization options???

    viscocoa

      Hi,

       

      The OpenCL compiler seems to be rather "unstable", in that a small change to the code may cause VGPR usage to change "randomly".

       

      Sometimes the # of VGPRs increases after some code is removed. It is unpredictable and very annoying, especially when you are at the edge of KernelOccupancy. One more VGPR can decrease the efficiency dramatically.

       

      I wonder if AMD provides a compilation option to optimize w.r.t. the # of registers.

       

      Thank you in advance.

        • Re: Register-saving optimization options???
          MicahVillmow

          No, we don't provide a method for doing so. I'll file a request to have this feature added.

            • Re: Register-saving optimization options???
              viscocoa

              Hi Micah,

               

              Thank you very much for filing a request! It would be cool if there were one.

               

              In my experience, repeating an action by copy and paste increases VGPR usage, while repeating it with a for loop does not. I have no idea why the compiler does not reuse the registers and work in the same place.

               

              Is there a guideline on how to save registers?
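              For example, a minimal hypothetical sketch of the two styles (toy kernels, not measured code):

```c
/* Hypothetical sketch: both kernels compute the same result.
   Repeating the body by copy and paste gives each copy its own
   temporary, which the compiler may assign to distinct VGPRs: */
__kernel void copied(__global float *d)
{
    size_t i = get_global_id(0);
    float a = d[i] * 2.0f; d[i] = a + 1.0f;  /* copy 1 */
    float b = d[i] * 2.0f; d[i] = b + 1.0f;  /* copy 2 */
    float c = d[i] * 2.0f; d[i] = c + 1.0f;  /* copy 3 */
}

/* The loop form expresses the reuse directly, so the temporary
   can occupy the same register on every iteration: */
__kernel void looped(__global float *d)
{
    size_t i = get_global_id(0);
    for (int k = 0; k < 3; ++k) {
        float t = d[i] * 2.0f;
        d[i] = t + 1.0f;
    }
}
```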

              • Re: Register-saving optimization options???
                viscocoa

                I think using {} to divide a kernel into blocks could hint to the compiler that registers for variables declared inside a block can be reused outside it. However, with {}s, the VGPR count often increases instead.

                 

                Using inline functions sometimes saves registers.

                 

                Any other points?
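                To illustrate the {} idea (a hypothetical sketch; whether the compiler actually reuses the register is exactly the open question):

```c
/* Hypothetical sketch: t is declared inside a block, so it is dead
   once the block ends, and its register could in principle be reused. */
__kernel void blocked(__global float *d)
{
    size_t i = get_global_id(0);
    float r;
    {
        float t = d[i] * d[i];   /* live only inside this block */
        r = t + 1.0f;
    }
    d[i] = r * 0.5f;             /* t's VGPR is reusable here - in theory */
}
```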

                  • Re: Register-saving optimization options???
                    notzed

                    Using pragma unroll with small values can reduce the register load due to unrolled loops.  But I've had some very unpleasant results using pragma unroll, so I just don't bother using it at all any more. FWIW I've also moved beyond the micro-benchmarking stage, and for the moment simply don't have time to do this with every bit of code I write unless I hit obvious speed problems.
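                    For example (hypothetical sketch; the pragma takes a small factor instead of letting the compiler unroll fully):

```c
/* Hypothetical sketch: cap the unroll factor at 4. Full unrolling
   can multiply the number of live temporaries, and with it the VGPRs. */
__kernel void partial_unroll(__global const float *in, __global float *out)
{
    size_t i = get_global_id(0);
    float acc = 0.0f;
    #pragma unroll 4
    for (int k = 0; k < 64; ++k)
        acc += in[i + (size_t)k * get_global_size(0)];
    out[i] = acc;
}
```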

                     

                    It depends a lot on the problem.  In some cases splitting the kernel up into separate invocations made a fair difference, as each was able to run with much more parallelism due to simpler kernels - and that more than outweighed any possible benefit from intermediate register storage.  The compiler doesn't seem very good in general at re-using register slots after they're no longer needed, or at least is over-aggressive at trying to avoid or group memory accesses.

                     

                    I'm not sure if using reqd_work_group_size or other hints affects the compiler as much as you'd hope.  Since that one in particular gives a good hint at how the kernel will be run, you'd think the compiler could do something with it to target the characteristics of specific hardware.

                    1 of 1 people found this helpful
                      • Re: Register-saving optimization options???
                        viscocoa

                        Hi notzed,

                         

                        Thank you very much for your helpful ideas.

                         

                        I also split kernels to reduce register requirements, but this incurs kernel setup overhead. Have you ever measured how much time invoking a kernel costs?

                         

                        I am aiming at 12 or more active wavefronts. The max workgroup size is 256, or 4 wavefronts. That means that using reqd_work_group_size, we can ask the compiler to use 64 or fewer registers (if I am correct). That is far from my target of 20 or fewer registers.

                         

                        Vis Cocoa

                          • Re: Register-saving optimization options???
                            notzed

                            I haven't done many kernel-overhead timings because it just hasn't been an issue for me.  I'm only running maybe hundreds of iterations on video-sized data.  And it's not something I can do anything about either - I base the design on the OpenCL capability (e.g. no global memory sharing requires a new kernel invocation; otherwise, iterate as much as possible or as is efficient within the kernel), and just hope the implementation will do a good job.  There's enough to worry about without changing algorithms to avoid kernel call overhead as well.

                             

                            The problem I was talking about was doing some relatively simple mathematical operations on 12xHD-video-sized float data (but more than just arithmetic, transcendental/trig), so the cost of 2 vs 1 kernel call was totally insignificant.  Even with the overheads of an extra kernel plus the write+read to communicate, the speed up was worth it (I can't recall the numbers).

                             

                            Theorising doesn't really work well with gpu code either: you usually have to try it and see.  KISS as a general rule of thumb applies though.

                             

                            reqd_work_group_size is only a hint, and the compiler could 'conceivably' use it for different things: a) just ensure the kernel will run on the hardware, b) remove unnecessary barrier instructions, c) try to tune the code for maximum parallelism (or at least, averagely useful parallelism) on the device by limiting register usage, or even d) optimise for maximum efficiency by basing optimal parallelism on ALU+fetch patterns in the compiled routine and using that to limit register usage.  E.g. you talk about aiming for 12 wavefronts, but that might not be optimal for all code.  I have no evidence it does this (and doubt it does, especially 'd'), but it certainly *could*.
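                            For reference, the hint itself is just an attribute on the kernel (standard OpenCL syntax; the 64x1x1 shape is only an example):

```c
/* Declaring the launch shape up front: the compiler knows at compile
   time that this kernel always runs with 64x1x1 work-groups. */
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void fixed_shape(__global float *d)
{
    size_t i = get_global_id(0);
    d[i] *= 2.0f;
}
```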

                            1 of 1 people found this helpful
                              • Re: Register-saving optimization options???
                                viscocoa

                                Hi Notzed,

                                 

                                Thank you very much for the explanation.

                                 

                                When global synchronization is required, a kernel may be invoked repeatedly. In that case, the overhead is something to be concerned about.
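                                i.e. something like this on the host side (a hypothetical fragment using the standard OpenCL 1.x API; error checking and setup omitted):

```c
/* Hypothetical fragment: each enqueue acts as a global synchronization
   point, so the per-launch overhead multiplies with the step count. */
for (cl_uint step = 0; step < n_steps; ++step) {
    clSetKernelArg(kernel, 2, sizeof(cl_uint), &step);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);
}
clFinish(queue);   /* wait for the whole sequence to complete */
```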

                                 

                                More active wavefronts can hopefully hide memory access latency. I think it is always good to make the # of wavefronts large, if there are no other issues, such as LDS usage, to worry about.

                                 

                                Have a good weekend!

                        • Re: Register-saving optimization options???
                          Marix

                          IMHO it would be important that such a feature not only limit the actual register usage, but also provide an option to increase the cost associated with using additional registers in the optimizer.

                           

                          I am working on an extremely bandwidth-bound application, in which most of the kernels are always just around the register usage limit. Often minimal changes in the code will dramatically increase the register usage, resulting in tons of scratch registers and plummeting performance. The major problem is that the register usage characteristics in those cases highly depend on the way the code is written, and it is completely random whether #pragma unroll N, #pragma unroll 1, or duplicated code gives the better results.