Archives Discussions

mrrvlad
Adept I

allow control of VGPR/SGPR usage by kernel

Hi,

I'd like to surface this question once again.

Is a compiler feature to specify the maximum register usage of a kernel, similar to maxrregcount for CUDA, planned for a future release?

We are currently investigating whether OpenCL is worth pursuing for datacenter compute scenarios versus CUDA, and this came up as the biggest blocker. For more advanced kernels it becomes beneficial to spill or recompute data held in registers in order to increase occupancy.

If maxrregcount is not used with CUDA, the OpenCL port and the CUDA implementation perform about the same, using about 35% of the available TFLOP/s. When maxrregcount is set to allow 100% occupancy on the NVIDIA card, the kernel is able to use 85% of the available compute. While one could try to write more optimized code by hand, that is hard to do given the lack of feedback on register usage by different parts of a kernel on AMD, and it is not something we would spend time on unless we have to.
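For concreteness, a minimal host-side sketch of what is being asked for. The nvcc flag in the comment is real; the -cl-amd-maxregcount option is made up here purely to illustrate the request, and the cap of 32 is an arbitrary example value:

    #include <CL/cl.h>

    /* What CUDA offers today, at compile time:
     *     nvcc --maxrregcount=32 -c kernel.cu
     * What this post asks of AMD's OpenCL compiler -- the option
     * name below is hypothetical, no such flag exists today: */
    cl_int build_with_register_cap(cl_program program, cl_device_id device)
    {
        return clBuildProgram(program, 1, &device,
                              "-cl-amd-maxregcount=32", /* hypothetical flag */
                              NULL, NULL);
    }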

Is it possible to get a definitive answer from someone on the inside about whether this feature is planned, and whether we could get a beta drop in the near future, before we have to finalize our technology decision?

11 Replies
dipak
Big Boss

Hi,

My apologies for this late reply.

As I've been informed, there are no immediate plans to allow controlling VGPR usage directly via a compiler flag. However, optimizing VGPR usage is discussed in the compiler team every once in a while, so I suspect the feature may arrive in the not-too-distant future.

The good news is that a future CodeXL release will include the following features to help developers:

  • Highlight the resource that constrains the number of in-flight wavefronts. This is the focus of the revised Statistics view in CodeXL’s Analyze mode.
  • Highlight the connection between an OpenCL source line and the corresponding generated IL and ISA blocks (pending driver development to support this feature).

I hope this feedback helps you make your decision.

Regards,

Bumping - isn't it time to release this yet?  NVIDIA's ancient OpenCL implementation actually allows this - shouldn't that make you see green (hehe)?

Seriously though, the compiler still does a lot of nutty things in response to innocuous line changes - being able to constrain the registers used seems to be the only way to corral it.

I still have no idea what code actually causes these VGPR allocations, and I know of no way to narrow in on it.  As for the resource that constrains the number of in-flight wavefronts, that's definitely VGPRs!

0 Likes

Double bump!  My team is beginning to appreciate the difficulty of managing registers.  We have tried to shift variables to LDS, and that resulted in increasing our VGPRs from 55 to 56 =:-/  At least we were able to make a change to VGPR usage.  With a little more info, maybe we can change it in the right direction.

0 Likes

Eh, it should not be the developer's job to fight with the compiler.

For me, the marginally useful things have been:

1) Moving code into/out of a function call - apparently AMD's OpenCL is not really inlining functions.

2) Changing code to use more SGPRs even if it's less efficient - this was the only reliable way to reduce VGPR use aside from a total algorithm rewrite.

3) Adding if statements that change per-thread control flow in a useless but non-obvious-to-the-compiler way (see the sketch below). Sometimes the compiler will change strategy and you get a drastically different register count. The biggest issue is that this state is very easy to break with other changes.
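A minimal sketch of technique 3; the never_set argument is my own illustrative name, and the host always passes 0 for it:

    __kernel void reg_strategy_nudge(__global float *data,
                                     int never_set /* host always passes 0 */)
    {
        int gid = (int)get_global_id(0);
        float acc = data[gid];
        /* A branch the compiler cannot prove dead; it can push the
         * register allocator toward a different strategy for the
         * whole kernel. */
        if (never_set)
            acc = 0.0f;
        data[gid] = acc * acc;
    }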

Unfortunately, Intel is going the same way by not letting you select the SIMD width at compile time (and the register limit that comes with it); you have to rely on compiler heuristics.

0 Likes


mrrvlad wrote:

Eh, it should not be the developer's job to fight with the compiler. [...]



Re fighting with the compiler: I agree that's the ideal, but the reality for years has been fighting with NVIDIA's implementation as well, and similarly fighting with FPGA tools over comparable resource issues - it doesn't look like this will be a solved problem for many years to come.  I think something in the spirit of reqd_work_group_size would fit the bill best: a language extension for constraining VGPRs and SGPRs, so we could control it with fine granularity rather than per compiled file.  Also, just to nip counterpoints in the bud if AMD people read this: we all know constraining registers will not necessarily/automagically increase performance, but it is an important constraint to toggle, and given the right functions (especially those the compiler inexplicably handles badly) it can have drastic effects.  It's an important knob/bandaid, tweaked only when the compiler isn't doing what it should be doing.
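For illustration, the kind of per-kernel annotation I mean. reqd_work_group_size exists in OpenCL today; the register-cap attribute in the comment is purely hypothetical:

    /* Existing: fixes the work-group size so the compiler can specialize. */
    __kernel __attribute__((reqd_work_group_size(64, 1, 1)))
    void constrained_kernel(__global float *out)
    {
        /* Hypothetical analogue for registers, e.g.
         *     __attribute__((max_vgprs(64)))
         * -- no such attribute exists in AMD's OpenCL today. */
        out[get_global_id(0)] = 0.0f;
    }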

  • Another thing I've found in a few different places: mad24/mul24 can be used to help reduce register usage in offset/index calculations.
  • Surprisingly, #pragma unroll can also notably reduce usage OR increase it - most of the time it seems to reduce it.
  • Use more compile-time constants too, through macros rather than passed-in parameters - you lose a little generality in kernel functions when things like dimensions change, but it's not too bad given you already have a JIT compiler in many environments.  enums also work for this purpose, much like how they are abused in C++ templates (the sketch below combines this with mad24 indexing).
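A minimal sketch combining the first and last tips. WIDTH is assumed to be injected at build time (for example with the build option -D WIDTH=1024); the kernel name and values are mine:

    /* Built with e.g. "-D WIDTH=1024", so WIDTH is a compile-time
     * literal instead of a kernel argument. */
    __kernel void scale_rows(__global float *data, float s)
    {
        /* mul24/mad24 are valid while operands fit in 24 bits and
         * can reduce register pressure in address arithmetic. */
        int idx = mad24((int)get_global_id(1), WIDTH,
                        (int)get_global_id(0));
        data[idx] *= s;
    }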
0 Likes


jason wrote:

Re fighting with the compiler: I agree that's the ideal, but the reality for years has been fighting with NVIDIA's implementation as well [...]

NVIDIA's implementation accepts the -cl-nv-maxrregcount build option and has respected it since 2010.
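For anyone searching later: it goes in the ordinary build-options string. A minimal host-side sketch, with 32 as an arbitrary example cap:

    #include <CL/cl.h>

    /* NVIDIA-only build option; other vendors' compilers reject it. */
    cl_int build_with_nv_reg_cap(cl_program program, cl_device_id device)
    {
        return clBuildProgram(program, 1, &device,
                              "-cl-nv-maxrregcount=32", NULL, NULL);
    }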

0 Likes

As I indicated in my first post, yes - team green has supported this since forever.  What I meant is that you must corral the compiler into doing the right thing by coarsely constraining it, and that seems to be a common theme across GPGPU and related toolchains (including OpenCL compilers), such as on FPGAs.

0 Likes

Let me be clear: I posted this only to make sure AMD recognizes that more and more people are running into this.

0 Likes

When swapping data out to memory and back, make sure to calculate the addresses in a different way, or the compiler may use registers to temporarily store the actual byte offsets. A kernel parameter whose value is set to zero from the host can be handy for this (see the sketch below). I've noticed this with global memory, but it can work on LDS as well.
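A minimal sketch of that trick; base_offset is my own illustrative name, and the host always passes 0 for it:

    __kernel void opaque_offset(__global float *buf,
                                int base_offset /* host always passes 0 */)
    {
        int gid = (int)get_global_id(0);
        /* Because the compiler cannot prove base_offset is zero, it
         * recomputes the address here rather than keeping the byte
         * offset parked in a VGPR across the kernel. */
        float v = buf[base_offset + gid];
        buf[base_offset + gid] = v * 2.0f;
    }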


omnidirectional wrote:

We have tried to shift variables to LDS, and that resulted in increasing our VGPRs from 55 to 56 =:-/  At least we were able to make a change to VGPR usage.  With a little more info, maybe we can change it in the right direction.


I can unfortunately confirm this.

I have only a rough understanding of the ISA, but it seems to me the compiler is too aggressive about keeping every value it thinks will be reused in a VGPR, as confirmed by realhet.

This is especially wasteful when VGPR scarcity prevents the VALU from being saturated; at that point, recomputing those values each time would probably be faster.

I have so far used uniform kernel parameters in a similar way, albeit not for this specific problem. I'm quite positive this suggestion will reap some benefits for me!

0 Likes

Bumping this again. Any plans to control VGPR usage via a command-line parameter, like NVIDIA's compiler has supported since 2010?