Archives Discussions

carlodelmundo · ‎01-21-2013

Hi,

I am developing an OpenCL application on an AMD Radeon HD 6970 using AMD APP SDK v2.7.

I need to be able to control the occupancy of workgroups without introducing overhead. For example, if I declare the following:

__private float occupancy_correction[20];

I want the OpenCL compiler to leave it alone and allocate the necessary registers when the kernel is launched. I've noticed, however, that because the array is dead code, the compiler optimizes it out.

Is it possible to trick the compiler into leaving this code unoptimized, so that the kernel uses more registers than necessary?

Thanks,

himanshu_gautam · ‎01-21-2013

You could declare the private array as volatile; the compiler should then leave it alone.

But this is not an elegant way to control occupancy, and performance might not be portable across devices.

Using fewer private registers is generally better, because it gives you higher occupancy.

Any specific reasons for using dummy private registers?

carlodelmundo · ‎01-22-2013

Thanks Himanshu.

The volatile keyword works when I turn scalar values into arrays, e.g. changing:

float theta = ...

to

volatile float theta[64];

theta[0] = ...

In the example above, the compiler doesn't optimize out the unused registers, which is the behavior I'm looking for. However, this only works when the variable (such as theta) is actually referenced by the code.

Dead code such as the example below:

volatile __private float occupancy_correction[12];

... is still optimized out by the compiler. Is there another way to achieve controlled occupancy execution? I'm profiling my kernel code in a set of distinct stages. The conditions (such as occupancy) of each stage must match the conditions when the full kernel is profiled.

Thanks
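One approach that is sometimes tried for this is to make the padding array observably live: fill it from run-time data and read it back inside a branch that the host never enables. The sketch below is an assumption-laden illustration, not something verified on APP SDK v2.7. The kernel name `padded_stage`, the `sink` buffer, and the `never_true` argument are all hypothetical, and the `#ifndef __OPENCL_VERSION__` shims exist only so the same text can be compiled as plain C for a quick sanity check.

```c
/* Host-side shims so the sketch also compiles as plain C;
   an OpenCL C compiler defines __OPENCL_VERSION__ and skips them. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
#define __private
static size_t fake_gid = 0;  /* hypothetical stand-in for the work-item id */
static size_t get_global_id(unsigned dim) { (void)dim; return fake_gid; }
#endif

#define PAD 20  /* number of padding floats, as in the original question */

__kernel void padded_stage(__global const float *in,
                           __global float *out,
                           __global float *sink,  /* hypothetical extra buffer */
                           int never_true)        /* host passes 0 at run time */
{
    size_t gid = get_global_id(0);
    __private float occupancy_correction[PAD];

    /* Fill the padding from run-time data so it cannot be constant-folded. */
    for (int i = 0; i < PAD; ++i)
        occupancy_correction[i] = in[gid] + (float)i;

    out[gid] = in[gid] * 2.0f;  /* placeholder for the real stage's work */

    /* never_true is only known at run time, so the compiler cannot delete
       this read of the array as dead code. */
    if (never_true) {
        float acc = 0.0f;
        for (int i = 0; i < PAD; ++i)
            acc += occupancy_correction[i];
        sink[gid] = acc;
    }
}
```

The compiler is still free to move the fill into the branch or to place the array in scratch memory instead of registers, so the register count reported by the profiler should be checked before relying on this.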

carlodelmundo · ‎01-22-2013

It appears that I can control occupancy by focusing on local memory usage rather than registers. For example, the following dead code:

__local float4 occupancy_correction[2048];

will not be optimized out by the compiler.
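To see why an array of this size pins occupancy, the arithmetic can be sketched in plain C. The helper names below are hypothetical. The 32 KB of local memory per compute unit used in the usage note is an assumption about the HD 6970 that the host can check with `clGetDeviceInfo` and `CL_DEVICE_LOCAL_MEM_SIZE`. The 16-byte element size is `sizeof(float4)`.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes of local memory taken by an array of `count` elements. */
static size_t local_array_bytes(size_t count, size_t elem_size)
{
    return count * elem_size;
}

/* Work-groups that fit on one compute unit when local memory is the
   only limiting resource. Returns SIZE_MAX if the kernel uses none. */
static size_t groups_per_cu_by_lds(size_t lds_bytes_per_cu,
                                   size_t lds_bytes_per_group)
{
    if (lds_bytes_per_group == 0)
        return SIZE_MAX;  /* local memory imposes no limit */
    return lds_bytes_per_cu / lds_bytes_per_group;
}
```

With these assumptions, `local_array_bytes(2048, 16)` is 32768 bytes, so `groups_per_cu_by_lds(32768, 32768)` gives one resident work-group per compute unit. Halving the array to 1024 elements would allow two. Registers and wavefront limits can cap occupancy further, so this is an upper bound only.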

Outsmarting the OpenCL Compiler on AMD GPUs