3 Replies Latest reply on Jan 22, 2013 9:56 AM by carlodelmundo

    Outsmarting the OpenCL Compiler on AMD GPUs

    carlodelmundo

      Hi,

       

      I am developing an OpenCL application on an AMD Radeon HD 6970 using AMD APP SDK v2.7.

       

      I need to be able to control the occupancy of workgroups without introducing overhead.  For example, if I declare the following:

       

      __private float occupancy_correction[20];

       

      I want the OpenCL compiler to leave it alone and allocate the necessary registers when the kernel is launched.  I've noticed, however, that because the array is dead code, the compiler optimizes it out.

       

      Is it possible to trick the compiler into unoptimizing the code and using more registers than necessary?

       

      Thanks,

        • Re: Outsmarting the OpenCL Compiler on AMD GPUs
          himanshu.gautam

          You could declare the private array as volatile. The compiler will then leave it alone.

           

          But this is not an elegant way to control occupancy. Performance might not be portable across devices.

          Using fewer private registers is generally better: you get more occupancy.
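To make the register/occupancy trade-off concrete, here is a minimal sketch in plain C (which is also valid OpenCL C) of the usual back-of-the-envelope estimate: the number of wavefronts that can be resident at once is the per-work-item register budget divided by the registers each work-item uses, capped by a hardware limit. The function name and both limits are illustrative assumptions, not figures taken from the HD 6970 documentation; the real values should come from the device's programming guide or the profiler.

```c
/* Illustrative model only: resident wavefronts as limited by register use.
   reg_budget_per_work_item and max_waves are assumed inputs, not
   documented values for any particular GPU. */
static unsigned waves_limited_by_registers(unsigned regs_per_work_item,
                                           unsigned reg_budget_per_work_item,
                                           unsigned max_waves)
{
    unsigned waves;

    if (regs_per_work_item == 0)
        return max_waves;  /* registers are not the limiting resource */

    /* Integer division: a wavefront either fits entirely or not at all. */
    waves = reg_budget_per_work_item / regs_per_work_item;
    return waves < max_waves ? waves : max_waves;
}
```

With a hypothetical budget of 256 registers and a cap of 32 wavefronts, a kernel using 20 registers per work-item would be limited to 12 resident wavefronts, while one using 5 would hit the cap of 32. Padding a kernel with extra registers moves it down this curve, which is why the array has to survive compilation for the trick to work.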

           

          Any specific reasons for using dummy private registers?

            • Re: Outsmarting the OpenCL Compiler on AMD GPUs
              carlodelmundo

              Thanks Himanshu.

               

              The volatile keyword works when I make scalar values into arrays.  e.g.:

               

              float theta = ...

               

              to

               

              volatile float theta[64];

              theta[0] = ...

               

              In the example above, the compiler doesn't optimize out the unused registers, which is the behavior I'm looking for. However, this only works when the variable (such as theta) is actually referenced by the code.

               

              Dead code such as the example below:

               

              volatile __private float occupancy_correction[12];

               

              ... is still optimized out by the compiler.  Is there another way to control occupancy? I'm profiling my kernel code in a set of distinct stages.  The conditions of each stage (such as occupancy) must match the conditions when the full kernel is profiled.
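One possible direction, following from the observation above that a volatile array survives only when it is referenced, is to route a single value of the stage through the padding array so that it is no longer dead code. The sketch below is an assumption, not a confirmed fix: the kernel name, its arguments, and the placeholder computation are hypothetical, and the `#ifndef __OPENCL_VERSION__` shim exists only so the same source can be compiled as plain C for a quick host-side check. Whether the compiler really reserves registers for all 12 elements would still need to be confirmed from the GPR count reported by the profiler or the ISA dump.

```c
#ifndef __OPENCL_VERSION__
/* Host-only shim (an assumption for quick testing, not part of the kernel):
   lets the same source compile as plain C. */
#include <stddef.h>
#define __kernel
#define __global
static size_t get_global_id(unsigned int dim) { (void)dim; return 0; }
#endif

#define PAD_FLOATS 12  /* padding size used in the post above */

__kernel void profiled_stage(__global const float *in, __global float *out)
{
    size_t gid = get_global_id(0);

    /* Padding array: volatile AND referenced, so it is no longer dead code.
       Function-scope variables are __private by default in OpenCL C. */
    volatile float occupancy_correction[PAD_FLOATS];
    occupancy_correction[0] = in[gid];

    /* Placeholder for the stage's real work (hypothetical). */
    out[gid] = occupancy_correction[0] * 2.0f;
}
```

The cost is one extra store and load per work-item, so the stage is not strictly overhead-free, and the padding size would have to be tuned per stage until the reported register count matches that of the full kernel.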

               

              Thanks