
AMD OpenCL compiler wasting VGPRs like crazy

Question asked by madshi on Jan 8, 2014
Latest reply on Feb 2, 2014 by msoos

(Windows 8.1 x64, HD7770, driver: 13.251-131206a-165817C-ATI)

 

Can anybody explain to me why the following kernel compiles to 140 VGPRs and 32 SGPRs?

 

From what I can see, this kernel should only need about 14 VGPRs, plus a few temporaries for intermediate calculations. So how does the compiler manage to spend 140 VGPRs on this code?

 

This is just one example of many where I'm scratching my head wondering what's going on. Trying to reduce VGPR usage (to increase occupancy) seems to be almost impossible with anything but the most trivial kernels, because the compiler doesn't seem to follow any logic with how it spends VGPRs. At least I don't understand it. If I'm missing something important, please let me know. I'm quite willing to learn and improve my skills.
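To put the occupancy concern into numbers, here is a rough sketch in plain C of how VGPR usage caps the number of resident wavefronts. The limits it uses (256 VGPRs per lane per SIMD, allocation in blocks of 4, at most 10 wavefronts per SIMD) are assumptions taken from general first-generation GCN documentation, not from anything measured in this thread, and the function names are made up for illustration. SGPR and local memory limits are ignored.

```c
#include <stdio.h>

/* Assumed first-generation GCN limits (not stated in the post):
   256 VGPRs per lane per SIMD, allocated in blocks of 4,
   and at most 10 wavefronts resident per SIMD. */
enum {
    VGPRS_PER_SIMD     = 256,
    VGPR_GRANULARITY   = 4,
    MAX_WAVES_PER_SIMD = 10
};

/* Hypothetical helper: upper bound on resident wavefronts per SIMD,
   considering VGPR usage only (SGPRs and LDS are ignored). */
int waves_per_simd(int vgprs_per_work_item)
{
    if (vgprs_per_work_item <= 0)
        return MAX_WAVES_PER_SIMD;

    /* Round the request up to the allocation granularity. */
    int allocated = (vgprs_per_work_item + VGPR_GRANULARITY - 1)
                    / VGPR_GRANULARITY * VGPR_GRANULARITY;

    /* Integer division: how many such allocations fit in the register file. */
    int waves = VGPRS_PER_SIMD / allocated;
    return waves < MAX_WAVES_PER_SIMD ? waves : MAX_WAVES_PER_SIMD;
}

/* Hypothetical helper: print the bound for one register count. */
void print_occupancy(int vgprs)
{
    printf("%d VGPRs: at most %d wavefront(s) per SIMD\n",
           vgprs, waves_per_simd(vgprs));
}
```

Under these assumptions, waves_per_simd(140) gives 1 and waves_per_simd(14) gives 10, which is why the gap between the expected and the actual register count matters so much here.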

 

Thanks!

 

P.S.: Please don't expect this kernel to do anything useful; I've shortened it to isolate the problematic code.

 

 

__kernel __attribute__((reqd_work_group_size(8, 8, 1)))
void errorDiffusion(__read_only image2d_t srcImg, __write_only image2d_t dstImg)
{
  __local float3 data[64][8];
  int index = get_global_id(0) * 8 + get_global_id(1);
  int posx = get_global_id(0) * 8;
  int posy = get_global_id(1) * 8;

  for (int i1 = 0; i1 < 16; i1++)
  {
    float3 pix, rounded, error;
    for (int i2 = 0; i2 < 8; i2++)
      data[index][i2] += read_imagef(srcImg, srcSampler, (int2) (posx + i2, posy)).s012 * 255.0f;
    pix = data[index][0];
    rounded = round(pix);
    error = pix - rounded;
    write_imagef(dstImg, (int2) (posx, posy), (float4) (rounded / 255.0f, 0));
    pix = data[index][1] + error * 0.777;
    data[index][0] = error *  0.7861;
    data[index][1] = error * -0.6098;
    for (int i2 = 1; i2 < 8; i2++)
    {
      rounded = round(pix);
      error = pix - rounded;
      write_imagef(dstImg, (int2) (posx + i2, posy), (float4) (rounded / 255.0f, 0));
      pix = data[index][i2 + 1] + error * 0.777;
      data[index][i2 - 1] += error *  0.0090;
      data[index][i2    ] += error *  0.7861;
      data[index][i2 + 1]  = error * -0.6098;
    }
    posy++;
  }
}
