8 Replies Latest reply on Oct 18, 2012 6:32 PM by Bdot

    Bug on compiler

    hojin

      I just compiled this small codes. But its binary size is too big.

      The below code is the part of my code having bug.

      This code works properly on NVIDIA or INTEL.

       

      typedef struct

      {

                short block[12][64];

      }MB_information;

      typedef struct

      {

                int block_count;

      }idct_new_params;

      typedef struct

      {

      }picture_params;

      void Saturate(short* Block_Ptr)

      {

                          Block_Ptr[63] ^= 1;

      }

      __kernel void idct_new(idct_new_params inp, picture_params pp, __global MB_information * MBinfo, __global unsigned char* frf, __global unsigned char* brf, __global unsigned char* crf)

      {

                          //int MBA=inp.MBA;

                          //int MBA= get_global_id(0);

                          int block_count=inp.block_count;

                          //MB_information MB=MBinfo[MBA];

                          MB_information MB=MBinfo[0];

                          for (int comp=0; comp<block_count; comp++)

                          {

                                    Saturate(MB.block[comp]);

                          }

                return;

      }

       

      I attatched some codes and generated IR.

      It seems that the amd opencl compiler generates garbage instructions.

        • Re: Bug on compiler
          binying

          How did you generate IR?

          I am trying to reproduce the problem.

          • Re: Bug on compiler
            Bdot

            Is there still garbage if you actually have a type for picture_params ?

            I mean, not an empty struct with no storage size?

            • Re: Bug on compiler
              Bdot

              I don't think the compiler does anything wrong here. I just had a closer look at what you're doing here: You copy the whole MBinfo array into private storage space per thread, into variable MB. This is way too much data than can be stored into registers (as you don't specify a workgroup size, the compiler has to assume max 256 threads => 256*1.5kB). Therefore, the copying of the data from global to private memory has to use scratch memory (see the isa file:

              ScratchSize      = 1196

              which is a lot). Access to scratch memory uses the same path as global memory, therefore you see tons of load and store operations, and therefore you have about 7000 instructions for this memory transfer, followed by the 12 lines of your loop that actually do the xor.

               

              I suggest to change the program to

               

              {

                      //int MBA=inp.MBA;

                      //int MBA= get_global_id(0);

                      int block_count=inp.block_count;

               

                      //MB_information MB=MBinfo[MBA];

                      __global MB_information *MB=MBinfo;

                      /* copy or add block data into picture */

                      for (int comp=0; comp<block_count; comp++)

                      {

                          Saturate(MB[0].block[comp]);

                      }

                  return;

              }

               

              (or something like that, I did not test the above). Maybe this only copies one row at a time, which may fit into registers. Or, you stick to using pointers until you actually access the array entry. The point is, that you do not have endless private space in each thread. You should also think about if each thread really needs access to the whole array. If so, then copying it from global to local memory first, and then working on it may be an option.

               

              Of course, the compiler could be smarter and not copy the whole thing to scratch memory at all, but instead use it directly. Oh no, that is not allowed as then each thread would access the same global location. Scratch memory is private to each thread ...

               

              And looking again at the program I wonder why the compiler does not replace the whole kernel by a NOP - as nothing is ever written to any output. It would be interesting if the nvidia or intel compiler noticed that, and maybe work because they don't do anything. Or if they can make use of more private memory and therefore do not need to copy the data to scratch.

               

              Anyway, private space (aka registers) is precious; use it wisely.