cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

hojin
Journeyman III

Bug on compiler

I just compiled this small codes. But its binary size is too big.

The below code is the part of my code having bug.

This code works properly on NVIDIA or INTEL.

typedef struct

{

          short block[12][64];

}MB_information;

typedef struct

{

          int block_count;

}idct_new_params;

typedef struct

{

}picture_params;

void Saturate(short* Block_Ptr)

{

                    Block_Ptr[63] ^= 1;

}

__kernel void idct_new(idct_new_params inp, picture_params pp, __global MB_information * MBinfo, __global unsigned char* frf, __global unsigned char* brf, __global unsigned char* crf)

{

                    //int MBA=inp.MBA;

                    //int MBA= get_global_id(0);

                    int block_count=inp.block_count;

                    //MB_information MB=MBinfo[MBA];

                    MB_information MB=MBinfo[0];

                    for (int comp=0; comp<block_count; comp++)

                    {

                              Saturate(MB.block[comp]);

                    }

          return;

}

I attatched some codes and generated IR.

It seems that the amd opencl compiler generates garbage instructions.

0 Likes
8 Replies
binying
Challenger

How did you generate IR?

I am trying to reproduce the problem.

0 Likes

You can get ir compile with "-save-temps" option.

0 Likes
Bdot
Adept III

Is there still garbage if you actually have a type for picture_params ?

I mean, not an empty struct with no storage size?

0 Likes
hojin
Journeyman III

Regadless of picture_params, it generates garbage.

It seems to have problem if struct, array and loop are combined.

0 Likes
Bdot
Adept III

I don't think the compiler does anything wrong here. I just had a closer look at what you're doing here: You copy the whole MBinfo array into private storage space per thread, into variable MB. This is way too much data than can be stored into registers (as you don't specify a workgroup size, the compiler has to assume max 256 threads => 256*1.5kB). Therefore, the copying of the data from global to private memory has to use scratch memory (see the isa file:

ScratchSize      = 1196

which is a lot). Access to scratch memory uses the same path as global memory, therefore you see tons of load and store operations, and therefore you have about 7000 instructions for this memory transfer, followed by the 12 lines of your loop that actually do the xor.

I suggest to change the program to

{

        //int MBA=inp.MBA;

        //int MBA= get_global_id(0);

        int block_count=inp.block_count;

        //MB_information MB=MBinfo[MBA];

        __global MB_information *MB=MBinfo;

        /* copy or add block data into picture */

        for (int comp=0; comp<block_count; comp++)

        {

            Saturate(MB[0].block[comp]);

        }

    return;

}

(or something like that, I did not test the above). Maybe this only copies one row at a time, which may fit into registers. Or, you stick to using pointers until you actually access the array entry. The point is, that you do not have endless private space in each thread. You should also think about if each thread really needs access to the whole array. If so, then copying it from global to local memory first, and then working on it may be an option.

Of course, the compiler could be smarter and not copy the whole thing to scratch memory at all, but instead use it directly. Oh no, that is not allowed as then each thread would access the same global location. Scratch memory is private to each thread ...

And looking again at the program I wonder why the compiler does not replace the whole kernel by a NOP - as nothing is ever written to any output. It would be interesting if the nvidia or intel compiler noticed that, and maybe work because they don't do anything. Or if they can make use of more private memory and therefore do not need to copy the data to scratch.

Anyway, private space (aka registers) is precious; use it wisely.

0 Likes

So I can use the kernel analyzer to compile it to get the *.isa file. What is the hardware u r using, cayman or cypress or anything else?

0 Likes

well, I can not repeat ur issue by compiling ur code to Cypress, using kernel analyzer. The generated disasembler is less than twenty lines, which I think it is normal.

0 Likes

the isa in the question's attachment is clearly GCN, and it has the name pitcairn in it ...

0 Likes