Archives Discussions

vmiura · ‎04-03-2013

Hello,

I am hitting an odd bug running one of my OpenCL kernels on a new HD 7790. This is a kernel that I've verified on an HD 7770, and also on some Fermi and Kepler cards.

After a lot of narrowing down, I strongly suspect it's some kind of compiler bug. Unfortunately it doesn't look like CodeXL will disassemble Bonaire ISA yet, so I can't confirm whether it's doing something odd. I also can't debug the kernel.

Are there any known issues with register clobbering or similar? I have 'AMD APP SDK Runtime 10.0.1124.2'.

I'll try to make a standalone test, but this is the gist of the problem code:

__global struct MyStruct *m = (__global struct MyStruct *)(basePtr + offset);

if(m->magic != 123)
{
    // ... dump debug diagnostics to global memory (this never happens)
    return;
}

if(...)
{
    // loads + arithmetic
    // no stores, and no touching 'm'
}
else
{
    // loads + arithmetic
    // no stores, and no touching 'm'
}

if(m->magic != 123)
{
    // ... dump debug diagnostics to global memory (this always happens)
    return;
}

The result is that I get the dump the 2nd time I check m->magic, not the 1st. Nothing should be modifying global memory here. There's just the one kernel running, with clFinish before and after, and it's 100% reproducible.

I dumped 'basePtr', 'offset' and 'm' and I can see m is corrupt (m != basePtr + offset).

himanshu_gautam · ‎04-04-2013

hi vmiura,

What is the type of basePtr?

Shouldn't this statement increment basePtr based on its original type rather than the new type? This might be the issue.

__global struct MyStruct *m = (__global struct MyStruct *)(basePtr + offset);

Anyway, please share the test case if the issue persists.

vmiura · ‎04-04-2013

Hi Himanshu,

basePtr is __global uchar *. The offset in bytes is as intended.
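To illustrate the point about byte offsets, here is a small plain C sketch (not the original kernel code; `struct Rec`, `at_byte_offset` and `at_struct_index` are hypothetical names). It assumes the buffer is suitably aligned for the struct. Because the addition happens on an `unsigned char *` before the cast, the offset is counted in bytes rather than in units of the struct size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for MyStruct; only its size (68 bytes) matters here. */
struct Rec {
    uint32_t words[17];
};

/* Add first, then cast: the arithmetic is done on unsigned char *,
   so the pointer moves by exactly `offset` bytes. */
static struct Rec *at_byte_offset(unsigned char *base, size_t offset)
{
    assert(base != NULL);
    return (struct Rec *)(base + offset);
}

/* Cast first, then add: the pointer moves by `count` whole structs,
   that is count * sizeof(struct Rec) bytes. */
static struct Rec *at_struct_index(unsigned char *base, size_t count)
{
    assert(base != NULL);
    return (struct Rec *)base + count;
}
```

With the first form, an offset of 68 lands 68 bytes past the base; with the second form, an index of 2 lands 136 bytes past it.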

Do you know if I can view the ISA disassembly for Bonaire somehow? It would help me confirm if it's odd code or if it might be something else.

Regards,

Victor

himanshu_gautam · ‎04-04-2013

Does your structure contain doubles? I suspect some alignment issue here, or an alignment mismatch between host and GPU code.

What is the data type of "magic"?

Can you please publish your structure? We are more interested in the data types than the actual names. So, if it is secret, just remove the field names (except magic) and let us know.

Also, if you are accessing the structure many times, you will be better off using a structure of arrays instead of an array of structures. The former will make sure your memory accesses are coalesced.

vmiura · ‎04-04-2013

I'm not using doubles.

Here is the structure (with names changed).

struct MyStruct
{
    union
    {
        struct
        {
            unsigned int magic;
            short a, b;
            short c, d;
            unsigned int e;
            unsigned int f;
            unsigned int g;
            int h, i, j;
            short k, l, m, n;
            unsigned char o;
            unsigned char p;
            unsigned char q, r, s, t;
            unsigned char u, v;
            unsigned char w;
            unsigned char x;
            unsigned char y;
            unsigned char z, a0;
            unsigned char b0,
            unsigned char d0;
            unsigned char e0;
            unsigned char f0;
            unsigned char g0[2];
            unsigned char h0, i0, j0;
            unsigned char k0;
        };
        unsigned int raw[17];
    };
};

Actually I added "magic" so that I could track when the struct was bad. Originally I found out that my kernel was reading invalid data for field 'f', so then I added 'magic' to the head of the struct and checks in the kernel to make sure I was reading valid initialized data.

I suspect the pointer, rather than the value of the struct. I think the pointer register has gotten clobbered by earlier code. If I remove the earlier code then it works.

vmiura · ‎04-04-2013

All threads in my wavefront read from the same element in this struct so coalescing works fine I think.

What would be nice is to have a __scalar decorator added to OpenCL so that we can make full use of scalars and wavefront-constant branching in GCN.

himanshu_gautam · ‎04-04-2013

I think I am seeing some issues in your struct. I will get back on this....

Especially with its size and the alignment of uints (which require 4-byte alignment)....

Within the structure, the alignment is fine. But your structure size does not seem to be a multiple of "sizeof(unsigned int)".

You may need to increase the size of the "uint array". Can you just check if your code works fine if you use "20" instead of "17"?

vmiura · ‎04-04-2013

By the way it should have been:

> unsigned char b0,c0;

Thanks for the idea. I have already tried that, as I suspected a sizeof difference between host and GPU, but it doesn't seem to be that. The sizeof(MyStruct) following standard alignment rules is 68, which is 17 uints. Also, the base of the struct in __global mem is 16-byte aligned.
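The size claim can be checked on the host with compile-time assertions. The following C11 sketch copies the field types from the posted struct (with the `b0, c0` correction) and assumes a typical ABI with 4-byte `int` and 2-byte `short`; it says nothing about what the GPU compiler actually does.

```c
#include <assert.h>
#include <stddef.h>

/* Field types copied from the posted struct; anonymous union/struct need C11. */
struct MyStruct {
    union {
        struct {
            unsigned int magic;
            short a, b;
            short c, d;
            unsigned int e;
            unsigned int f;
            unsigned int g;
            int h, i, j;
            short k, l, m, n;
            unsigned char o;
            unsigned char p;
            unsigned char q, r, s, t;
            unsigned char u, v;
            unsigned char w;
            unsigned char x;
            unsigned char y;
            unsigned char z, a0;
            unsigned char b0, c0;
            unsigned char d0;
            unsigned char e0;
            unsigned char f0;
            unsigned char g0[2];
            unsigned char h0, i0, j0;
            unsigned char k0;
        };
        unsigned int raw[17];
    };
};

/* 44 bytes of int/short fields + 24 unsigned chars = 68 bytes = 17 uints. */
static_assert(sizeof(struct MyStruct) == 68, "unexpected struct size");
static_assert(sizeof(struct MyStruct) == 17 * sizeof(unsigned int),
              "struct and raw[] view differ in size");
static_assert(offsetof(struct MyStruct, magic) == 0, "magic must come first");
```

If the same assertions hold when the struct is compiled on the host, a host/GPU size mismatch becomes a less likely explanation, which matches what was observed here.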

When I hit the error check code here I grab a global atomic and dump the value of several registers to global memory which I then print on console.

if(m->magic != 123)
{
    // ... dump debug diagnostics to global memory (this always happens)
    return;
}

I am saving the values of 'm', 'basePtr', and 'offset'.

I get something like:

basePtr = 0x84000060

offset = 0x0000240

m = 0xffffffff /// !!?

m was correct when I first initialized it, and it's clobbered with -1 after the if() else statement in the middle.
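The "grab a global atomic and dump registers" pattern can be sketched on the host in plain C11. This is only an analogue, not the kernel code: `DumpBuffer`, `dump_record` and the slot sizes are hypothetical, and in a kernel the counter increment would presumably be something like `atomic_inc` on a `__global` counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define DUMP_SLOTS     16   /* hypothetical capacity of the debug buffer */
#define WORDS_PER_SLOT 3    /* m, basePtr, offset */

struct DumpBuffer {
    atomic_uint next;                            /* slots claimed so far */
    uint32_t words[DUMP_SLOTS][WORDS_PER_SLOT];  /* one row per report */
};

/* Claim a slot with an atomic increment, then write the values into it.
   Returns 1 if the values were recorded, 0 if the buffer was already full. */
static int dump_record(struct DumpBuffer *buf,
                       uint32_t m, uint32_t base_ptr, uint32_t offset)
{
    assert(buf != NULL);
    unsigned slot = atomic_fetch_add(&buf->next, 1u);
    if (slot >= DUMP_SLOTS)
        return 0;
    buf->words[slot][0] = m;
    buf->words[slot][1] = base_ptr;
    buf->words[slot][2] = offset;
    return 1;
}
```

Because each caller gets a unique slot index from the atomic increment, concurrent reporters do not overwrite each other's rows, and the host can print the first `next` rows after the kernel finishes.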

himanshu_gautam · ‎04-04-2013

vmiura wrote:
Do you know if I can view the ISA disassembly for Bonaire somehow? It would help me confirm if it's odd code or if it might be something else.
Regards,
Victor

Well, I guess CodeXL/Kernel Analyzer is the only way to get the kernel ISA. If that doesn't help, please share a test case; I can try to run it here and confirm whether the issue is reproducible.

vmiura · ‎04-04-2013

Hello,

I could look at the ISA using -save-temps.

The problem actually seems to be related to the "return" statement. I think the compiler must have trouble with a conditional return inside a do { } while loop.

Here's the overall control flow of my kernel.

__kernel void foo(__global unsigned char *basePtr, ...)
{
    if()
    {
        do
        {
            __global MyStruct *myStruct = (__global MyStruct *)(basePtr + offset);

            if(x)
            {
                ...
            }
            else
            {
                ...
            }

            if(myStruct->magic != expectedVal)
            {
                // Dump vars to global buffer
                return; // <--- this return is mucking things up
            }

            while()
            {
                if()
                {
                    ...
                }

                if()
                {
                    switch()
                    {
                    case: ...
                        break;
                    case: ...
                        break;
                    case: ...
                        break;
                    default: ...
                        break;
                    }

                    if()
                    {
                        ...
                    }
                }
            }
        } while();
    }
}

Basically I get the debug dump if I have the "return" statement there. If I comment out the "return" then it never hits that code.

Sadly, this just shows why my debug code is not working as expected; it doesn't show why I'm getting my original bug on the 7790.

vmiura · ‎04-04-2013

Hello,

I found the bug and I have a workaround.

I have some code that does:

dstColor = (pkColor & (~fbmask)) | (dstColor & fbmask);

The ISA disassembler shows the compiler cleverly used v_bfi_b32, which implements vdst = (vsrc1 & vselect1) | (vsrc2 & ~vselect1), but it has the registers mixed up. I get the opposite result of what it should be.

If I instead use dstColor = bitselect(pkColor, dstColor, fbmask) then it works correctly.

So... I think there's some bug in whatever peephole optimizer is generating v_bfi_b32. I will try to make a small test case.
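The equivalence between the manual blend, `bitselect`, and the bitfield-insert form can be checked on the host. The plain C sketch below uses hypothetical helper names and made-up input values; `bitselect_u32` follows the OpenCL definition (bit from `a` where `c` is 0, from `b` where `c` is 1), and `bfi_u32` follows the formula quoted above with the select mask as the first operand.

```c
#include <assert.h>
#include <stdint.h>

/* The blend written out by hand, as in the post. */
static uint32_t blend_manual(uint32_t pk, uint32_t dst, uint32_t fbmask)
{
    return (pk & ~fbmask) | (dst & fbmask);
}

/* Scalar model of OpenCL bitselect(a, b, c). */
static uint32_t bitselect_u32(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* Scalar model of the bitfield-insert form: (s1 & s0) | (s2 & ~s0). */
static uint32_t bfi_u32(uint32_t s0, uint32_t s1, uint32_t s2)
{
    return (s1 & s0) | (s2 & ~s0);
}

/* Compare the forms on one sample input (values chosen for illustration). */
static void check_blend_forms(void)
{
    const uint32_t pk = 0x11223344u, dst = 0xAABBCCDDu, fbmask = 0xFF00FF00u;
    const uint32_t want = blend_manual(pk, dst, fbmask);

    assert(want == 0xAA22CC44u);
    assert(bitselect_u32(pk, dst, fbmask) == want);
    assert(bfi_u32(fbmask, dst, pk) == want);   /* correct operand order */
    assert(bfi_u32(fbmask, pk, dst) != want);   /* swapped sources: inverted blend */
}
```

Swapping the two source operands gives the blend with the mask inverted, which is consistent with the "opposite result" described above, though the sketch cannot confirm what the compiler actually emitted.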

Thanks,

Victor

himanshu_gautam · ‎04-05-2013

Thanks for the update.

Please send the testcase, and we will make sure that the compiler does not do this invalid optimization anymore.

vmiura · ‎04-05-2013

Please see test case and results. It reproduces the problem on APP SDK Runtime 10.0.1124.2, but it's OK on 10.0.1084.4.

kernB is how the output should look.

You can see that in kernA instead of...

d = (a & (~c)) | (b & c)

I think it's doing...

d = (c & (~c)) | (a & c)

The problem is at the IL stage.
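One consequence of the suspected form is that `c & ~c` is always zero, so the result collapses to `a & c` and no longer depends on `b` at all. The plain C sketch below (hypothetical function names, made-up inputs) shows a cheap diagnostic built on that: flip every bit of `b` and see whether `d` changes.

```c
#include <assert.h>
#include <stdint.h>

/* The source expression: d = (a & ~c) | (b & c). */
static uint32_t expected_d(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* The suspected miscompiled form: d = (c & ~c) | (a & c), i.e. a & c. */
static uint32_t suspected_d(uint32_t a, uint32_t b, uint32_t c)
{
    (void)b; /* b does not contribute to the suspected form */
    return (c & ~c) | (a & c);
}

/* Returns 1 if flipping every bit of b changes the result of f. */
static int depends_on_b(uint32_t (*f)(uint32_t, uint32_t, uint32_t),
                        uint32_t a, uint32_t b, uint32_t c)
{
    return f(a, b, c) != f(a, ~b, c);
}

/* Sample check with illustrative values. */
static void check_suspected_form(void)
{
    const uint32_t a = 0x0F0F0F0Fu, b = 0xFFFF0000u, c = 0x00FFFF00u;

    assert(expected_d(a, b, c) == 0x0FFF000Fu);
    assert(suspected_d(a, b, c) == (a & c));
    assert(depends_on_b(expected_d, a, b, c) == 1);
    assert(depends_on_b(suspected_d, a, b, c) == 0);
}
```

Running the same flip-`b` comparison against the real kernel output would be one way to test the hypothesis without reading the IL.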

Thanks,

Victor

himanshu_gautam · ‎04-05-2013

Thanks for your time and help! Will fwd to AMD engineering.

twintip31 · ‎05-26-2013

Hi Himanshu/Vmiura,

I hit exactly the same compiler issue yesterday on a Tahiti GPU (please look at http://devgurus.amd.com/thread/166777), with the same kind of operation used in the MD5 algorithm.

I hope the kernel compiler will be fixed soon. In the meantime, bitselect looks like a simpler workaround than the one I found.

Do you know where I can get a spec of the kernel IL/ISA assembler opcodes and instructions?

I work on embedded systems on other architectures and I am used to coping with compiler issues, so having an instruction/opcode reference plus CodeXL for debugging would be helpful in the future!

Regards

David

himanshu_gautam · ‎05-27-2013

You should be able to get the AMD IL spec, as well as ISA docs for all AMD GPUs, at http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-...

Have you checked the official CodeXL documentation? If you have suggestions, please raise an issue in the CodeXL-specific forum area.

twintip31 · ‎05-27-2013

Thanks, I did find what I was looking for:

http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.p...

http://developer.amd.com/wordpress/media/2012/10/AMD_Intermediate_Language_(IL)_Specification_v2.pdf

Thanks to the workarounds, my MD5 algorithm is now working on the GPU. I am now in the process of adding optimizations and parallelizing digest encoding.

Regards

David

OpenCL bug with HD 7790 (Bonaire)