I found the bug and I have a workaround.
I have some code that does:
dstColor = (pkColor & (~fbmask)) | (dstColor & fbmask);
The ISA disassembler shows that the compiler cleverly used v_bfi_b32, which implements vdst = (vsrc1 & vselect1) | (vsrc2 & ~vselect1), but it has the registers mixed up, so I get the opposite of the result I should get.
If I instead use dstColor = bitselect(pkColor, dstColor, fbmask) then it works correctly.
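To make the operand order concrete, here is a minimal host-side sketch in plain C rather than OpenCL C. The helper names (bitselect_u32, bfi_u32, masked_write, bfi_swapped) are hypothetical; bfi_u32 only models the v_bfi_b32 formula quoted above and is not taken from the ISA document. It shows that the original expression, the bitselect form, and a correctly ordered bit-field insert all agree, while swapping the two source operands gives the inverted result.

```c
#include <assert.h>
#include <stdint.h>

/* Model of OpenCL bitselect(a, b, c) for a 32-bit scalar:
   each result bit comes from a where c is 0 and from b where c is 1. */
static uint32_t bitselect_u32(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* Model of the v_bfi_b32 formula quoted in the post (an assumption,
   not checked against hardware):
   vdst = (vsrc1 & vselect1) | (vsrc2 & ~vselect1). */
static uint32_t bfi_u32(uint32_t vselect1, uint32_t vsrc1, uint32_t vsrc2)
{
    return (vsrc1 & vselect1) | (vsrc2 & ~vselect1);
}

/* The masked framebuffer write from the post, written out directly. */
static uint32_t masked_write(uint32_t pkColor, uint32_t dstColor, uint32_t fbmask)
{
    return (pkColor & ~fbmask) | (dstColor & fbmask);
}

/* Same instruction with the two source operands swapped; this models
   the kind of mix-up described, not the compiler's actual output. */
static uint32_t bfi_swapped(uint32_t pkColor, uint32_t dstColor, uint32_t fbmask)
{
    return bfi_u32(fbmask, pkColor, dstColor);
}

/* Checks that the three correct forms agree on one sample input. */
static void check_forms_agree(void)
{
    uint32_t pk = 0x11223344u, dst = 0xAABBCCDDu, mask = 0x00FF00FFu;
    uint32_t want = masked_write(pk, dst, mask);
    assert(bitselect_u32(pk, dst, mask) == want);
    assert(bfi_u32(mask, dst, pk) == want);
}
```

With pkColor = 0x11223344, dstColor = 0xAABBCCDD and fbmask = 0x00FF00FF, the correct forms give 0x11BB33DD and the swapped form gives 0xAA22CC44.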
So... I think there's some bug in whatever peephole optimizer is generating v_bfi_b32. I will try to make a small test case.
Please see test case and results. It reproduces the problem on APP SDK Runtime 10.0.1124.2, but it's OK on 10.0.1084.4.
kernB is how the output should look.
You can see that in kernA instead of...
d = (a & (~c)) | (b & c)
I think it's doing...
d = (c & (~c)) | (a & c)
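A host-side reference check makes this kind of miscompile easy to spot in a test case. The sketch below is plain C with hypothetical names (expected_d, suspected_d, count_mismatches); it assumes the kernel output has already been read back into an array d. The suspected expression is only the guess stated above, and since c & ~c is always zero it reduces to a & c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The result the kernel is meant to produce: bits of a where c is 0,
   bits of b where c is 1. */
static uint32_t expected_d(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* The expression guessed for the broken kernel. (c & ~c) is always 0,
   so this is just a & c and b is never used. */
static uint32_t suspected_d(uint32_t a, uint32_t c)
{
    return (c & ~c) | (a & c);
}

/* Counts the elements of d that differ from the host reference. */
static size_t count_mismatches(const uint32_t *a, const uint32_t *b,
                               const uint32_t *c, const uint32_t *d,
                               size_t n)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; ++i) {
        if (d[i] != expected_d(a[i], b[i], c[i])) {
            ++bad;
        }
    }
    return bad;
}
```

Running count_mismatches over the read-back buffers of both kernels should return 0 for a correct build and a nonzero count for an affected one, provided the inputs contain values where the two expressions differ.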
The problem is at the IL stage.
Yesterday I hit exactly the same compiler issue on a Tahiti GPU (please look at http://devgurus.amd.com/thread/166777), with the same kind of operation as used in the MD5 algorithm.
I hope the kernel compiler will be fixed soon. In the meantime, bitselect looks like another workaround, and a simpler one than the one I found.
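For reference, the MD5 round functions F and G defined in RFC 1321 have exactly this select-by-mask shape, so they can be written in bitselect form. The sketch below is plain host-side C with hypothetical helper names; which MD5 function triggered the problem in the linked thread is not stated here, so treat this only as an illustration of the pattern. In an OpenCL kernel the built-in bitselect would take the place of bitselect_u32.

```c
#include <assert.h>
#include <stdint.h>

/* Model of OpenCL bitselect(a, b, c) for a 32-bit scalar:
   bits of a where c is 0, bits of b where c is 1. */
static uint32_t bitselect_u32(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* MD5 F as defined in RFC 1321: F(X,Y,Z) = (X & Y) | (~X & Z). */
static uint32_t md5_f(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) | (~x & z);
}

/* MD5 G as defined in RFC 1321: G(X,Y,Z) = (X & Z) | (Y & ~Z). */
static uint32_t md5_g(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & z) | (y & ~z);
}

/* F in bitselect form: x is the mask, selecting y where set, z elsewhere. */
static uint32_t md5_f_select(uint32_t x, uint32_t y, uint32_t z)
{
    return bitselect_u32(z, y, x);
}

/* G in bitselect form: z is the mask, selecting x where set, y elsewhere. */
static uint32_t md5_g_select(uint32_t x, uint32_t y, uint32_t z)
{
    return bitselect_u32(y, x, z);
}
```

Writing the round functions this way asks for the select operation explicitly instead of relying on the compiler to recognize the and/or pattern.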
Do you know where I can get a spec for the kernel IL/ISA assembler opcodes and instructions?
I work on embedded systems on other architectures and I am used to coping with compiler issues, so for me having instruction/opcode explanations plus the use of CodeXL for debugging would be helpful in the future!
You should be able to get the AMD IL spec, as well as ISA docs for all AMD GPUs, at http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-...
Have you checked the official CodeXL documentation? If you have suggestions, please raise an issue in the CodeXL-specific forum area.
Thanks, I indeed found what I was looking for :
Thanks to the workarounds, my MD5 algorithm is now working on the GPU. I am now in the process of adding optimizations and parallelizing the digest encoding.