I think I have encountered a bug in the OpenCL driver of the Evergreen family. Consider the following, trivial kernel:
float phi (float x) { return -log(tanh(x/2)); }
__kernel void bugTest( __global const float* rand, __global float* testA, __global float* testB, __local float *sumphibeta, __local short *minus, __local float *abs) {
int i = get_global_id(0);
minus=(rand<0)?-1:1;
abs=(minus==-1)?phi(-rand):phi(rand);
testA=abs;
sumphibeta=0;
testB=abs;
}
Then one would expect that testA=testB. However, it doesn't always work correct:
Given that this seems to happens consistently in the Evergreen family and not outside, and that it happens regardless of OS (32bit win7 or 64bit win7, at least) and that it is resolved by switching to CPU mode, makes me suspect that it is caused by a bug in the drivers, possibly the compiler.
I use JOCL as my front-end, so I have a Netbeans-project which triggers the bug. If you prefer plain C/C++, just run the above on an Evergreen GPU with
global_work_size = {73};
local_work_size = {73};
testA = new float[73];
testB = new float[73];
rand = {4.659722f, 9.79593f, 4.018818f, 14.205623f,
4.8614163f, 13.74736f, 7.9671173f, 7.1122317f, 2.437036f, 14.670018f,
7.5287275f, 5.4726443f, 8.878769f, 4.834208f, 2.4469802f, 15.271384f,
3.7141998f, 14.71361f, 13.636945f, 11.304411f, 11.787898f, 12.776969f,
10.551311f, 7.6994514f, 8.45033f, 7.945807f, 9.079653f, 13.230182f,
12.124534f, 8.8559675f, 11.232213f, 12.435405f, 10.152599f, 7.382857f,
6.679857f, 12.372585f, 6.143651f, 3.814587f, 6.479739f, 12.005309f,
11.770777f, 11.998981f, 11.120965f, 4.4589887f, 6.4073095f, 0.70655245f,
8.641274f, 13.5908375f, 3.871447f, 7.852525f, 9.202319f, 2.7622483f,
7.2694182f, 9.263704f, 3.6282501f, 5.9980154f, 9.159372f, 3.906349f,
8.7281275f, 9.331164f, 14.096231f, 12.455883f, 11.098762f, 2.8249257f,
8.641249f, 10.528001f, 8.406309f, 5.1229515f, 10.946483f, 10.861275f,
4.2477875f, 3.8476f, 5.520955f};
I hope I got to the right place with this bug report, since I find no other places on this site to report bugs...
--Peter
Solved! Go to Solution.
pwvdendr,
Thanks for the test case, I am able to reproduce this issue.
[Update:] I've root caused the bug to an issue in the shader compiler and have found a work-around. If you place a mem_fence(CLK_LOCAL_MEM_FENCE) after the line 'abs=(minus==-1)?phi(-rand):phi(rand);', then the program produces correct results by disabling the optimization that is mis-compiling the attached test case. Hope this helps.
Hi
I will check with shader compiler engineers.
If you can send us JOCL project with the source it will be much faster.
Thx Dov
I did, it was in the link, no? Perhaps not visible enough, here the link again: http://dl.dropbox.com/u/3060536/JOCLbug.rar
pwvdendr,
Thanks for the test case, I am able to reproduce this issue.
[Update:] I've root caused the bug to an issue in the shader compiler and have found a work-around. If you place a mem_fence(CLK_LOCAL_MEM_FENCE) after the line 'abs=(minus==-1)?phi(-rand):phi(rand);', then the program produces correct results by disabling the optimization that is mis-compiling the attached test case. Hope this helps.
Ok, great, I hope you can solve this issue. 🙂
Thanks for the workaround. Do you have any idea how long it takes approximately to fix such a bug? If it's not too long I'll just wait for it. Otherwise, can you give me some more details when exactly an additional mem_fence is needed and when not?
Usually it is a 3 month turn-around from bug fix to public release(not counting preview or one-off releases).
It looks like the mem fence will be needed when there is a conditional move with a result into local memory and then you read from the local memory location with the same index. A memory fence will be needed between the write and read.
The problem looks like it comes when the load is optimized away, it is selecting one of the operands of the ternary operator instead of the result.