Hi,
I have come across a possible compiler issue, which is present in the attached kernels. The two kernels differ in how work is mapped to the threads. Both require a work-group size of 256; the difference between the kernels is
210,211c210,211
< int divIdx = get_local_id(0) / 8;
< int modIdx = get_local_id(0) % 8;
---
> int divIdx = get_local_id(0) % 32;
> int modIdx = get_local_id(0) / 32;
For the first version I get correct, reproducible results from a Radeon 6550 for the PowerBornBBKernel, as well as from an R9 290X. For the latter version I still get correct, reproducible results from the 6550. However, I get incorrect results from the R9 290X, and they differ between runs of the kernel with the same input.
Name: Hawaii
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 1.2
Driver version: 1573.4 (VM)
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (1573.4)
Name: BeaverCreek
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 1.2
Driver version: 1214.3
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (1214.3)
Hi,
Please provide a test project (with host code and, if required, a data set) so that we can run these two kernels at our end and verify the results; please also suggest how to compare the results (i.e., correct or not). A simpler test case with the same behavior would be very helpful for us. Please also let us know other setup details such as OS, SDK, and driver version.
Regards,
Hi,
Due to licensing I am not able to share the original code, and I haven't been able to extract a test case with the same behaviour so far. Meanwhile, I have more detailed output from the correct and incorrect runs, which you may want to look at, because I don't really understand the behaviour I observe there. The output comes from two additional print statements in the kernel; see the attached modified kernels. From the output it seems that either the part of the work group with modIdx == 6 does not execute, or the other part does not respect the barriers.
Also note that the code fails to run Test 1 but passes Test 2 using the same kernel file.
The system producing the incorrect results runs:
lsb_release -a
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.0.1406 (Core)
Release: 7.0.1406
Codename: Core
Name: Hawaii
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 1.2
Driver version: 1573.4 (VM)
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (1573.4)
fglrxinfo -display
display: screen: 0
OpenGL vendor string: Advanced Micro Devices, Inc.
OpenGL renderer string: AMD Radeon R9 200 Series
OpenGL version string: 4.4.13084 Compatibility Profile Context 14.301.1001
Here is a "bare bones" version of the kernel. I hope it does not have an obvious sync bug... and that it reproduces on your end. The nested ifs and this kind of thread assignment are used to fight the generous register usage of the original kernel.
I expect the resulting array to hold values close to 255; in fact it contains mostly 17 and 19, with some rare cases of other combinations.
Run this with offset 0, global size 256, local size 256.
I've tried it on a 260X, Win7 64-bit machine with recent CodeXL and Catalyst packages.
CodeXL 1.5.6571.0
Catalyst: 14.9, downloaded from amd.com today.
__kernel void
BBKernel(__global float* result)
{
    __local float4 level3[64];

    int threadIdx = get_local_id(0);
    if (threadIdx < 64)
    {
        level3[threadIdx].x = 0;
        level3[threadIdx].y = 0;
        level3[threadIdx].z = 0;
        level3[threadIdx].w = 0;
    }
    barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

    const unsigned int atomCnt = 288;
    int divIdx = (threadIdx & 0x1f);
    int modIdx = (threadIdx >> 5);

    for (unsigned int atomid = 0; atomid < atomCnt; atomid += 32)
    {
        if (modIdx < 4)
        {
            if (modIdx < 2)
            {
                if (modIdx == 0)
                    level3[divIdx].x = 1;
                else
                    level3[divIdx].y = 2;
            }
            else
            {
                if (modIdx == 2)
                    level3[divIdx].z = 4;
                else
                    level3[divIdx].w = 8;
            }
        }
        else
        {
            if (modIdx < 6)
            {
                if (modIdx == 4)
                    level3[32 + divIdx].x = 16;
                else
                    level3[32 + divIdx].y = 32;
            }
            else
            {
                if (modIdx == 6)
                    level3[32 + divIdx].z = 64;
                else
                    level3[32 + divIdx].w = 128;
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

        if (threadIdx < 32)
        {
            float br = ( level3[threadIdx].x
                       + level3[threadIdx].y
                       + level3[threadIdx].z
                       + level3[threadIdx].w
                       + level3[32 + threadIdx].x
                       + level3[32 + threadIdx].y
                       + level3[32 + threadIdx].z
                       + level3[32 + threadIdx].w );
            result[atomid + threadIdx] = br;
        }
        barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
    }
}
Hi,
We finally have a reproducible test case with host code: it produces the expected results on the CPU but incorrect results on the GPU. Interestingly, the thread distribution
int divIdx = get_local_id(0) / 8;
int modIdx = get_local_id(0) % 8;
which works for the larger complete kernel, also fails in the simple test case.
Hi,
Thanks for posting the reproducible test case. I'm on vacation, so I'll try to test it next week. Meanwhile, if you have any updates, please post.
Regards,
Hi,
Any update on this? Have you been able to reproduce the bug on your side?
My apologies for the delay.
I ran your code on the following setups: 1) CPU, 2) Devastator (integrated GPU), and 3) Capeverde (HD 7770). Please find the output for each setup attached herewith.
I believe the outputs of setups 1 and 2 are okay, but not setup 3. The outputs of both kernels for setup 3 are different, and neither matches setups 1 and 2.
Can I assume setup 3 is the erroneous case that you wanted to point out? If so, I'll file an internal bug report against this test case.
Note:
Machine setup:
AMD A10-6800K APU
HD 7770 discrete GPU
Windows 7 (64 bit)
Catalyst Driver: 14.9.2 beta
APP SDK 2.9-1
Regards,
Thanks for looking into this.
Yes, the 7770 output is wrong. I have also seen a different kind of wrong output on Bonaire and the 290X.
Thanks for this confirmation. An internal bug report has been filed against this issue. I'll let you know if I get any update.
Regards,
In what driver version is this bug expected to be fixed?
We have the issue described in this thread: Kernel with local memory usage gives different results on some hardware, where the symptoms are very similar to those described here.
Similarly, older GPUs have no issues, while some GCN models fail to compute properly on work groups bigger than a single wavefront.
Barriers are involved. Please fix this issue ASAP; it was reported a few months ago...
Hi Raistmer,
As I checked, the bug has not been fixed yet. Our team is working on it. Sorry for the delay; please bear with us.
Regards,
I have reports that the bug affecting my app is fixed in 15.4 beta for Windows and 15.3 for Linux.
It would be good to check this one against those drivers too.
Thanks for the information. We will check this issue on 15.4 beta.
Hi,
Here is an update on this issue. The engineering team has identified the root cause: a race condition in which multiple work items modify the same vector at the same time. The compiler optimizes under the assumption that there is no race condition; if a program contains one, the result is undefined, depending on what optimization is done. It may work or it may not. To avoid such a race condition, one should use some form of prevention, such as a volatile type, atomic_store, or other atomic methods. So, please modify the code accordingly.
Regards,