I get wrong results from a simple prefix sum in the 32-bit build on Fiji. The 64-bit build works, and Tahiti works with both.
I created a minimal project to reproduce:
GitHub - JoeJGit/OpenCL_Fiji_Bug_Report: Expose a 32bit driver bug
The project is in the zip file (I don't know how to upload a complete directory).
Please take a look. I have a similar problem with Vulkan but have not yet been able to reproduce it in a small test. See bug report.txt for details.
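For anyone who wants to check GPU output independently, a serial CPU reference is a handy baseline. This is only a minimal sketch: it assumes an inclusive scan over 32-bit unsigned values, which may not match what the kernel in the repo computes, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Serial inclusive prefix sum: out[i] = in[0] + ... + in[i].
   ASSUMPTION: inclusive scan over uint values; the kernel in the repo
   may be exclusive. Unsigned addition wraps modulo 2^32, like uint in
   OpenCL C. */
static void prefix_sum_reference(const uint32_t *in, uint32_t *out, size_t n)
{
    uint32_t running = 0;
    for (size_t i = 0; i < n; i++) {
        running += in[i];
        out[i] = running;
    }
}

/* Returns the index of the first element where the GPU result differs
   from the reference, or n if both buffers match. */
static size_t first_mismatch(const uint32_t *gpu, const uint32_t *ref, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (gpu[i] != ref[i])
            return i;
    }
    return n;
}
```

Reporting the first mismatching index rather than a plain pass/fail makes it easier to see whether errors start at a wavefront boundary.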
Hi Joe,
Thanks for reporting the issue.
It seems to be a compiler optimization issue. I can see the same error even with the x64 build on my Hawaii card. After some experiments, I found the following workarounds:
1) Disable optimization during the kernel build, i.e. pass the optimization flag "-O0" or "-cl-opt-disable"
OR
2) In the PrefixSum() kernel (prefix_sum.cl), declare the following variables inside the loop as "volatile":
for (uint step = 0; step < 8; step++)
{
    uint mask = ..
    uint rd_id = ...
    uint wr_id = ...
    ....
}
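Workaround 2 could look roughly like the sketch below. It is plain C that emulates one workgroup on the CPU so it can be run anywhere. The index formulas are a hypothetical Sklansky-style pattern (128 work-items scanning 256 values in 8 steps) and are NOT taken from prefix_sum.cl; only the placement of "volatile" inside the loop mirrors the suggestion.

```c
#include <stdint.h>

#define GROUP_ITEMS   256u  /* values scanned by one emulated workgroup */
#define GROUP_THREADS 128u  /* each work-item updates one value per step */

/* In-place inclusive scan of 256 values, emulating 128 work-items.
   ASSUMPTION: the mask / rd_id / wr_id formulas are illustrative only,
   not the ones elided in the original kernel. */
static void scan_group_emulated(uint32_t lds[GROUP_ITEMS])
{
    for (uint32_t step = 0; step < 8; step++) {
        /* On the GPU, a barrier(CLK_LOCAL_MEM_FENCE) would separate the
           steps; here the inner loop plays the role of the work-items. */
        for (uint32_t lid = 0; lid < GROUP_THREADS; lid++) {
            /* Declared volatile, as in workaround 2. */
            volatile uint32_t mask  = (1u << step) - 1u;
            volatile uint32_t rd_id = ((lid >> step) << (step + 1u)) + mask;
            volatile uint32_t wr_id = rd_id + 1u + (lid & mask);
            lds[wr_id] += lds[rd_id];
        }
    }
}
```

Within a step, every read index lies in the lower half of its block and every write index in the upper half, so running the work-items one after another gives the same result as running them in parallel.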
Could you please try the above workarounds and share your observations?
Regards,
Hi Dipak, yes, both suggested workarounds work for me.
Disabling optimization also solved another issue that had just shown up in x64 for me too.
That's a nice stress test because it's a graphics app processing 60000 workgroups per frame, so I can see that it stays stable with big workloads over time.
Workgroups are either 64, 128 or 256 threads wide, and I need to disable the optimizer only for the 128 and 256 groups.
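Selecting the flag per workgroup width could be sketched like this on the host side. The helper name is hypothetical, and it assumes each width is compiled as its own cl_program, since build options apply to a whole program rather than to a single kernel.

```c
#include <stddef.h>

/* Returns the build-options string for a kernel variant of the given
   workgroup width. ASSUMPTION: one cl_program per width, so the option
   can differ between variants. */
static const char *build_options_for(size_t workgroup_size)
{
    /* Only the 128- and 256-wide groups needed the workaround here. */
    if (workgroup_size == 128 || workgroup_size == 256)
        return "-cl-opt-disable";
    return "";
}

/* Usage with the OpenCL host API (needs the OpenCL headers, so it is
   shown as a comment only):
     clBuildProgram(program, 1, &device, build_options_for(256), NULL, NULL);
*/
```

This keeps the optimizer enabled for the 64-wide variant, which did not need the workaround.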
Let me know if you want me to track down the origin of this different bug; maybe I can create a second test case.
Do you think the same compiler issue can explain similar bugs in Vulkan?
In Vulkan the behaviour is very different: No bugs show up for about 10 - 30 frames, then they start popping up with increasing frequency.
And if I remember correctly, bugs also happen with a workgroup size of 64, so it's not necessarily just a wavefront sync issue.
EDIT:
The Vulkan bug has magically disappeared. I did not change the shader and can't remember any relevant changes in the project - no clue why it's gone.
My guess is that another shader, which ran earlier and has since been replaced, may have caused some kind of corruption - or something completely different...
Thank you Joe for the confirmation. I'll open a ticket for that optimization issue.
"Let me know if you want me to track down the origin of this different bug, maybe i can create a second test case."
Sure, you can share the test case. I would encourage you to create a new thread for the second one if it's a different bug. That would help us track it in the future.
Regards,
It turned out my second OpenCL bug was my own fault.
One more issue, maybe not worth its own thread:
I often use half floats to compress data in LDS.
Mostly this gives me a speedup close to 2 as it helps to increase occupancy.
But on complex shaders VGPR usage increases by large numbers (according to CodeXL), and compression causes a slowdown.
It looks like the compiler does not free the temporary registers used for the conversion by convert_float4.
The same code in Vulkan still shows an improvement from compression, and the Vulkan shader is 4 times faster than the OpenCL kernel.
(Vulkan is generally faster, but usually only by about 10-20%.)
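The trade-off can be illustrated with a rough occupancy estimate. This is a simplified sketch, not a profiler replacement: it assumes GCN-style limits (4 SIMDs per CU, at most 10 wavefronts per SIMD, 256 VGPRs per lane, 64 KB of LDS per CU) and ignores SGPRs and allocation granularity. The function name and any register counts fed into it are hypothetical, not measurements from CodeXL.

```c
#include <stdint.h>

/* ASSUMPTION: GCN-style per-CU limits; other hardware differs. */
enum {
    SIMDS_PER_CU       = 4,
    MAX_WAVES_PER_SIMD = 10,
    VGPRS_PER_LANE     = 256,
    LDS_BYTES_PER_CU   = 65536,
    WAVE_SIZE          = 64
};

static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/* Estimated resident wavefronts per SIMD, limited by VGPR count and by
   LDS usage per workgroup. A value of 0 means "resource not used". */
static uint32_t waves_per_simd(uint32_t vgprs,
                               uint32_t lds_bytes_per_group,
                               uint32_t group_size)
{
    uint32_t by_vgpr = MAX_WAVES_PER_SIMD;
    if (vgprs > 0)
        by_vgpr = min_u32(MAX_WAVES_PER_SIMD, VGPRS_PER_LANE / vgprs);

    uint32_t by_lds = MAX_WAVES_PER_SIMD;
    if (lds_bytes_per_group > 0) {
        uint32_t waves_per_group = (group_size + WAVE_SIZE - 1) / WAVE_SIZE;
        uint32_t groups_per_cu   = LDS_BYTES_PER_CU / lds_bytes_per_group;
        /* Waves are spread over the SIMDs; round up to the fullest one. */
        by_lds = (groups_per_cu * waves_per_group + SIMDS_PER_CU - 1)
                 / SIMDS_PER_CU;
    }
    return min_u32(by_vgpr, by_lds);
}
```

With made-up numbers for a 256-wide group, waves_per_simd(48, 32768, 256) gives 2 and halving the LDS footprint with waves_per_simd(48, 16384, 256) gives 4, which matches the near-2x speedup. If the conversion pushed VGPR usage to 128, waves_per_simd(128, 16384, 256) drops back to 2 and the gain is gone.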