4 Replies Latest reply on Aug 17, 2016 1:08 AM by joej

    OpenCL Driver Bug FuryX 32bit

    joej

      I get wrong results from a simple prefix sum in the 32-bit build on Fiji. The 64-bit build works, and Tahiti works with both.

       

      I created a minimal project to reproduce:

      GitHub - JoeJGit/OpenCL_Fiji_Bug_Report: Expose a 32bit driver bug

       

      The project is in the zip file (I don't know how to upload a complete directory).

      Please take a look. I have a similar problem with Vulkan but have not yet been able to reproduce it in a small test. See bug report.txt for details.

        • Re: OpenCL Driver Bug FuryX 32bit
          dipak

          Hi Joe,

          Thanks for reporting the issue.

          It seems to be a compiler optimization issue. I see the same error even with the x64 build on my Hawaii card. After some experiments, I found the following workarounds:

          1) Disable optimization during the kernel build, i.e. pass the build option "-O0" or "-cl-opt-disable".

          OR

          2) In the PrefixSum() kernel (prefix_sum.cl), declare the following variables inside the loop as "volatile":

          for (uint step = 0; step < 8; step++)
          {
              volatile uint mask = ...;
              volatile uint rd_id = ...;
              volatile uint wr_id = ...;
              ....
          }
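The loop body is elided above, so the following is only a hypothetical sketch of the kind of kernel under discussion: a serial C emulation of a common 8-step, 256-element inclusive scan in local memory (a Sklansky-style scan with 128 work-items, one add per work-item per step). The formulas for mask, rd_id and wr_id are assumptions, not the code from prefix_sum.cl, and prefix_sum_emulated, SCAN_N, SCAN_STEPS and THREADS are hypothetical names. The outer loop plays the role of the kernel's step loop, the inner loop stands in for the work-items, and the end of each outer iteration marks where barrier(CLK_LOCAL_MEM_FENCE) would sit.

```c
#include <assert.h>

#define SCAN_N     256u            /* elements per workgroup (assumed) */
#define SCAN_STEPS 8u              /* log2(SCAN_N): matches the 8-step loop */
#define THREADS    (SCAN_N / 2u)   /* hypothetical: 128 work-items */

/* Hypothetical serial emulation of an in-place inclusive scan in LDS.
 * Outer loop = the kernel's step loop; inner loop = the work-items. */
static void prefix_sum_emulated(unsigned lds[SCAN_N])
{
    for (unsigned step = 0u; step < SCAN_STEPS; step++) {
        for (unsigned lid = 0u; lid < THREADS; lid++) {
            /* 'volatile' mirrors workaround 2; on the CPU it merely stops
             * the compiler from optimizing these locals. */
            volatile unsigned mask  = (1u << step) - 1u;
            /* last element of the lower half of this work-item's block */
            volatile unsigned rd_id = ((lid >> step) << (step + 1u)) + (1u << step) - 1u;
            /* one element of the upper half of the same block */
            volatile unsigned wr_id = rd_id + 1u + (lid & mask);
            assert(wr_id < SCAN_N);
            lds[wr_id] += lds[rd_id];
        }
        /* barrier(CLK_LOCAL_MEM_FENCE) would sit here in the real kernel */
    }
}
```

In this sketch every write in a step targets the upper half of a block and every read comes from the lower half, so the work-items of one step never conflict; the result is only correct if all writes of one step are visible before the next step reads them, which is what the barrier (and an optimizer that respects it) has to guarantee.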

           

          Could you please try the above workarounds and share your observations?

           

          Regards,

            • Re: OpenCL Driver Bug FuryX 32bit
              joej

              Hi Dipak, yes, both suggested workarounds work for me.

               

              Disabling optimization also solved another issue that had just shown up for me in the x64 build.

              That's a good stress test because it's a graphics app processing 60,000 workgroups per frame, so I can see that it stays stable under big workloads over time.

              Workgroups are 64, 128 or 256 threads wide, and I need to disable the optimizer only for the 128- and 256-thread groups.

              Let me know if you want me to track down the origin of this other bug; maybe I can create a second test case.

               

              Do you think the same compiler issue can explain similar bugs in Vulkan?

              In Vulkan the behaviour is very different: no bugs show up for about 10-30 frames, then they start popping up with increasing frequency.

              And if I remember correctly, the bugs also happen with a workgroup size of 64, so it's not necessarily just a wavefront sync issue.
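A quick way to see why group size might matter, assuming 64-wide wavefronts on GCN hardware such as Fiji and Hawaii: a 64-thread group is a single wavefront, while 128- and 256-thread groups span 2 and 4 wavefronts and therefore depend on barrier() to order local-memory traffic between wavefronts. This is only an interpretation, not a diagnosis from the thread, and the Vulkan bugs at size 64 mentioned above suggest it is not the whole story. wavefronts_per_group is a hypothetical helper.

```c
#include <assert.h>

/* Assumption: GCN-class AMD GPUs run work-items in 64-wide wavefronts. */
#define WAVEFRONT_SIZE 64u

/* Hypothetical helper: how many wavefronts one workgroup occupies.
 * More than one means local-memory exchanges between wavefronts
 * rely on barrier() (and on the compiler honouring it). */
static unsigned wavefronts_per_group(unsigned group_size)
{
    assert(group_size > 0u);
    return (group_size + WAVEFRONT_SIZE - 1u) / WAVEFRONT_SIZE;
}
```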

               

               

              EDIT:

              The Vulkan bug has magically disappeared. I did not change the shader and can't remember any relevant changes in the project, so I have no idea why it's gone.

              My guess is that another shader, which executed before it and has since been replaced, may have caused some kind of corruption. Or it was something completely different...

                • Re: OpenCL Driver Bug FuryX 32bit
                  dipak

                  Thank you Joe for the confirmation. I'll open a ticket for that optimization issue.

                   

                  "Let me know if you want me to track down the origin of this different bug, maybe i can create a second test case."

                  Sure, you can share the test case. I would encourage you to create a new thread for the second one if it's a different bug; that would help us track it in the future.

                   

                  Regards,

                    • Re: OpenCL Driver Bug FuryX 32bit
                      joej

                      It turned out my second OpenCL bug was my own fault.

                       

                      One more issue, maybe not worth its own thread:

                      I often use half floats to compress data in LDS.
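To make the compression concrete, here is a CPU-side C sketch of the bit-level conversion involved: each 32-bit float is stored as a 16-bit IEEE 754 binary16 value, so a float4 that takes 16 bytes of LDS takes 8 bytes when packed as halves. Inside an OpenCL kernel this is normally done with built-ins such as vstore_half4 and vload_half4 (or convert_float4 on half types when cl_khr_fp16 is available); float_to_half and half_to_float are hypothetical helpers that only illustrate the encoding. Assumed simplification: inputs smaller in magnitude than the smallest normal half (2^-14) are flushed to signed zero instead of being encoded as subnormals; everything else rounds to nearest-even.

```c
#include <stdint.h>
#include <string.h>

/* float -> IEEE 754 binary16, round-to-nearest-even.
 * Simplification: inputs below 2^-14 in magnitude flush to signed zero. */
static uint16_t float_to_half(float f)
{
    uint32_t x;
    memcpy(&x, &f, sizeof x);                  /* reinterpret the bits safely */
    uint32_t sign = (x >> 16) & 0x8000u;
    uint32_t fexp = (x >> 23) & 0xFFu;
    uint32_t mant = x & 0x7FFFFFu;
    int32_t  hexp = (int32_t)fexp - 127 + 15;  /* re-bias the exponent */

    if (fexp == 0xFFu)                         /* Inf stays Inf, NaN stays NaN */
        return (uint16_t)(sign | 0x7C00u | (mant ? 0x200u : 0u));
    if (hexp >= 31)                            /* too large: overflow to Inf */
        return (uint16_t)(sign | 0x7C00u);
    if (hexp <= 0)                             /* too small: flush to zero */
        return (uint16_t)sign;

    uint32_t h    = ((uint32_t)hexp << 10) | (mant >> 13);
    uint32_t rest = mant & 0x1FFFu;            /* the 13 discarded mantissa bits */
    if (rest > 0x1000u || (rest == 0x1000u && (h & 1u)))
        h += 1u;                               /* carry may bump the exponent, up to Inf */
    return (uint16_t)(sign | h);
}

/* IEEE 754 binary16 -> float; exact, since every half fits in a float. */
static float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t hexp = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t x;

    if (hexp == 0x1Fu) {                       /* Inf / NaN */
        x = sign | 0x7F800000u | (mant << 13);
    } else if (hexp != 0u) {                   /* normal: re-bias by -15 + 127 */
        x = sign | ((hexp + 112u) << 23) | (mant << 13);
    } else if (mant == 0u) {                   /* signed zero */
        x = sign;
    } else {                                   /* subnormal half: renormalize */
        uint32_t k = 0u;
        while (!(mant & 0x400u)) { mant <<= 1; k++; }
        x = sign | ((113u - k) << 23) | ((mant & 0x3FFu) << 13);
    }
    float f;
    memcpy(&f, &x, sizeof f);
    return f;
}
```

For example, 256 float4 values occupy 4096 bytes, while the same values stored as four uint16_t each occupy 2048 bytes, which is what lets more workgroups share a compute unit's LDS.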

                      Mostly this gives me a speedup close to 2x, because it helps to increase occupancy.

                      But on complex shaders, VGPR usage increases by large amounts (according to CodeXL), and compression causes a slowdown.

                      It looks like the compiler does not free the temporary registers used for the conversion by convert_float4.
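The trade-off described above can be sketched with a rough occupancy estimate. Assumed GCN-class limits per compute unit (approximate figures from public GCN documentation, not from this thread): 64 KB of LDS, 4 SIMDs with up to 10 wavefronts each, 256 VGPRs per lane per SIMD allocated in blocks of 4, and 64-wide wavefronts; SGPRs and other caps are ignored. resident_waves_per_cu is a hypothetical helper, and the example inputs are made up to show the shape of the effect, not measured values.

```c
#include <assert.h>

/* Assumed GCN-class per-CU limits (approximate; SGPRs and other caps ignored). */
#define LDS_BYTES_PER_CU   65536u
#define SIMDS_PER_CU       4u
#define MAX_WAVES_PER_SIMD 10u
#define VGPRS_PER_LANE     256u
#define VGPR_GRANULE       4u
#define WAVE_SIZE          64u

/* Hypothetical helper: rough number of wavefronts resident on one CU. */
static unsigned resident_waves_per_cu(unsigned group_size,
                                      unsigned lds_bytes_per_group,
                                      unsigned vgprs_per_item)
{
    assert(group_size > 0u && vgprs_per_item > 0u);
    unsigned waves_per_group = (group_size + WAVE_SIZE - 1u) / WAVE_SIZE;

    /* VGPR limit: round the allocation up to the granule, then divide the file. */
    unsigned vgprs = (vgprs_per_item + VGPR_GRANULE - 1u) / VGPR_GRANULE * VGPR_GRANULE;
    unsigned waves_per_simd = VGPRS_PER_LANE / vgprs;
    if (waves_per_simd > MAX_WAVES_PER_SIMD)
        waves_per_simd = MAX_WAVES_PER_SIMD;
    unsigned groups_by_vgpr = waves_per_simd * SIMDS_PER_CU / waves_per_group;

    /* LDS limit: only whole workgroups fit. */
    unsigned groups_by_lds = lds_bytes_per_group
        ? LDS_BYTES_PER_CU / lds_bytes_per_group
        : groups_by_vgpr;

    unsigned groups = groups_by_vgpr < groups_by_lds ? groups_by_vgpr : groups_by_lds;
    return groups * waves_per_group;
}
```

With a hypothetical 256-wide group using 32 VGPRs, halving the LDS footprint from 16384 to 8192 bytes raises residency from 16 to 32 wavefronts (the near-2x case). If the conversion pushes the kernel to 48 VGPRs, residency falls to 20 wavefronts, and at 88 VGPRs it falls to 8, below the uncompressed kernel.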

                       

                      The same code in Vulkan still benefits from compression, and the Vulkan shader is 4 times faster than the OpenCL kernel.

                      (Vulkan is generally faster, but usually only by about 10-20%.)