3 Replies Latest reply on Oct 12, 2015 7:06 PM by nibal

    clEnqueueCopyBuffer performance bug - 4bytes in 8ms

    tomer_gal

      Hi,

      While investigating performance variance for one of my clients, I found that the root cause appears to be a very large variance and slowdown in clEnqueueCopyBuffer.

      Attached is a screenshot in which copying 4 bytes on the GPU takes 8 ms. That's obviously a performance bug.

      It does not happen at the beginning of processing, so it is not related to any kind of warm-up.

       

      AMD Bug.jpg

       

      Regards,

      Tomer Gal, CTO at OpTeamizer

        • Re: clEnqueueCopyBuffer performance bug - 4bytes in 8ms
          nibal

          Is this just the first access of these 2 buffers? What do the rest of the CopyBuffer calls look like?

          CreateBuffer seems to be opportunistic (lazy), so there could be some buffer initialization going on at first use.

          Finally, a question I have had throughout these performance threads: could any other processes be running at the same time? If this is a display card, could display rendering be responsible for some of these performance variations?

            • Re: clEnqueueCopyBuffer performance bug - 4bytes in 8ms
              tomer_gal

              Hi Nibal,

              When we create the buffers, we also enqueue a write to them to make sure they are actually created before we start using them, so this is not a case of lazy initialization.

              As for other processes running, that's not the case. It is an 8-core machine, and the only thing running is the process with the OpenCL host code; no other time-consuming process is running.

              As for the display card, that's also not the issue. The display uses the Intel iGPU, while the AMD GPU is used solely for OpenCL compute.

               

              Regards,

              Tomer Gal

                • Re: clEnqueueCopyBuffer performance bug - 4bytes in 8ms
                  nibal

                  Hi Tomer,

                   

                  Thanks for the clarifications. Nicely controlled environment.

                  Have you verified the profiling from the host side? It doesn't have CodeXL's resolution, but the time needed to complete the CopyBuffer should equal the sum of the queueing and execution times in CodeXL's profiler.
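
A minimal sketch of that host-side check, assuming the queue was created with CL_QUEUE_PROFILING_ENABLE and the four timestamps of the copy event were read with clGetEventProfilingInfo (CL_PROFILING_COMMAND_QUEUED, _SUBMIT, _START and _END, all in nanoseconds). The function name `breakdown_ms` and the sample numbers are made up for illustration; only the arithmetic is shown, in Python:

```python
def breakdown_ms(queued_ns, submit_ns, start_ns, end_ns):
    """Split one command's profiling timestamps into phases, in milliseconds.

    The arguments are the nanosecond counters an OpenCL event reports for
    CL_PROFILING_COMMAND_QUEUED, _SUBMIT, _START and _END.
    """
    ns_per_ms = 1_000_000
    return {
        "queued_to_submit": (submit_ns - queued_ns) / ns_per_ms,
        "submit_to_start": (start_ns - submit_ns) / ns_per_ms,
        "execution": (end_ns - start_ns) / ns_per_ms,
        "total": (end_ns - queued_ns) / ns_per_ms,
    }


# Illustrative timestamps only, not measurements from this thread.
phases = breakdown_ms(1_000_000, 1_200_000, 8_900_000, 9_000_000)
for name, ms in phases.items():
    print(f"{name}: {ms:.3f} ms")
```

If most of the total sits between submit and start rather than in execution, that would suggest the copy waited before running rather than being slow itself.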

                  I have also seen weird execution times in my programs under the CodeXL profiler, even violating single-queue prioritization.

                  I have even seen an event complete before its kernel had started, so I assumed that this was a profiler issue.

                  (CodeXL is *very* buggy. I have given up raising tickets about it :-()
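
One way to tell a profiler display problem from a runtime problem is to read the raw event timestamps on the host and check their order. A small sketch in Python, assuming the four values come from clGetEventProfilingInfo and would normally be non-decreasing from QUEUED to END; the helper name `ordering_violations` is hypothetical:

```python
def ordering_violations(queued_ns, submit_ns, start_ns, end_ns):
    """Return the adjacent stage pairs whose timestamps go backwards."""
    stages = [
        ("QUEUED", queued_ns),
        ("SUBMIT", submit_ns),
        ("START", start_ns),
        ("END", end_ns),
    ]
    return [
        (earlier, later)
        for (earlier, t0), (later, t1) in zip(stages, stages[1:])
        if t1 < t0
    ]


# Illustrative values only: END is reported before START.
print(ordering_violations(100, 200, 900, 850))  # [('START', 'END')]
```

An empty list means the raw timestamps are consistent, so any odd ordering seen in the profiler view would more likely come from the tool than from the runtime.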

                  One last question: Is that the only CopyBuffer that looks like that, or are there more?