5 Replies Latest reply on Jun 10, 2010 2:29 PM by jcpalmer

    clFlush efficiency

    Zoltan.Maric

      I recently learned that enqueuing non-blocking commands requires an explicit call to clFlush.

      My question is: is it more efficient to load the queue with all the commands I wish to dispatch and flush once at the end, or is it better to flush after each command? The commands are:

      • write to input buffer (size 225280B)
      • execute 2 kernels working on the same read-only buffer
      • reading from the 2 result buffers (sizes 880B and 14080B)
        • clFlush efficiency
          nou

          IMHO, in the current implementation clFlush() == clFinish(),

          but I think that calling clFlush() after a block of commands is better.

            • clFlush efficiency
              jcpalmer

               

              Originally posted by: nou

              IMHO, in the current implementation clFlush() == clFinish(),
              but I think that calling clFlush() after a block of commands is better.

               

              Yes, it seems that NVIDIA's is currently the only implementation where clFlush does not block. On the others clFlush blocks, which makes clFinish just a wrapper around clFlush. OS X behaves the same as ATI. IBM's current behavior is unknown.

              In your situation there is probably not a big difference, but trying both ways should not be that hard.

              The big impact of the ATI and OS X behavior is on portable multi-GPU applications: they must use a thread-per-device approach, or the host will just alternate between blocking on each device.

              • clFlush efficiency
                Zoltan.Maric

                 

                Originally posted by: nou

                IMHO, in the current implementation clFlush() == clFinish(),
                but I think that calling clFlush() after a block of commands is better.

                 

                So let me get this straight: enqueuing non-blocking commands does not start them; clFlush starts them, but also blocks? How are you supposed to do anything non-blocking then?

                Also, my tests have shown that what you said may not be true. I called clFlush after queuing all the commands needed, did some work in the host program, and then called clWaitForEvents just before I needed the results.

                I timed the duration of the clWaitForEvents call, and it turned out to be around 4-5 ms. When I removed the flush at the end of enqueuing, clWaitForEvents blocked for 14-15 ms. So you could say I managed to do what I intended.

                Now I've also timed the duration of clFlush and clFinish at the end of enqueuing:

                • clFlush: 0-1 ms
                • clFinish: 15-16 ms

                I believe I have presented enough evidence to show that clFlush is NOT a wrapper for clFinish, or vice versa.
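Measurements like these come down to wrapping a single call in a wall-clock timer. A minimal sketch, assuming a POSIX system with `clock_gettime`; `time_call_ms` and `sleep_ms_cb` are hypothetical helpers, and the sleep callback merely stands in for a blocking OpenCL call so the snippet runs without a device.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: wall-clock duration of one call to fn(arg), in ms. */
double time_call_ms(void (*fn)(void *), void *arg)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    fn(arg);                              /* the call being measured */
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (double)(end.tv_sec - start.tv_sec) * 1e3
         + (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}

/* Stand-in for a blocking call: sleeps for *(int *)arg milliseconds.
 * In a real measurement the callback would instead call clFlush(queue),
 * clFinish(queue) or clWaitForEvents(...) on handles passed through arg. */
void sleep_ms_cb(void *arg)
{
    int ms = *(const int *)arg;
    struct timespec ts;

    ts.tv_sec  = ms / 1000;
    ts.tv_nsec = (long)(ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}
```

For example, `int ms = 20; double t = time_call_ms(sleep_ms_cb, &ms);` times the stand-in. A call that returns in 0-1 ms while the same work takes 15-16 ms under clFinish is evidence that the call only submits the commands and does not wait for them.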

                  • clFlush efficiency
                    Zoltan.Maric

                    BTW, I am working with the newest ATI implementation, on a Radeon HD 5830.

                    And to answer my own question: calling clFlush once after all the enqueuing is much faster than calling it after enqueuing each command.

                      • clFlush efficiency
                        jcpalmer

                         

                        Originally posted by: Zoltan.Maric

                        BTW, I am working with the newest ATI implementation, on a Radeon HD 5830.
                        And to answer my own question: calling clFlush once after all the enqueuing is much faster than calling it after enqueuing each command.

                         

                        I am glad you got your answer, and that clFlush was not observed blocking.  For stuff that is easy to test experimentally, it is usually a good idea to do so yourself.  Asking on a forum is also good, but the answers are not always definitive, and sometimes the info is stale (especially when searching old threads).  For tough problems, though, forum feedback is hard to ignore.

                        I went to double-check my NetBeans CPU profiling results for this on OS X, but could not find them.  (NetBeans has a great profiler that can find hotspots and hierarchically track CPU time and number of calls by method and by thread.  It displays the hierarchies in a tree/table format, which is great for drilling down.  The only problem is that you cannot name the results on save unless you are saving outside the project, so I do not keep them long.)

                        On retrying, I did not observe clFlush effectively blocking on OS X.  No doubt few on this forum care what that platform does, but I like to correct myself as necessary.  A search engine would still pick this up.