33 Replies Latest reply on Nov 19, 2012 6:25 AM by nou

    Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10

    realhet

      Hi,

       

      I've just downloaded the new driver and noticed that my program became 3.7% slower than before. I have figured out that the problem is that the new driver won't allow 2 kernels to execute overlapped.

       

      Here is what worked well before cat 12.10:

      start kernel1

      at 90% of the exec time of kernel1 -> start kernel2

      at 90% of the exec time of kernel2 -> start kernel3

      and so on.

      (kernel time is around 0.4 seconds)

       

      And the way it worked on OpenCL: make 2 contexts and run the two overlapping kernels in different contexts. Check for completion in a 20 msec timer and launch a new kernel when needed.

       

      Now this is not working with Catalyst 12.10. It's 3.7% slower now because CUs are sleeping between kernels :S. You know, it's like turning off one of the 32 CUs in an HD7970, just because of those bad kernel-to-kernel transitions.

      So if anyone knows, please tell me: what is the proper way to keep the CUs filled with work ALL THE TIME?

       

      Thank you for your answers.
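
To put the 3.7% figure in perspective, here is a rough back-of-the-envelope sketch. It assumes the whole slowdown is idle time between consecutive kernels and that there was no gap before the driver change; it uses the ~0.4 second kernel time quoted above. The variable names are illustrative only.

```python
# Back-of-the-envelope estimate of the idle gap implied by the slowdown.
# Assumption: before the driver change there was no gap at all, and the
# whole 3.7% slowdown is idle time between consecutive kernels.

kernel_time_s = 0.4   # approximate kernel duration from the post
slowdown = 0.037      # reported slowdown (3.7%)
num_cus = 32          # compute units on an HD7970

# Idle time per kernel launch needed to explain the slowdown.
gap_s = kernel_time_s * slowdown

# Throughput lost if one CU out of 32 were switched off, for comparison.
one_cu_fraction = 1 / num_cus

print(f"implied gap per kernel: {gap_s * 1000:.1f} ms")
print(f"share of one CU out of {num_cus}: {one_cu_fraction:.1%}")
```

Under these assumptions the implied gap is on the order of the 20 msec polling interval, and the loss is roughly comparable to one idle CU, which matches the comparison made in the post.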

        • Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10
          binying

          Would you mind posting your code?

            • Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10
              realhet

              Sorry, I can't post the exact code. If I did, my client would have to kill me. Though I should extract the relevant part and make a demo project out of it...

              But the symptom is simple: if there's already a kernel running on the card, the API will block you when you start another one. So there will be a gap between the 2 kernels. The problem is that it worked well with previous drivers and now it seems to be broken or changed somehow.

               

              BTW, it seems like the same behaviour the CAL API has suffered from since win_cat_12.2.

                • Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10
                  yurtesen

                  Why don't you just enqueue your kernels and let the runtime decide what to do with them? I think reliable synchronization between contexts of any sort is not supported by OpenCL in general?

                   

                  I am not sure if it was ever possible to run 2 kernels at the same time on the device anyway.

                  http://developer.amd.com/tools/hc/AMDAPPSDK/assets/AMD_APP_SDK_FAQ.pdf

                  11. Is it possible to run multiple AMD APP applications (compute and graphics) concurrently?

                  Multiple AMD APP applications can be run concurrently, as long as they do not access the same GPU at the same time. AMD APP applications that attempt to access the same GPU at the same time are automatically serialized by the runtime system.

                  Also see: http://devgurus.amd.com/message/1283414#1283414

                   

                  I think you would probably get better performance if you queued your kernels to the same command queue, one after another, within the same context, without waiting on events.

                   

                  In general it will be more time consuming to try to wait for one event to finish and then queue an event after that. Also, since you can't really check whether the execution will finish within 20 msec or not, I don't quite understand how you can queue another kernel in an overlapping fashion the way you described. Perhaps if you could at least provide some code fragments, it would be useful.

                    • Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10
                      realhet

                      Hi,

                      A few months ago, when I ported my stuff from CAL to OpenCL, the first thing I tried was the one-context method. There was that gap between the kernels (it was easy to notice because of the 1-2% slowdown while using exactly the same ISA microcode as the CAL version). Then I found the out-of-order flag (CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE) -> also a fail. And finally I tried making two contexts, and the thing worked excellently up until cat12.10. Btw, that two-context method was faster than CAL when I used more than one GPU device.

                       

                      "In general it will be more time consuming to try to wait for one event to finish and then queue an event after that."

                      Sure it is! That's why I enqueue a new kernel when the current one is about to finish. I know that a kernel finishes in 0.3 sec, so I can enqueue the next one 0.2 sec after kernel1 was launched.

                        • Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10
                          yurtesen

                          My colleague made a kernel which he enqueued thousands of times, and the average runtime was under 40 microseconds. (It is slightly strange, since the AMD manuals say the average latency of kernel execution is 70 microseconds on a decent GPU.) On the other hand, I had a problem where my program's runtime went up when I divided the work into several enqueues and used offsets, although that was rectified slightly by modifying the kernel code a bit (later we agreed it must have had something to do with cache behavior). So I have mixed experience with this myself. But the fastest option should be enqueuing all the kernels to a command queue using a single context and letting the runtime decide when to run what.

                           

                          Since I had some mixed results myself in the past, all I can say is that your approach sounds unintuitive to me. For example, the problem could be caused by some improvement in the driver which protects data within one context from mixing with another context. (It would be bad security-wise if two programs with separate contexts could read each other's leftover data, right?) Therefore, at least in my opinion, I can't think of a reason why I would blame the driver for this.

                           

                          Did you try to profile your program to see what it is doing exactly? You never mentioned how you figured out that your kernels were running overlapped in the first place (which should have been impossible as far as I know...). Maybe your original speedup was due to some other reason?

                            • Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10
                              nou

                              70 microseconds is the latency. If you send a 40 microsecond kernel 10 times, then the total execution time should be 70+10*40 microseconds. The driver can batch several kernel executions into one command so they get executed faster.
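
The arithmetic above can be written out as a small timing model. This is only a sketch under the stated numbers (70 microseconds of launch latency, 40 microseconds per kernel); the function names are made up for illustration, and real drivers may batch differently.

```python
# Simple timing model for back-to-back kernel launches.
# Assumptions: a fixed 70 us launch latency and 40 us of execution per
# kernel, as in the numbers quoted above. Real drivers may behave differently.

LATENCY_US = 70
KERNEL_US = 40

def batched_total_us(n):
    """Latency is paid once; the n kernels then run back to back."""
    return LATENCY_US + n * KERNEL_US

def serialized_total_us(n):
    """Latency is paid for every kernel (enqueue, wait, enqueue, ...)."""
    return n * (LATENCY_US + KERNEL_US)

n = 10
print("batched:   ", batched_total_us(n), "us")
print("serialized:", serialized_total_us(n), "us")
```

In this model the average cost per kernel approaches the 40 microsecond execution time as the batch grows, which would be consistent with an average runtime well below the quoted latency.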

                                • Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10
                                  yurtesen

                                  nou wrote:

                                   

                                  70 microseconds is the latency. If you send a 40 microsecond kernel 10 times, then the total execution time should be 70+10*40 microseconds. The driver can batch several kernel executions into one command so they get executed faster.

                                  Yes, we thought about the throughput being different, but I would think there must be at least some small additional latency when a kernel ends and the next one starts. I think it was still surprisingly fast. One thing is that we ran exactly the same kernel on exactly the same data (for testing purposes); if some kernel parameters were different, perhaps there would be some overhead.

                                  Anyway, I guess this doesn't help realhet that much, other than showing that kernels are able to run one after another without significant delay.

                        • Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10
                          notzed

                          Why are you polling and sleeping? Any mismatch of the cycle time (which will be hard to predict) will cause idle 'stalls'.

                           

                          Just waiting for a kernel to finish and then starting one straight away is quite a loss (obviously not something you can always avoid), and that is the best-case scenario possible.

                           

                          Have you tried putting each kernel on a separate thread with a separate queue? And rather than a sleep/poll loop, just doing a blocking CL operation for the synchronisation? I presume from your description they are independent.
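
To illustrate why a mismatch between the polling period and the kernel time causes stalls, here is a toy model. It assumes the next kernel is launched only on the first timer tick at or after the previous one completes, which is a simplification of the scheme described in this thread; the kernel durations other than 400 ms are hypothetical.

```python
# Toy model of a sleep/poll loop (all times in integer milliseconds).
# Assumption: a kernel is launched on a timer tick, and the next one is
# launched on the first tick at or after the previous kernel completes.

def idle_per_kernel_ms(kernel_ms, poll_ms=20):
    """Idle time between kernel completion and the next timer tick."""
    return (-kernel_ms) % poll_ms

def idle_fraction(kernel_ms, poll_ms=20):
    """Share of wall time the device sits idle under this model."""
    idle = idle_per_kernel_ms(kernel_ms, poll_ms)
    return idle / (kernel_ms + idle)

# 400 ms matches the timer exactly; 390 and 381 are hypothetical mismatches.
for kernel_ms in (400, 390, 381):
    idle = idle_per_kernel_ms(kernel_ms)
    print(f"kernel {kernel_ms} ms: idle {idle} ms "
          f"({idle_fraction(kernel_ms):.1%} of wall time)")
```

A blocking wait on the kernel's completion would avoid this particular source of idle time, since the host thread wakes as soon as the kernel finishes rather than on the next tick.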

                            • Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10
                              yurtesen

                              notzed, why do you think using different queues will give better performance than queuing the kernels in the same queue?

                                • Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10
                                  notzed

                                  It's more to do with removing the sleep in the poll loop.

                                   

                                  I thought the OP was already using different queues because there's no other way to get overlapping kernel execution.

                                    • Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10
                                      yurtesen

                                      There won't be overlapping kernel execution on a single card. You can achieve it if you have 2+ cards and a queue for each card; then the different cards can run kernels in an overlapping fashion (on different devices). Therefore, having multiple queues should not increase performance if a single device is used? I believe it would be easier for the runtime to just run the next kernel in the same queue right away than to try to schedule which queue to run and when; intuitively, there might be additional delays when switching between queues.

                                       

                                      See:

                                      http://developer.amd.com/tools/hc/AMDAPPSDK/assets/AMD_APP_SDK_FAQ.pdf

                                      11. Is it possible to run multiple AMD APP applications (compute and graphics) concurrently?

                                      Multiple AMD APP applications can be run concurrently, as long as they do not access the same GPU at the same time. AMD APP applications that attempt to access the same GPU at the same time are automatically serialized by the runtime system.

                                        • Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10
                                          notzed

                                          That is talking about running shader kernels and OpenCL kernels at the same time.

                                            • Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10
                                              yurtesen

                                              The text says "Multiple AMD APP applications" (though I can see how the title is confusing). If the current AMD APP SDK supported concurrent kernel execution, this would probably be listed as a feature. It would be weird if AMD had forgotten to mention that they implemented such an important feature. Since there is no document (afaik) which says that kernels can be run concurrently, how do you come up with the idea that you can run two or more kernels concurrently on a single card? Do you have an example program which can run kernels concurrently? (on a single GPU, obviously)

                                               

                                              Furthermore, I made the following tests (albeit on Cypress; perhaps I should re-test on Tahiti to be 100% sure). I have a kernel for which a large problem is queued in 50k-sized pieces. I tried queuing the kernel to the same queue (0-50k, 50k-100k, ...) and also using threads and multiple queues (q1=0-50k, q2=50k-100k and so on; the next piece goes to the queue which finishes first). Both schemes took exactly the same total runtime. As a matter of fact, the individual kernel runtimes were exactly the same. If the kernels had run overlapping, there would have been some differences. (The program was originally designed to run multi-device and it accomplishes simultaneous runs on multiple devices; I simply changed the code to run multi-queue on the same device.)

                                              Total number of worker threads 2

                                              Slave node 0 thread 0 offset 0 length 50000 events 1 time 2.07 seconds

                                              Slave node 0 thread 1 offset 50000 length 50000 events 1 time 1.86 seconds

                                              Slave node 0 thread 0 offset 100000 length 50000 events 1 time 1.72 seconds

                                              Slave node 0 thread 1 offset 150000 length 50000 events 1 time 2.10 seconds

                                              Slave node 0 thread 0 offset 200000 length 50000 events 1 time 1.95 seconds

                                              Slave node 0 thread 1 offset 250000 length 50000 events 1 time 1.87 seconds

                                              Slave node 0 thread 0 offset 300000 length 50000 events 1 time 1.60 seconds

                                              Slave node 0 thread 1 offset 350000 length 50000 events 1 time 1.15 seconds

                                              Slave node 0 thread 0 offset 400000 length 30932 events 1 time 0.97 seconds

                                               

                                              Total number of worker threads 1

                                              Slave node 0 thread 0 offset 0 length 50000 events 1 time 2.07 seconds

                                              Slave node 0 thread 0 offset 50000 length 50000 events 1 time 1.86 seconds

                                              Slave node 0 thread 0 offset 100000 length 50000 events 1 time 1.72 seconds

                                              Slave node 0 thread 0 offset 150000 length 50000 events 1 time 2.10 seconds

                                              Slave node 0 thread 0 offset 200000 length 50000 events 1 time 1.95 seconds

                                              Slave node 0 thread 0 offset 250000 length 50000 events 1 time 1.86 seconds

                                              Slave node 0 thread 0 offset 300000 length 50000 events 1 time 1.60 seconds

                                              Slave node 0 thread 0 offset 350000 length 50000 events 1 time 1.15 seconds

                                              Slave node 0 thread 0 offset 400000 length 30932 events 1 time 0.97 seconds

                                               

                                              In another test, I tried to run two copies of an app where the kernel run took ~30 seconds. One finished in 31.3 seconds, the next one finished in 62.4 seconds. (As it appears, after I queued the kernel, it did not start executing before the first program's kernel finished.)

                                                • Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10
                                                  notzed

                                                  I dunno - it was plastered all over the GCN marketing materials as a pretty big plus-point.

                                                   

                                                  From the programming guide, Section 6.11 'optimisation guidelines for Southern Islands GPUs':

                                                  Execution of kernel dispatches can overlap if there are no dependencies between them and if there are resources available in the GPU. This is critical when writing benchmarks it is important that the measurements are accurate and that “false dependencies” do not cause unnecessary slowdowns.

                                                  ...

                                                    • Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10
                                                      yurtesen

                                                      I will run tests on the 7970 and report back. I also found that the guide says

                                                      AMD Southern Islands GPUs can execute multiple kernels simultaneously when there are no dependencies.

                                                      However, as far as I understand, this does not mean that you need multiple queues (since that is not mentioned). Afaik this feature was not supported before, even though the hardware allowed it. Also, if it is supported, there is no documentation of how the runtime checks whether there are dependencies, or of how we can know why it would or would not run kernels simultaneously. It might even decline to do so if it thinks the GPU resources are busy and overlapping would impede the performance of the already running kernel. There are no guarantees afaik. For the same reasons, it might decide not to mix kernels from different queues, since it can't tell which kernel should have precedence, or maybe the queues do not know each other's business well enough to tell whether there are dependencies, so the runtime chooses the safe way...

                                                       

                                                      While there are a lot of assumptions involved, my intuition says that it would be best to queue kernels in a single queue for a single device and let the runtime decide what to do with them.

                                                      • Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10
                                                        drallan

                                                        I was able to get multiple kernels to overlap (on Tahiti), but only under certain conditions.

                                                        Like yurtesen, I tried various 2-queue methods that work on multiple GPUs, and they all fail.

                                                        It seems to work only with one queue (as suggested above). Here's what seemed to work:

                                                         

                                                        1. Single queue.

                                                        2. Each kernel must use different output buffers.

                                                        3. Input buffers can be the same or different on overlapping kernels.

                                                        4. Works only with a single kernel.

                                                        With 2 kernels, it only works using the same kernel source file and same kernel name. (yah, same kernel)

                                                         

                                                        I also tried with Catalyst version 12.10 drivers and it does overlap, but my test kernel ran slower on this driver.

                                                        Times for 2 kernels with and without overlap:

                                                         

                                                                               overlap    no-overlap
                                                        Old driver (8.98??)    2.44       4.33
                                                        12.10 driver           4.23       8.20

                                                         

                                                         

                                                        drallan

                                                          • Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10
                                                            nou

                                                            It must be different buffers, as otherwise it would break in-order execution of the kernels. Without an out-of-order queue, all kernel executions on the same queue are implicitly synchronized.

                                                              • Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10
                                                                yurtesen

                                                                Drallan, now I have the opposite results. I have now run the same test program which I ran on Cypress on Tahiti with the 12.10 drivers. I have to say, things seem to be getting overlapped. This program is multi-queue, but exactly the same input/output buffers and kernel are used (in this modified version for single-GPU multi-queue). There is a 10% performance increase....? Does this mean that the SDK thinks these kernels must be independent because they are on different queues?

                                                                 

                                                                I say overlapping because the event times add up to more than the 10.1 second wall time, so multiple events must have been active at the same time.

                                                                Two queues:

                                                                ------------------

                                                                Slave node 0 thread 1 offset 50000 length 50000 events 1 time 2.64 seconds

                                                                Slave node 0 thread 0 offset 0 length 50000 events 1 time 1.84 seconds

                                                                Slave node 0 thread 1 offset 100000 length 50000 events 1 time 1.27 seconds

                                                                Slave node 0 thread 0 offset 150000 length 50000 events 1 time 1.97 seconds

                                                                Slave node 0 thread 1 offset 200000 length 50000 events 1 time 2.64 seconds

                                                                Slave node 0 thread 0 offset 250000 length 50000 events 1 time 1.76 seconds

                                                                Slave node 0 thread 1 offset 300000 length 50000 events 1 time 2.20 seconds

                                                                Slave node 0 thread 0 offset 350000 length 50000 events 1 time 1.14 seconds

                                                                Slave node 0 thread 1 offset 400000 length 30932 events 1 time 1.30 seconds

                                                                WALL time =   10.1 seconds

                                                                 

                                                                Single queue:

                                                                --------------------

                                                                Slave node 0 thread 0 offset 0 length 50000 events 1 time 1.53 seconds

                                                                Slave node 0 thread 0 offset 50000 length 50000 events 1 time 1.40 seconds

                                                                Slave node 0 thread 0 offset 100000 length 50000 events 1 time 1.27 seconds

                                                                Slave node 0 thread 0 offset 150000 length 50000 events 1 time 1.75 seconds

                                                                Slave node 0 thread 0 offset 200000 length 50000 events 1 time 1.43 seconds

                                                                Slave node 0 thread 0 offset 250000 length 50000 events 1 time 1.40 seconds

                                                                Slave node 0 thread 0 offset 300000 length 50000 events 1 time 0.95 seconds

                                                                Slave node 0 thread 0 offset 350000 length 50000 events 1 time 0.89 seconds

                                                                Slave node 0 thread 0 offset 400000 length 30932 events 1 time 0.54 seconds

                                                                WALL time =   11.2 seconds
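
The overlap argument can be checked directly from the two logs above. The sketch below copies the per-event times and wall times from the logs and tests whether the events could fit end to end inside the wall time. It assumes the logged event times are pure execution times; if they include time spent waiting in the queue, the conclusion would be weaker. The function name is illustrative only.

```python
# Per-event times (seconds) copied from the two logs above.
two_queue_events = [2.64, 1.84, 1.27, 1.97, 2.64, 1.76, 2.20, 1.14, 1.30]
single_queue_events = [1.53, 1.40, 1.27, 1.75, 1.43, 1.40, 0.95, 0.89, 0.54]

two_queue_wall = 10.1     # reported wall time, two queues
single_queue_wall = 11.2  # reported wall time, single queue

def must_have_overlapped(event_times, wall_time):
    """True if the events cannot fit end to end inside the wall time."""
    return sum(event_times) > wall_time

for name, events, wall in (
    ("two queues", two_queue_events, two_queue_wall),
    ("single queue", single_queue_events, single_queue_wall),
):
    print(f"{name}: event sum {sum(events):.2f} s, wall {wall} s, "
          f"overlap implied: {must_have_overlapped(events, wall)}")

# Wall-time ratio between the two runs.
speedup = single_queue_wall / two_queue_wall
print(f"single-queue wall / two-queue wall = {speedup:.2f}")
```

With these numbers the two-queue events sum to well over the wall time, while the single-queue events sum to just under it, which agrees with the roughly 10% difference reported.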