15 Replies Latest reply on Dec 13, 2013 3:20 PM by Bdot

    Cat13.4: How to avoid the high CPU load for GPU kernels?

    Bdot

      Hi,

       

      since 13.4 and 13.5 beta, my OpenCL GPU program consumes ~80% of one CPU core while in clFinish, waiting for a string of GPU kernels and a final clEnqueueReadBuffer. My main thread looks like this

      [code]

      ntdll.dll!NtWaitForSingleObject+0xa

      KERNELBASE.dll!WaitForSingleObjectEx+0x9c

      amdocl64.dll!clGetSamplerInfo+0x1031c

      amdocl64.dll!clGetSamplerInfo+0x101f8

      amdocl64.dll!clGetSamplerInfo+0x120ea

      amdocl64.dll!clGetSamplerInfo+0x4b51

      amdocl64.dll!clFinish+0x89

      mfakto.exe!tf_class_opencl+0xf94

      mfakto.exe!tf+0x583

      mfakto.exe!main+0x117d

      mfakto.exe!__tmainCRTStartup+0x11a

      kernel32.dll!BaseThreadInitThunk+0xd

      ntdll.dll!RtlUserThreadStart+0x21

      [/code]

      and is using 0.01% CPU.

       

      However, there is another thread:

      [code]

      amdocl64.dll!clIcdGetPlatformIDsKHR+0x3e5

      amdocl64.dll!clGetSamplerInfo+0x49cf

      amdocl64.dll!clGetSamplerInfo+0x38af2

      amdocl64.dll!clGetSamplerInfo+0x38d18

      amdocl64.dll!clGetSamplerInfo+0x504e

      amdocl64.dll!clGetSamplerInfo+0x5172

      amdocl64.dll!clGetSamplerInfo+0x1ccf

      kernel32.dll!BaseThreadInitThunk+0xd

      ntdll.dll!RtlUserThreadStart+0x21

      [/code]

      that is using ~19% CPU (76% of a core). The upper part of the stack changes - it is not stuck in clIcdGetPlatformIDsKHR.

       

      When using a CPU-hungry program to consume almost all CPU and starve my program, then this thread's CPU load goes back to almost nothing, but the GPU is not fed very well and GPU load is very jumpy between 70-98%. GPU load would normally be pegged at 100%.

       

      When rolling back to cat13.3, the program's total CPU load is at ~0.1-0.3%, and running a CPU-hog has almost no effect on my program.

       

      Is there anything special to be done on the newer drivers to make them leave the CPU alone? Is there any setting to get the CPU-behavior of the previous drivers?

       

      My environment: HD5770+Phenom II X4 955, Win7-64. I got reports that the same happens with an APU and the integrated 6550D (also Win7-64).

       

      Note: Making the final clEnqueueReadBuffer synchronous instead of the final clFinish does not change the CPU load.

       

      ... and could someone please give me a hint how I can get a proper code formatting in this forum? Thanks a lot!

        • Re: Cat13.4: How to avoid the high CPU load for GPU kernels?
          Bdot

          One more detail: when inserting a clFinish between each kernel invocation, then the CPU load issue is gone. (But I then have a performance problem in my program, and it is even more sensitive to running other CPU-intense applications.)

           

          So far my kernels use the same queue, with no synchronization events. Would a different setup here help ?

          1 of 1 people found this helpful
            • Re: Cat13.4: How to avoid the high CPU load for GPU kernels?
              himanshu.gautam

              You can try using more number of command queues. (separate queue for kernel execution, and data read and data write). Also it is advisable to use to cl_events to synchronize between multiple command queues. Not sure if this will be helpful in this scenario though.

               

              Please attach your code as a zipped file, which is more handy if the code is bigger than a few lines.

                • Re: Cat13.4: How to avoid the high CPU load for GPU kernels?
                  Bdot

                  Hi, I finally managed to extend the HelloWorld example so that it shows this problem of high CPU load when running on a GPU. For that, I used two kernels so that the output of the first is the input of the second. Both kernels do some serious calculations in order to consume some time (in this example it is totally useless. Also, my real program takes care that the output of the second kernel is not overwritten by subsequen calls - that is just here in this simplification).

                   

                  When running this test program, both kernels will be scheduled alternatingly a couple of times. Then, the main thread reaches the blocking clEnqueueReadBuffer to read the final result. There it needs to wait for the scheduled kernels to finish. While waiting, one CPU is at 100% as described in the initial post of this thread.

                   

                  When I find out how, I will attach the source and test program ... Edit: Why is this "attach" link missing when writing the answer and appears only when editing the message? Very weird forum software :-/

                    • Re: Cat13.4: How to avoid the high CPU load for GPU kernels?
                      Bdot

                      Oh, it seems this is even easier to reproduce: as soon as more than a certain number (64? or maybe 128?) of kernels are scheduled, one CPU goes high. It can even be the same kernel, without any dependency. I think, this is now clearly a bug in the drivers.

                       

                      This also means, that the following code can somewhat avoid this bug while not sacrificing too much performance:

                       

                      #define MODULUS 64

                        for (int i=0; i<256000; i++)

                        {

                            status = clEnqueueNDRangeKernel(commandQueue, createkernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);

                       

                          switch (i%MODULUS)

                          {

                            case 0:

                              status = clEnqueueNDRangeKernel(commandQueue, outputkernel, 1, NULL, global_work_size, NULL, 0, NULL, &outputEvent);

                              clFlush(commandQueue);

                              break;

                            case MODULUS/2:

                              status = clEnqueueNDRangeKernel(commandQueue, outputkernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);

                              clFlush(commandQueue);

                              status = clWaitForEvents(1, &outputEvent);

                              status = clReleaseEvent(outputEvent);

                              outputEvent = NULL;

                              break;

                            default :

                              status = clEnqueueNDRangeKernel(commandQueue, outputkernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);

                              break;

                          }

                        }

                       

                        /*Step 11: Read the output back to host memory.*/

                          status = clEnqueueReadBuffer(commandQueue, outputBuffer, CL_TRUE, 0, strlength * sizeof(char), output, 0, NULL, NULL);

                        if (outputEvent) status = clReleaseEvent(outputEvent);

                       

                      The loop will now try to keep a few kernels "in flight" while waiting for one of them to keep the queue short. However, for short-running kernels, the disadvantage is that the GPU can no longer be fully loaded - the queue runs empty too frequently.

                       

                      AMD, please fix the CPU-load problem so that workarounds like above are not necessary. Did I already mention that Catalyst 13.1 -13.3 did not show this problem?

                      1 of 1 people found this helpful
                • Re: Cat13.4: How to avoid the high CPU load for GPU kernels?
                  Raistmer

                  Regrading GPU app increased sensitivity to CPU load: try to set your GPU app affinity to any single core. In our experiments it helps a lot.

                  Very interesting observation regarding number of enqueued kernels and increase in CPU load.

                  Could you please specify do you use clFlush() between or only enqueuing call ?

                    • Re: Cat13.4: How to avoid the high CPU load for GPU kernels?
                      Bdot

                      Hi Raistmer,

                       

                      with the latest drivers (13.11beta9v2), the sensitivity to high CPU load has dropped significantly, that's at least what I see now, but I had to change my kernels quite a bit to make them work on this driver (don't use atomics anymore - they keep killing the driver, or causing blue screens). So I cannot say for sure that this part is OK now, but a test program has a performance loss of less than 1 % when all 4 CPUs are fully loaded, compared to an idle system.

                       

                      BTW, when the cores are already at full load working on something else, then the high number of enqueued kernels just causes 1-3%CPU load. Only if there is CPU available, it will consume it.

                       

                      The high CPU load occurs no matter if I use clFlush or not. I have two interlocked kernels, and I run clFlush after the first (and clFinish after the loop). But enqueueing all kernels at once and only running clFinish once is the same.

                        • Re: Cat13.4: How to avoid the high CPU load for GPU kernels?
                          Raistmer

                          I see, thanks. And regarding increased CPU consumption when idle - yes, I see the same om C-60 based netbook with 12.8 (?) Mobility Drivers (hard to tell what drivers are in reality, GPU-Z fails to say Catalyst number).

                          With idle CPU CPU times on benchmark much higher than on loaded system (though elapsed time less).

                           

                          P.S did not check latest beta so far, but with released 13.4 drivers my app experience slowdown comparing with 12.8 drivers. This slowdown it seems comes completely from worse OpenCL compiler though.

                          When I use pre-compiled binaries that were generated under Cat 12.8 with Cat 13.4 there is no performance drop. Strange compiler degradation

                          • Re: Cat13.4: How to avoid the high CPU load for GPU kernels?
                            realhet

                            Hi,

                            Have you tried that dumb but effective workaround when

                            - compact the job into a very long kernel (lets say 500ms)

                            - set up 2 contexts

                            - only one job per context

                            First you start a job on ctx 1, then sleep until that job almost finishes, and they start the next job on ctx2, and you can read the results from ctx1. Here you can sleep again until ctx 2 finishes and start a new one on ctx1 before processing ctx2's results.

                             

                            99 percent gpu utilization and 1 percent cpu usage. It works even on the old cal drivers where 100 percent cpu was guaranteed if you had more than 1 jobs per ctx.

                            But unfortunately it requires simple kernels with predictable running times.

                              • Re: Cat13.4: How to avoid the high CPU load for GPU kernels?
                                Bdot

                                Hi realhet,

                                this would work but does not fit my use case very well (I have difficulties predicting the kernel run time). The workaround I have implemented now adds an event to every 3rd pair of kernels and then waits for completion before scheduling the next 3 pairs. Using this, the total performance drops by ~1%, but the CPU is kept free. I will try adding the event to the second of 3 pairs, wait for the event, and then again schedule 3 pairs with the event on the second. This should generate enough overlapping to get back the 1% performance ;-)