
Bdot
Adept III

Cat13.4: How to avoid the high CPU load for GPU kernels?

Hi,

since Catalyst 13.4 and the 13.5 beta, my OpenCL GPU program consumes ~80% of one CPU core while sitting in clFinish, waiting for a string of GPU kernels and a final clEnqueueReadBuffer. My main thread looks like this:

ntdll.dll!NtWaitForSingleObject+0xa
KERNELBASE.dll!WaitForSingleObjectEx+0x9c
amdocl64.dll!clGetSamplerInfo+0x1031c
amdocl64.dll!clGetSamplerInfo+0x101f8
amdocl64.dll!clGetSamplerInfo+0x120ea
amdocl64.dll!clGetSamplerInfo+0x4b51
amdocl64.dll!clFinish+0x89
mfakto.exe!tf_class_opencl+0xf94
mfakto.exe!tf+0x583
mfakto.exe!main+0x117d
mfakto.exe!__tmainCRTStartup+0x11a
kernel32.dll!BaseThreadInitThunk+0xd
ntdll.dll!RtlUserThreadStart+0x21

and is using 0.01% CPU.

However, there is another thread:

amdocl64.dll!clIcdGetPlatformIDsKHR+0x3e5
amdocl64.dll!clGetSamplerInfo+0x49cf
amdocl64.dll!clGetSamplerInfo+0x38af2
amdocl64.dll!clGetSamplerInfo+0x38d18
amdocl64.dll!clGetSamplerInfo+0x504e
amdocl64.dll!clGetSamplerInfo+0x5172
amdocl64.dll!clGetSamplerInfo+0x1ccf
kernel32.dll!BaseThreadInitThunk+0xd
ntdll.dll!RtlUserThreadStart+0x21

that is using ~19% CPU (76% of a core). The upper part of the stack changes - it is not stuck in clIcdGetPlatformIDsKHR.

When I use a CPU-hungry program to consume almost all CPU and starve my program, this thread's CPU load goes back to almost nothing, but the GPU is not fed very well and the GPU load jumps around between 70% and 98%. Normally the GPU load would be pegged at 100%.

When rolling back to cat13.3, the program's total CPU load is at ~0.1-0.3%, and running a CPU-hog has almost no effect on my program.

Is there anything special to be done on the newer drivers to make them leave the CPU alone? Is there any setting to get the CPU-behavior of the previous drivers?

My environment: HD5770+Phenom II X4 955, Win7-64. I got reports that the same happens with an APU and the integrated 6550D (also Win7-64).

Note: Making the final clEnqueueReadBuffer synchronous instead of the final clFinish does not change the CPU load.

... and could someone please give me a hint on how to get proper code formatting in this forum? Thanks a lot!

0 Likes
15 Replies
Bdot
Adept III

One more detail: when I insert a clFinish between each kernel invocation, the CPU load issue is gone. (But then I have a performance problem in my program, and it becomes even more sensitive to other CPU-intensive applications running alongside it.)

So far my kernels use the same queue, with no synchronization events. Would a different setup help here?

You can try using more command queues (for example, a separate queue for kernel execution, data reads, and data writes). It is also advisable to use cl_events to synchronize between multiple command queues. Not sure if this will be helpful in this scenario, though.

Please attach your code as a zipped file, which is handier if the code is bigger than a few lines.

0 Likes

Hi, I finally managed to extend the HelloWorld example so that it shows this problem of high CPU load when running on a GPU. For that, I used two kernels so that the output of the first is the input of the second. Both kernels do some serious calculation in order to consume some time (in this example the computation is totally useless). Also, my real program takes care that the output of the second kernel is not overwritten by subsequent calls - that safeguard is omitted in this simplification.

When running this test program, both kernels are scheduled alternately a number of times. Then the main thread reaches the blocking clEnqueueReadBuffer to read the final result, where it has to wait for the scheduled kernels to finish. While waiting, one CPU is at 100% as described in the initial post of this thread.

When I find out how, I will attach the source and test program ... Edit: why is this "attach" link missing when writing the answer and only appears when editing the message? Very weird forum software 😕

0 Likes

Oh, it seems this is even easier to reproduce: as soon as more than a certain number (64? or maybe 128?) of kernels are scheduled, one CPU goes high. It can even be the same kernel, without any dependency. I think this is now clearly a bug in the drivers.

This also means that the following code can somewhat avoid this bug without sacrificing too much performance:

#define MODULUS 64

for (int i = 0; i < 256000; i++)
{
    status = clEnqueueNDRangeKernel(commandQueue, createkernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);
    switch (i % MODULUS)
    {
    case 0:
        status = clEnqueueNDRangeKernel(commandQueue, outputkernel, 1, NULL, global_work_size, NULL, 0, NULL, &outputEvent);
        clFlush(commandQueue);
        break;
    case MODULUS/2:
        status = clEnqueueNDRangeKernel(commandQueue, outputkernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);
        clFlush(commandQueue);
        status = clWaitForEvents(1, &outputEvent);
        status = clReleaseEvent(outputEvent);
        outputEvent = NULL;
        break;
    default:
        status = clEnqueueNDRangeKernel(commandQueue, outputkernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);
        break;
    }
}

/* Step 11: Read the output back to host memory. */
status = clEnqueueReadBuffer(commandQueue, outputBuffer, CL_TRUE, 0, strlength * sizeof(char), output, 0, NULL, NULL);
if (outputEvent) status = clReleaseEvent(outputEvent);

The loop now tries to keep a few kernels "in flight" while waiting for one of them, in order to keep the queue short. However, for short-running kernels the disadvantage is that the GPU can no longer be fully loaded - the queue runs empty too frequently.

AMD, please fix the CPU-load problem so that workarounds like the above are not necessary. Did I already mention that Catalyst 13.1-13.3 did not show this problem?

Hi,

I executed the program attached here.

My environment is a Trinity APU machine, Windows 7 64-bit, with Visual Studio.

I ran it via VS2010. The program hangs at the clEnqueueReadBuffer() call, and the CPU usage is very low, around 0 to 1%.

Did you use any debugger or other tools to analyse the performance?

If so, please let me know and I will check with those as well.

0 Likes

Thank you for checking on this issue. It is intended that the program takes quite some time to finish, and it will wait in clEnqueueReadBuffer. As APUs are often less powerful than discrete GPUs, you may want to reduce the number of loops that are run in the kernels.

However, if you already see almost no CPU load while the program runs (I hope the GPU load is ~100%), then the problem may not exist on APUs, or you are already running a driver newer than 13.10 that may not show the issue. To reproduce the high CPU load, I suggest using a discrete GPU with a Catalyst version between 13.4 and 13.10.

0 Likes

Yes, I had tested with the 13.11 beta driver.

I will try to get hold of a discrete GPU, test it again, and get back to you.

0 Likes

I just tried AMD_Catalyst_13.11_BetaV9.2 and the high CPU load is still reproducible on my HD5770.

Did you find the time to test this a bit more?

Thanks a lot for your help!

0 Likes

I also tried rebuilding it with APP SDK 2.9; the results are unchanged: scheduling a high number of kernels (~100-200 and more) leads to high CPU load on discrete GPUs.

0 Likes

Did you have a chance to test it on a discrete GPU? While experimenting with longer-running kernels, I noticed that the high CPU load already kicks in with 8 kernels "in flight".

0 Likes
Raistmer
Adept II

Regarding the GPU app's increased sensitivity to CPU load: try setting your GPU app's affinity to a single core. In our experiments it helps a lot.

Very interesting observation regarding the number of enqueued kernels and the increase in CPU load.

Could you please specify whether you use clFlush() in between, or only the enqueue calls?

0 Likes

Hi Raistmer,

with the latest drivers (13.11 beta 9 v2), the sensitivity to high CPU load has dropped significantly - at least that's what I see now - but I had to change my kernels quite a bit to make them work on this driver (I don't use atomics anymore; they keep killing the driver or causing blue screens). So I cannot say for sure that this part is OK now, but a test program loses less than 1% performance when all 4 CPUs are fully loaded, compared to an idle system.

BTW, when the cores are already at full load working on something else, the high number of enqueued kernels causes just 1-3% CPU load. It only consumes the CPU if there is CPU available.

The high CPU load occurs whether I use clFlush or not. I have two interlocked kernels, and I run clFlush after the first (and clFinish after the loop). But enqueueing all kernels at once and only running clFinish once behaves the same.

0 Likes

I see, thanks. And regarding increased CPU consumption when idle - yes, I see the same on a C-60 based netbook with 12.8 (?) Mobility drivers (hard to tell what the drivers really are; GPU-Z fails to report the Catalyst number).

With an idle CPU, the benchmark's CPU times are much higher than on a loaded system (though the elapsed time is lower).

P.S. I did not check the latest beta so far, but with the released 13.4 drivers my app experiences a slowdown compared with the 12.8 drivers. It seems this slowdown comes entirely from a worse OpenCL compiler, though.

When I use pre-compiled binaries that were generated under Cat 12.8 with Cat 13.4, there is no performance drop. Strange compiler degradation.

0 Likes

Hi,

Have you tried that dumb but effective workaround where you

- compact the job into a very long kernel (let's say 500 ms)

- set up 2 contexts

- run only one job per context

First you start a job on ctx1, then sleep until that job almost finishes, then start the next job on ctx2, and you can read the results from ctx1. Here you can sleep again until ctx2 finishes, and start a new job on ctx1 before processing ctx2's results.

This gives 99 percent GPU utilization and 1 percent CPU usage. It works even on the old CAL drivers, where 100 percent CPU was guaranteed if you had more than one job per ctx.

But unfortunately it requires simple kernels with predictable running times.

0 Likes

Hi realhet,

this would work but does not fit my use case very well (I have difficulties predicting the kernel run time). The workaround I have implemented now adds an event to every 3rd pair of kernels and then waits for its completion before scheduling the next 3 pairs. With this, the total performance drops by ~1%, but the CPU is kept free. I will try adding the event to the second of the 3 pairs, waiting for that event, and then again scheduling 3 pairs with the event on the second one. This should generate enough overlap to get back the 1% performance 😉

0 Likes