
Bdot
Adept III

Cat13.4: How to avoid the high CPU load for GPU kernels?

Hi,

Since Catalyst 13.4 and the 13.5 beta, my OpenCL GPU program consumes ~80% of one CPU core while it sits in clFinish, waiting for a string of GPU kernels and a final clEnqueueReadBuffer. My main thread looks like this:

ntdll.dll!NtWaitForSingleObject+0xa
KERNELBASE.dll!WaitForSingleObjectEx+0x9c
amdocl64.dll!clGetSamplerInfo+0x1031c
amdocl64.dll!clGetSamplerInfo+0x101f8
amdocl64.dll!clGetSamplerInfo+0x120ea
amdocl64.dll!clGetSamplerInfo+0x4b51
amdocl64.dll!clFinish+0x89
mfakto.exe!tf_class_opencl+0xf94
mfakto.exe!tf+0x583
mfakto.exe!main+0x117d
mfakto.exe!__tmainCRTStartup+0x11a
kernel32.dll!BaseThreadInitThunk+0xd
ntdll.dll!RtlUserThreadStart+0x21

and is using 0.01% CPU.

However, there is another thread:

amdocl64.dll!clIcdGetPlatformIDsKHR+0x3e5
amdocl64.dll!clGetSamplerInfo+0x49cf
amdocl64.dll!clGetSamplerInfo+0x38af2
amdocl64.dll!clGetSamplerInfo+0x38d18
amdocl64.dll!clGetSamplerInfo+0x504e
amdocl64.dll!clGetSamplerInfo+0x5172
amdocl64.dll!clGetSamplerInfo+0x1ccf
kernel32.dll!BaseThreadInitThunk+0xd
ntdll.dll!RtlUserThreadStart+0x21

that is using ~19% of total CPU (76% of one core on this quad-core machine). The upper part of the stack changes between samples; it is not stuck in clIcdGetPlatformIDsKHR.

When I run a CPU-hungry program that consumes almost all CPU time and starves my program, this thread's CPU load drops back to almost nothing, but the GPU is no longer fed well: GPU load jumps between 70% and 98%, where it would normally be pegged at 100%.

After rolling back to Catalyst 13.3, the program's total CPU load is ~0.1-0.3%, and running a CPU hog has almost no effect on my program.

Is there anything special to be done on the newer drivers to make them leave the CPU alone? Is there any setting to get the CPU-behavior of the previous drivers?

My environment: HD5770+Phenom II X4 955, Win7-64. I got reports that the same happens with an APU and the integrated 6550D (also Win7-64).

Note: Making the final clEnqueueReadBuffer blocking, instead of calling a final clFinish, does not change the CPU load.

... and could someone please give me a hint on how to get proper code formatting in this forum? Thanks a lot!

Bdot
Adept III

Re: Cat13.4: How to avoid the high CPU load for GPU kernels?

One more detail: when I insert a clFinish between each kernel invocation, the CPU load issue is gone. (But then my program has a performance problem, and it becomes even more sensitive to other CPU-intensive applications.)

So far my kernels all use the same queue, with no synchronization events. Would a different setup help here?

himanshu_gautam
Grandmaster

Re: Cat13.4: How to avoid the high CPU load for GPU kernels?

You can try using more command queues (separate queues for kernel execution, data reads, and data writes). It is also advisable to use cl_events to synchronize between multiple command queues. I am not sure whether this will help in this scenario, though.

Please attach your code as a zipped file, which is handier if the code is longer than a few lines.

Bdot
Adept III

Re: Cat13.4: How to avoid the high CPU load for GPU kernels?

Hi, I finally managed to extend the HelloWorld example so that it shows this problem of high CPU load when running on a GPU. For that, I used two kernels, so that the output of the first is the input of the second. Both kernels do some serious calculations in order to consume some time (in this example the calculations are totally useless). Also, my real program takes care that the output of the second kernel is not overwritten by subsequent calls; the overwriting only happens here, in this simplification.

When running this test program, both kernels are scheduled alternately a couple of times. Then the main thread reaches the blocking clEnqueueReadBuffer that reads the final result. There it has to wait for the scheduled kernels to finish. While it waits, one CPU core is at 100%, as described in the initial post of this thread.

When I find out how, I will attach the source and test program ... Edit: Why is this "attach" link missing when writing the answer and appears only when editing the message? Very weird forum software 😕

Bdot
Adept III

Re: Cat13.4: How to avoid the high CPU load for GPU kernels?

Oh, it seems this is even easier to reproduce: as soon as more than a certain number (64? or maybe 128?) of kernels are scheduled, the load on one CPU core goes high. It can even be the same kernel, without any dependency. I think this is now clearly a bug in the drivers.

This also means that the following code can somewhat avoid this bug without sacrificing too much performance:

#define MODULUS 64

  /* outputEvent is a cl_event that must be NULL before the loop starts. */
  for (int i = 0; i < 256000; i++)
  {
    status = clEnqueueNDRangeKernel(commandQueue, createkernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);
    switch (i % MODULUS)
    {
      case 0:
        /* Attach an event to this kernel so we can wait for it later. */
        status = clEnqueueNDRangeKernel(commandQueue, outputkernel, 1, NULL, global_work_size, NULL, 0, NULL, &outputEvent);
        clFlush(commandQueue);
        break;
      case MODULUS/2:
        /* Half a period later: block until the marked kernel has finished. */
        status = clEnqueueNDRangeKernel(commandQueue, outputkernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);
        clFlush(commandQueue);
        status = clWaitForEvents(1, &outputEvent);
        status = clReleaseEvent(outputEvent);
        outputEvent = NULL;
        break;
      default:
        status = clEnqueueNDRangeKernel(commandQueue, outputkernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);
        break;
    }
  }

  /* Step 11: Read the output back to host memory (blocking read). */
  status = clEnqueueReadBuffer(commandQueue, outputBuffer, CL_TRUE, 0, strlength * sizeof(char), output, 0, NULL, NULL);
  if (outputEvent) status = clReleaseEvent(outputEvent);

The loop now tries to keep a few kernels "in flight", and it keeps the queue short by periodically waiting for one of the earlier kernels to finish. However, for short-running kernels the disadvantage is that the GPU can no longer be fully loaded, because the queue runs empty too frequently.

AMD, please fix the CPU-load problem so that workarounds like the one above are not necessary. Did I already mention that Catalyst 13.1 to 13.3 did not show this problem?
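To get a feeling for how short the loop above really keeps the queue, here is a small host-only model. It makes no OpenCL calls. It is only a sketch under stated assumptions: the queue is in-order, two kernels are enqueued per iteration as in the loop above, and in the worst case nothing is known to be finished until clWaitForEvents returns. The function name max_outstanding and its parameters are made up for this illustration.

```c
#include <assert.h>

/* Worst-case count of kernels that are enqueued but not yet known to be
 * finished, for a loop shaped like the workaround above.
 * Model assumptions (not measured driver behaviour):
 *   - in-order queue, two kernels enqueued per iteration
 *   - the kernel enqueued at i % modulus == 0 carries the event
 *   - at i % modulus == modulus / 2 the host waits for that event
 *   - nothing is counted as finished until that wait returns */
long max_outstanding(long iterations, long modulus)
{
    assert(modulus >= 2);

    long enqueued = 0;    /* kernels submitted so far */
    long completed = 0;   /* kernels known finished after the last wait */
    long event_mark = 0;  /* value of 'enqueued' when the event kernel went in */
    long worst = 0;

    for (long i = 0; i < iterations; i++) {
        enqueued += 2;                      /* createkernel + outputkernel */
        if (i % modulus == 0)
            event_mark = enqueued;          /* this outputkernel has the event */

        if (enqueued - completed > worst)   /* depth just before a possible wait */
            worst = enqueued - completed;

        if (i % modulus == modulus / 2)
            completed = event_mark;         /* the wait returned */
    }
    return worst;
}
```

Under these assumptions, max_outstanding(256000, 64) gives 192, that is 1.5 periods of 64 iterations with two kernels each. A smaller period lowers the bound (16 gives 48). The real queue depth will usually be lower, because the GPU keeps finishing kernels between the waits.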

himanshu_gautam
Grandmaster

Re: Cat13.4: How to avoid the high CPU load for GPU kernels?

Hi

I executed the program attached here.

My environment is a Trinity APU machine, Windows 7 64-bit, with Visual Studio.

I ran it from VS2010. The program hangs when clEnqueueReadBuffer() is called, and the CPU load is very low, around 0 to 1%.

Did you use a debugger or any other tools to analyse the performance?

If so, please let me know and I will check with those as well.

Bdot
Adept III

Re: Cat13.4: How to avoid the high CPU load for GPU kernels?

Thank you for checking on this issue. It is intended that the program takes quite some time to finish, and that it waits in clEnqueueReadBuffer. As APUs are often less powerful than discrete GPUs, you may want to reduce the number of loops that are run in the kernels.

However, if you already see only 0.1% CPU load while the program runs (I hope the GPU load is ~100%), then either the problem does not exist on APUs, or you are already running a driver newer than 13.10 that may not show the issue. To reproduce the high CPU load, I suggest using a discrete GPU with a Catalyst version between 13.4 and 13.10.

himanshu_gautam
Grandmaster

Re: Cat13.4: How to avoid the high CPU load for GPU kernels?

Yes, I tested with the 13.11 beta driver.

I will try to get a discrete GPU, test it again, and get back to you.

Bdot
Adept III

Re: Cat13.4: How to avoid the high CPU load for GPU kernels?

I just tried AMD_Catalyst_13.11_BetaV9.2 and the high CPU load is still reproducible on my HD5770.

Did you find the time to test this a bit more?

Thanks a lot for your help!

Bdot
Adept III

Re: Cat13.4: How to avoid the high CPU load for GPU kernels?

I also tried rebuilding it with APP SDK 2.9; the results are unchanged: scheduling a high number of kernels (~100 to 200 and more) leads to high CPU load on discrete GPUs.