cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

The_G
Journeyman III

Submitting commands is slow

submit time >> work time

Hi,

i'm experiencing some weird performance problems in my program. What the host code does is calling 3 kernels in a loop, writing two input buffers at the beginning and one output buffer at the end of each iteration.

The cltrace of one iteration is attached. You can see the problem most notably for the first buffer write. The write itself (COMMAND_END - COMMAND_START) takes 0.4ms, but the time from queuing to submission (COMMAND_SUBMIT - COMMAND_QUEUE) is 8.3ms! The other commands take about 1.1ms to submit, more than the working time f...

In synthetic test cases the submit time is around 0.01ms, what looks much more plausible.

So what can slow things down in such a way?

 

All tests run on a 5870, using SDK 2.3

 

20 clGetMemObjectInfo 4186701410374 4186701412819 54 clEnqueueWriteBuffer 4186701418197 4186701429441 4596 CL_COMMAND_WRITE_BUFFER 4186701426019 4186709736643 4186709872368 4186710280613 0 0x04B504F0 0 0x04B4D788 Cypress 22848 20 clGetMemObjectInfo 4186709139709 4186709142642 20 clGetMemObjectInfo 4186709146553 4186709148509 54 clEnqueueWriteBuffer 4186709513220 4186709523975 4596 CL_COMMAND_WRITE_BUFFER 4186709521042 4186710152686 4186710280612 4186710979352 0 0x04B504F0 0 0x04B4D788 Cypress 38400 17 clRetainMemObject 4186709528375 4186709530331 39 clSetKernelArg 4186709533264 4186709535709 18 clReleaseMemObject 4186709538642 4186709540598 17 clRetainMemObject 4186709543042 4186709545486 39 clSetKernelArg 4186709547442 4186709549886 18 clReleaseMemObject 4186709552331 4186709554286 17 clRetainMemObject 4186709556731 4186709558686 39 clSetKernelArg 4186709561131 4186709563086 18 clReleaseMemObject 4186709565531 4186709567486 17 clRetainMemObject 4186709569931 4186709572375 39 clSetKernelArg 4186709574331 4186709576775 18 clReleaseMemObject 4186709579220 4186709581175 66 clEnqueueNDRangeKernel 4186709584109 4186709594375 4592 CL_COMMAND_NDRANGE_KERNEL 4186709591931 4186710702687 4186710978864 4186711075167 0 0x04B504F0 0 0x04B4D788 Cypress 0x04B658D0 KernelA {4800} {NULL} 17 clRetainMemObject 4186709598286 4186709600731 39 clSetKernelArg 4186709602686 4186709605131 18 clReleaseMemObject 4186709607575 4186709609531 17 clRetainMemObject 4186709611975 4186709613931 39 clSetKernelArg 4186709616375 4186709618331 18 clReleaseMemObject 4186709620775 4186709623220 17 clRetainMemObject 4186709625175 4186709627620 39 clSetKernelArg 4186709629575 4186709632020 18 clReleaseMemObject 4186709634464 4186709636420 17 clRetainMemObject 4186709638864 4186709640820 39 clSetKernelArg 4186709643264 4186709645220 18 clReleaseMemObject 4186709648153 4186709650109 66 clEnqueueNDRangeKernel 4186709652553 4186709659886 4592 CL_COMMAND_NDRANGE_KERNEL 4186709657442 4186710797042 4186711074189 4186711178058 0 0x04B504F0 0 0x04B4D788 Cypress 0x04B65A10 KernelB {4800} {NULL} 17 clRetainMemObject 4186709662820 4186709664775 39 clSetKernelArg 4186709667220 4186709669664 18 clReleaseMemObject 4186709671620 4186709674064 17 clRetainMemObject 4186709676020 4186709678464 39 clSetKernelArg 4186709680420 4186709682864 18 clReleaseMemObject 4186709685309 4186709687264 17 clRetainMemObject 4186709689709 4186709691664 39 clSetKernelArg 4186709694109 4186709696064 18 clReleaseMemObject 4186709698509 4186709700464 66 clEnqueueNDRangeKernel 4186709703398 4186709710731 4592 CL_COMMAND_NDRANGE_KERNEL 4186709708286 4186710850819 4186711178057 4186711215975 0 0x04B504F0 0 0x04B4D788 Cypress 0x04B65810 KernelC {4800} {NULL} 52 clEnqueueReadBuffer 4186709713664 4186712140020 4595 CL_COMMAND_READ_BUFFER 4186709718064 4186710888953 4186711215975 4186712029042 0 0x04B504F0 0 0x04B4D788 Cypress 998400

0 Likes
5 Replies
nou
Exemplar

look at end start of adjacent of enqueue commands. you can see that they start one nanosecond or even ~400 nanosecond earlyer than end previous end.

so i can say that you observe a latency in execution but GPU are busy 100% of time. try enqueue multiple loop of computation to hide this latency.

0 Likes
The_G
Journeyman III

Oh yes i see. That explains the times between commands in one iteration. But i still have a problem with the first write in each iteration. Could you please look at the attached transition of one iteration into the next? After the OpenCL stuff the CPU has some work to do in each iteration so there is a major idle phase for the GPU between these calls.

The write is enqueued more than 10ms after the last read is done but it still needs its 7-8ms to start the write.

52 clEnqueueReadBuffer 4181634084220 4181636542353 4595 CL_COMMAND_READ_BUFFER 4181634088620 4181635317198 4181635657313 4181636440665 0 0x04B504F0 0 0x04B4D788 Cypress 995904 20 clGetMemObjectInfo 4181650341733 4181650344666 54 clEnqueueWriteBuffer 4181650349555 4181650360311 4596 CL_COMMAND_WRITE_BUFFER 4181650357377 4181657932222 4181658078911 4181658552651 0 0x04B504F0 0 0x04B4D788 Cypress 22848

0 Likes

The G,

It is generally not recommended to profile the clCommands without a following clFinish.

Here if you take END - Submit = .620429 ms.

But Queued - END = 8.195274 ms.

So the command is actually waiting in queue for execution for about 7.5ms.

For more information see the OpenCL Spec1.1.

0 Likes

But why the command is waiting so long? I've added a new log. It's again a transition between to iterations, now with a clFinish at the end of the first iteration.

You can see that there is a idle time of 2.5ms between the clFinish and the clGetMemObjectInfo. So the queue has to be empty and the GPU should be idle. But again the write buffer command stucks:

COMMAND_START - COMMAND_QUEUED = 6.26ms

What the hell is it doing? I can't find any hint why it doesn't just start.

52 clEnqueueReadBuffer 8229970235183 8229970242516 4595 CL_COMMAND_READ_BUFFER 8229970240072 8229970255717 8229970427200 8229970651717 0 0x042204F0 0 0x0421D788 Cypress 32688 51 clFinish 8229970245450 8229970691316 20 clGetMemObjectInfo 8229973142117 8229973145539 54 clEnqueueWriteBuffer 8229973149939 8229973160206 4596 CL_COMMAND_WRITE_BUFFER 8229973157272 8229974224028 8229979419127 8229979433472 0 0x042204F0 0 0x0421D788 Cypress 22848

0 Likes

Command queue profiling adds a lot of overhead and should only be used to measure elasped time of an individual command, not to measure overall overhead.

Jeff

0 Likes