obara
Journeyman III

OpenCL API execution time is too long

My system executes the following OpenCL API calls very frequently (10,000 times or more).

----------------------------------------------

clEnqueueTask(command_queue, kernel, 0, NULL,&event);  (1us)

clWaitForEvents(1, &event);                          (100us)

----------------------------------------------

__kernel void add(__global int* A, __global float* B, __global float* C)

{

  *C = *A + *B;

}

----------------------------------------------------

But there is a serious problem: the OpenCL API execution time is too long.

For example, clEnqueueTask takes 1 us per call, and the clWaitForEvents call that follows takes 100 us per call.

How can I reduce the API execution time?

0 Likes
4 Replies
gopal
Staff

Can you elaborate a little more? It is not clear from this what you want to ask.

According to your data, the clEnqueueTask() call takes less time than the clWaitForEvents() call. On what basis are you saying that the OpenCL execution time (in this case the clEnqueueTask() timing) is too long?

0 Likes

Hi ratul

Thank you for your reply.

I will explain the background of my question below.

I am trying to use an APU (AMD A10-7850A) for database aggregate computations.

So OpenCL APIs (like "clEnqueueTask") are called 100,000,000 times.

GPGPU has excellent computing capability, but PCIe is the bottleneck.

For that reason, I am trying to use an APU (AMD A10-7850A).

This time, I measured the performance of the AMD A10-7850A for database aggregate computations.

But the current result is that the GPU in the APU (AMD A10-7850A) is far worse than the CPU,

because the OpenCL APIs take a lot of processing time.

For example,

    clCreateBuffer:40us

    clSetKernelArg:30us

    clEnqueueTask & clWaitForEvents:100us

Our DB system calls these APIs 100,000,000 times.

Judging by the estimated time for the database aggregate computations,

the combined CPU/GPU system is 100 times slower than the CPU alone.

I think this is caused by the OpenCL API processing time.

I expected that on an APU the OpenCL API processing time would be very small, but it isn't.

I want to know how to reduce the OpenCL API processing time,

especially on the APU (AMD A10-7850A); I think there should be many possible optimizations.
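
For a sense of scale, the per-call figures above can be multiplied out. This is only a rough sketch of the arithmetic: it assumes every iteration pays each of the three costs once and that nothing overlaps, and the function name `total_api_seconds` is made up for illustration.

```c
#include <stdint.h>

/* Total host-side API time in seconds, assuming each iteration pays every
   per-call cost once and nothing overlaps (a simplification, not a
   measurement). All per-call costs are given in microseconds. */
double total_api_seconds(uint64_t iterations, double create_buffer_us,
                         double set_kernel_arg_us, double enqueue_and_wait_us)
{
    double per_iteration_us =
        create_buffer_us + set_kernel_arg_us + enqueue_and_wait_us;
    return (double)iterations * per_iteration_us / 1e6;
}
```

With the figures above, `total_api_seconds(100000000, 40.0, 30.0, 100.0)` comes to 17,000 seconds, roughly 4.7 hours, before any actual computation is counted.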

0 Likes

1. @Because OpenCL-APIs takes much processing time.

For example,

    clCreateBuffer:40us

    clSetKernelArg:30us

    clEnqueueTask & clWaitForEvents:100us

First, tell me how you are measuring the time of these API calls.


2. @ "CPU/GPU complex system is 100 times slower than CPU only. I think it's caused by OpenCL-APIs processing time."

Second, how are you comparing the CPU and CPU/GPU execution times? I mean, how are you calculating these times?
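
For reference, one common way to get such numbers is to wrap the call in a wall-clock timer and average over many repetitions. The following is a minimal sketch, not necessarily how the figures above were obtained: `average_call_us` and `call_under_test` are hypothetical names, and it uses C11 `timespec_get`, which is not a monotonic clock, so results are approximate. Note that timing clEnqueueTask alone measures only the cost of queuing the command, while the time spent in clWaitForEvents also includes the launch latency and the kernel execution itself.

```c
#include <time.h>

/* Average wall-clock time of one call to fn(arg), in microseconds,
   measured over `repeats` back-to-back calls. */
double average_call_us(void (*fn)(void *), void *arg, unsigned int repeats)
{
    struct timespec t0, t1;
    if (repeats == 0)
        return 0.0;
    timespec_get(&t0, TIME_UTC);
    for (unsigned int i = 0; i < repeats; ++i)
        fn(arg);
    timespec_get(&t1, TIME_UTC);
    double elapsed_us = (double)(t1.tv_sec - t0.tv_sec) * 1e6
                      + (double)(t1.tv_nsec - t0.tv_nsec) / 1e3;
    return elapsed_us / repeats;
}

/* Hypothetical stand-in for the call being measured, e.g. a wrapper that
   calls clEnqueueTask followed by clWaitForEvents. Here it only bumps a
   counter so the sketch runs without an OpenCL device. */
void call_under_test(void *arg)
{
    volatile unsigned long *counter = arg;
    *counter += 1;
}
```

To separate device time from host overhead, OpenCL's own event profiling (clGetEventProfilingInfo on a queue created with CL_QUEUE_PROFILING_ENABLE) is the usual complement to host-side timing.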

Thanks,

0 Likes
maxdz8
Elite

Hello Obara, I've also noticed some overhead in kernel dispatch. I have a kernel which takes quite some time to run. It has an internal loop over a known constant, so I tinkered a bit with "unrolling the loop" on the host side.

It turned out that I would saturate a core with about 1k dispatches per second (EnqueueNDRangeKernel).

You probably cannot observe any high CPU usage, because the wait forces a full stop-and-wait on the GPU, but this is very inefficient. You should absolutely try to "batch" (graphics jargon) more data into each call. It is my understanding that clEnqueueTask should really not be used (it is deprecated in CL2 and removed from the specification). If your kernel is a simple add, setting it up is going to take much, much more time than just doing the work, but I take for granted that this is just an example.
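
To make the batching idea concrete, here is a rough sketch of the add kernel rewritten to process `n` elements per dispatch instead of one. The names `add_batch`, `n`, `emulated_gid` and `run_add_batch_on_host` are made up for illustration, and it assumes A, B and C are buffers holding at least `n` elements. The `#ifndef __OPENCL_VERSION__` parts are only a shim so the same source also compiles as plain C for a quick check on the host; an OpenCL compiler skips them.

```c
#ifndef __OPENCL_VERSION__
/* Plain-C shim (illustrative only): stub out the OpenCL qualifiers and
   emulate get_global_id so the kernel body can be exercised on the host. */
#include <stddef.h>
#define __kernel
#define __global
static size_t emulated_gid;
size_t get_global_id(unsigned int dim) { (void)dim; return emulated_gid; }
#endif

/* One work-item per element: C[i] = A[i] + B[i] for i in [0, n). */
__kernel void add_batch(__global const int *A, __global const float *B,
                        __global float *C, unsigned int n)
{
    size_t i = get_global_id(0);
    if (i < n)   /* guard in case the global size is rounded up past n */
        C[i] = A[i] + B[i];
}

#ifndef __OPENCL_VERSION__
/* Host-side stand-in for a single NDRange dispatch with global size n. */
void run_add_batch_on_host(const int *A, const float *B, float *C,
                           unsigned int n)
{
    for (emulated_gid = 0; emulated_gid < n; ++emulated_gid)
        add_batch(A, B, C, n);
}
#endif
```

On a real device this would be launched once with clEnqueueNDRangeKernel and a global work size of `n`, so the dispatch and wait cost is paid once per batch rather than once per element.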

If you're using an in-order queue, just wait on the last task; you should already get some improvement, with an accompanying CPU spike.

0 Likes