
eklund_n
Journeyman III

clEnqueueNDRangeKernel(), different function call time on CPU and GPU

not the actual execution time, just the function call

Hello!

I get very high times for the clEnqueueNDRangeKernel() call itself when using the GPU. See the code below: CL_QUEUE_PROFILING_ENABLE is set on the command queue so I can profile 'event', and (end - start) gives the wall-clock time of the function call.

On CPU:
CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_QUEUED = 0.729372 ms
end - start = 0.003000 ms

On GPU:
CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_QUEUED = 0.797341 ms
end - start = 4.194000 ms

If I use big global sizes this extra time is irrelevant, and the GPU is faster than the CPU. But I want my implementation to be fast on the GPU even at small global sizes, and this large function call time prevents that. Am I doing something wrong?

clFinish(hd->command_queue);
gettimeofday(&start, NULL);
status = clEnqueueNDRangeKernel(hd->command_queue, hd->kernel, 1, NULL, &globalSize, &localSize, 0, NULL, event);
gettimeofday(&end, NULL);
clFinish(hd->command_queue);
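For reference, a minimal sketch of how both measurements above can be taken together, assuming the queue was created with CL_QUEUE_PROFILING_ENABLE and using illustrative names (queue, kernel, globalSize, localSize) in place of the hd-> handles:

#include <stdio.h>
#include <sys/time.h>
#include <CL/cl.h>

// Sketch: wall-clock time of the enqueue call vs. the event's queued->end span.
struct timeval start, end;
cl_event evt;
cl_ulong t_queued = 0, t_end = 0;

clFinish(queue);                                  // drain previously queued work
gettimeofday(&start, NULL);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize, 0, NULL, &evt);
gettimeofday(&end, NULL);                         // times only the call itself
clFinish(queue);                                  // wait for the kernel to complete

clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED, sizeof(t_queued), &t_queued, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(t_end), &t_end, NULL);

double call_ms  = (end.tv_sec - start.tv_sec) * 1e3 + (end.tv_usec - start.tv_usec) * 1e-3;
double event_ms = (double)(t_end - t_queued) * 1e-6;   // profiling timestamps are in ns
printf("call: %f ms, queued->end: %f ms\n", call_ms, event_ms);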


Jeff,

Thanks for all this information. It's clear now: I had misunderstood the clEnqueueWriteBuffer() semantics. By the way, the fact that it's linked to a queue is a little disappointing. What's the point of specifying the queue (and thus implicitly the device) if the data isn't really transferred to the device?

About your implementation, why couldn't clEnqueueNDRangeKernel() return immediately and let the OpenCL runtime manage all of that (allocation, transfers, ...) in the background? I mean, shouldn't clEnqueueNDRangeKernel() be totally asynchronous?

I know about the wait, the blocking flag and the event dependency; I just tried all the combinations 🙂


By the way, I believe that in the Nvidia implementation clEnqueueWriteBuffer() seems to copy the memory directly to the GPU. The oclBandwidthTest they supply only measures the time of clEnqueueReadBuffer():

// allocate device memory
cmDevData = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, memSize, NULL, &ciErrNum);
oclCheckError(ciErrNum, CL_SUCCESS);

// ...

for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++) {
  ciErrNum = clEnqueueReadBuffer(cqCommandQueue, cmDevData, CL_FALSE, 0, memSize, h_data, 0, NULL, NULL);
  oclCheckError(ciErrNum, CL_SUCCESS);
}

ciErrNum = clFinish(cqCommandQueue);
oclCheckError(ciErrNum, CL_SUCCESS);

Does that mean that on an AMD platform it only measures the same thing a memcpy would, with the data staying on the host the whole time?


IMHO one reason is that it can return a CL_OUT_OF_RESOURCES error if the buffers don't fit into device memory.


The reason clEnqueueWriteBuffer() is linked to a command queue is that it allows for asynchronous behavior. For example, you could use one command queue for kernel dispatches and another for copies. Also, if it weren't linked to a command queue, how would you control which device to update?
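A minimal sketch of that pattern, with hypothetical names (ctx, dev, buf, size, hostPtr, kernel, gws, lws are assumed to exist already): a dedicated copy queue and a dedicated kernel queue on the same context and device, ordered by an event so the transfer can overlap with other work.

cl_int err;
cl_command_queue copyQ   = clCreateCommandQueue(ctx, dev, 0, &err);
cl_command_queue kernelQ = clCreateCommandQueue(ctx, dev, 0, &err);

// Non-blocking write goes on the copy queue and signals an event when done.
cl_event writeDone;
err = clEnqueueWriteBuffer(copyQ, buf, CL_FALSE, 0, size, hostPtr, 0, NULL, &writeDone);

// The kernel is dispatched on the other queue but waits on the write via the event.
err = clEnqueueNDRangeKernel(kernelQ, kernel, 1, NULL, &gws, &lws, 1, &writeDone, NULL);
clFinish(kernelQ);

Because each enqueue names a queue (and thus a device), the runtime knows which device the data must end up on, even if the actual transfer is deferred.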


What happens if I call clEnqueueWriteBuffer() on a queue linked to one device and then enqueue a kernel on a queue associated with another device? (assuming both queues belong to the same context)


Then the runtime will have to update the second device with data from the first.
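As an illustration of that case (all handles hypothetical: devs, size, hostPtr, kernel, gws): one context created over two devices, a write enqueued on the first device's queue, and a kernel enqueued on the second device's queue. The event only orders the two commands; moving the data between the devices is left to the runtime, as described above.

cl_int err;
cl_context ctx = clCreateContext(NULL, 2, devs, NULL, NULL, &err);   // devs[0], devs[1]
cl_command_queue qA = clCreateCommandQueue(ctx, devs[0], 0, &err);
cl_command_queue qB = clCreateCommandQueue(ctx, devs[1], 0, &err);
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);

cl_event written;
clEnqueueWriteBuffer(qA, buf, CL_FALSE, 0, size, hostPtr, 0, NULL, &written);

// Same buffer, different device: the runtime must make the data visible on devs[1].
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
clEnqueueNDRangeKernel(qB, kernel, 1, NULL, &gws, NULL, 1, &written, NULL);
clFinish(qB);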


There's something I don't understand: you told me that nothing is allocated on a device until a kernel "touches" it. So I don't understand your claim about the queue.


Updating the data with a copy counts as the device touching the memory.  It doesn't have to be a kernel operation.
