Hello!
I get very high times for the clEnqueueNDRangeKernel() call itself when using the GPU. Look at the code below. I use CL_QUEUE_PROFILING_ENABLE on the command queue to do profiling on 'event', and (end - start) to measure the duration of the function call.
On CPU:
CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_QUEUED = 0.729372 ms
end - start = 0.003000 ms
On GPU:
CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_QUEUED = 0.797341 ms
end - start = 4.194000 ms
If I use big global sizes this extra time is irrelevant, and the GPU is faster than the CPU. But I want my implementation to be fast on the GPU even at small global sizes, and this large call overhead prevents that. Am I doing something wrong?
clFinish(hd->command_queue);
gettimeofday(&start, NULL);
status = clEnqueueNDRangeKernel(hd->command_queue, hd->kernel, 1, NULL,
                                &globalSize, &localSize, 0, NULL, event);
gettimeofday(&end, NULL);
clFinish(hd->command_queue);
Jeff,
Thanks for all this information. It's clear now: I had misunderstood the clEnqueueWriteBuffer() semantics. By the way, the fact that it's linked to a queue is a little disappointing. What's the point of specifying the queue (and thus implicitly the device) if you don't actually transfer data to the device?
About your implementation: why couldn't clEnqueueNDRangeKernel() return immediately and let the OpenCL runtime manage all that work (allocation, transfers, ...) in the background? I mean, shouldn't clEnqueueNDRangeKernel() be totally asynchronous?
I know about the wait list, the blocking flag, and the event dependencies; I already tried all the combinations 🙂
By the way, I believe that in NVIDIA's implementation clEnqueueWriteBuffer() directly copies the memory to the GPU. The oclBandwidthTest they supply measures only the time for clEnqueueReadBuffer():
// allocate device memory
cmDevData = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, memSize, NULL, &ciErrNum);
oclCheckError(ciErrNum, CL_SUCCESS);
// ...
// ...
for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++) {
ciErrNum = clEnqueueReadBuffer(cqCommandQueue, cmDevData, CL_FALSE, 0, memSize, h_data, 0, NULL, NULL);
oclCheckError(ciErrNum, CL_SUCCESS);
}
ciErrNum = clFinish(cqCommandQueue);
oclCheckError(ciErrNum, CL_SUCCESS);
Does that mean that on an AMD platform it only measures the same thing a memcpy would, with the data staying on the host the whole time?
IMHO one reason is that it can return a CL_OUT_OF_RESOURCES error if the buffers don't fit into device memory.
The reason clEnqueueWriteBuffer() is linked to a command queue is that it allows for asynchronous behavior. For example, you could use one command queue for kernel dispatches and another for copies. Also, if it weren't linked to a command queue, how would you control which device to update?
What happens if I call clEnqueueWriteBuffer() on a queue linked to one device and then enqueue a kernel on a queue associated with another device? (assuming both queues belong to the same context)
Then the runtime will have to update the second device with data from the first.
There's something I don't understand: you told me that you won't allocate anything on a device until a kernel "touches" it. So I don't understand your claim about the queue.
Updating the data with a copy counts as the device touching the memory. It doesn't have to be a kernel operation.