Hi all,
I'm new to OpenCL development and I want to understand some of its mechanics.
I have a simple matrix multiplication kernel, and I want to see the impact of the blocking option of the clEnqueue* calls.
So I run the computation once with blocking write & read, and once non-blocking.
When I look at the profiling timestamps, the blocking version shows the expected sequential order (enqueue, submit, kernel start, kernel end), but in the non-blocking version the kernel appears to execute before it is submitted and queued.
Can someone explain this behaviour to me? Thank you very much.
A blocking call is equivalent to a clFinish() right after that call, so the command finishes executing before the call returns. Unless you use an out-of-order queue (which is not supported), you will always see execution in order.
lenjyco wrote:
kernel end), but in the non-blocking version the kernel appears to execute before it is submitted and queued.
Can someone explain this behaviour to me? Thank you very much.
Getting a negative number for (END - START) is certainly strange. You can submit a test case for that.
Also, you should notice that a non-blocking call returns very quickly (just after queueing your command), since the execution starts at some later time. In general this should result in a very high submit time (SUBMIT - QUEUED); the other intervals should be more or less comparable.
Hey,
thanks for the response.
@nou: I'm using an IN-ORDER queue right now; I will test OUT-OF-ORDER afterwards.
For example, I read the four different timestamps (queued, submitted, started, and ended) in my non-blocking version, and my output gives:
- Queue = 27095955722029
- Submit = 27095177522023
- Start = 27095177553977
- End = 27095946659941
(Sorry for the use of raw data)
You can observe that the execution starts before it is queued; it makes no sense to me, but I'm sure I'm missing something.
Regards,
OK. cl_event counters are very trustworthy. Can you attach your test case? I can forward it to the AMD engineering team.
For now, use high-precision CPU timers only.
"Can you attach you testcase" : You mean that you want my code where i get those profiling time ?
Yes, I mean: can you attach a minimal host code + kernel that gives such weird counter values? By the way, in my last reply I meant that the counters are NOT very trustworthy. Sorry for the typo.
I would suggest you use high-precision CPU timers only. But AMD would certainly want to make these counters reliable, and your test case can help us greatly.
Here is my kernel:
kernel void matrixMultiplication(__global float* A, __global float* B, __global float* C, int widthA, int widthB, int k)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    float value = 0;
    for (int tmp = 0; tmp < k; tmp++)
    {
        for (int x = 0; x < widthA; x++)
        {
            value = value + A[x + j * widthA] * B[x * widthB + i];
        }
        C[i + widthA * j] = value;
    }
}
and here is my host code; it is a little heavy, so I have only attached the enqueueing instructions:
oclCopyHostToDeviceSynch(queue_cpu, A, widthA * heightA * sizeof (float), a_in_cpu);
oclCopyHostToDeviceSynch(queue_cpu, B, widthB * heightB * sizeof (float), b_in_cpu);
//EXECUTE the kernel
ret = clEnqueueNDRangeKernel(queue_cpu, kernel_cpu, 2, NULL, gWorkSize, lWorkSize, 0, NULL, &event_list_execute[0]);
if (checkCLError(ret, __FILE__, __FUNCTION__, __LINE__) == 1)return 1;
//Wait for the end of computation; used to get the execution time
clWaitForEvents(1, event_list_execute);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_QUEUED, sizeof (cl_ulong), &time_buff[0], NULL);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_SUBMIT, sizeof (cl_ulong), &time_buff[1], NULL);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_START, sizeof (cl_ulong), &time_buff[2], NULL);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_END, sizeof (cl_ulong), &time_buff[3], NULL);
//WRITE the data
oclCopyHostToDeviceSynch(queue_gpu, A, widthA * heightA * sizeof (float), a_in_gpu);
oclCopyHostToDeviceSynch(queue_gpu, B, widthB * heightB * sizeof (float), b_in_gpu);
//EXECUTE the kernel
ret = clEnqueueNDRangeKernel(queue_gpu, kernel_gpu, 2, NULL, gWorkSize, lWorkSize, 0, NULL, &event_list_execute[0]);
if (checkCLError(ret, __FILE__, __FUNCTION__, __LINE__) == 1)return 1;
//Wait for the end of computation
clWaitForEvents(1, &event_list_execute[0]);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_QUEUED, sizeof (cl_ulong), &time_buff[0], NULL);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_SUBMIT, sizeof (cl_ulong), &time_buff[1], NULL);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_START, sizeof (cl_ulong), &time_buff[2], NULL);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_END, sizeof (cl_ulong), &time_buff[3], NULL);
oclCopyHostToDeviceAsynch(queue_cpu, A, widthA * heightA * sizeof (float), a_in_cpu, &event_list_write_cpu[0]);
oclCopyHostToDeviceAsynch(queue_cpu, B, widthB * heightB * sizeof (float), b_in_cpu, &event_list_write_cpu[1]);
oclCopyHostToDeviceAsynch(queue_gpu, A, widthA * heightA * sizeof (float), a_in_gpu, &event_list_write_gpu[0]);
oclCopyHostToDeviceAsynch(queue_gpu, B, widthB * heightB * sizeof (float), b_in_gpu, &event_list_write_gpu[1]);
//Wait for the end of writing
clWaitForEvents(2, event_list_write_gpu);
clWaitForEvents(2, event_list_write_cpu);
//EXECUTE the kernel
ret = clEnqueueNDRangeKernel(queue_gpu, kernel_gpu, 2, NULL, gWorkSize, lWorkSize, 0, NULL, &event_list_execute[0]);
if (checkCLError(ret, __FILE__, __FUNCTION__, __LINE__) == 1)return 1;
ret = clEnqueueNDRangeKernel(queue_cpu, kernel_cpu, 2, NULL, gWorkSize, lWorkSize, 0, NULL, &event_list_execute[1]);
if (checkCLError(ret, __FILE__, __FUNCTION__, __LINE__) == 1)return 1;
//Wait for the end of computation
clWaitForEvents(2, event_list_execute);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_QUEUED, sizeof (cl_ulong), &time_buff[0], NULL);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_SUBMIT, sizeof (cl_ulong), &time_buff[1], NULL);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_START, sizeof (cl_ulong), &time_buff[2], NULL);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_END, sizeof (cl_ulong), &time_buff[3], NULL);
for (int i = 0; i < 4; i++)
    profiling_time_tab[2][i] = time_buff[i];
clGetEventProfilingInfo(event_list_execute[1], CL_PROFILING_COMMAND_QUEUED, sizeof (cl_ulong), &time_buff[0], NULL);
clGetEventProfilingInfo(event_list_execute[1], CL_PROFILING_COMMAND_SUBMIT, sizeof (cl_ulong), &time_buff[1], NULL);
clGetEventProfilingInfo(event_list_execute[1], CL_PROFILING_COMMAND_START, sizeof (cl_ulong), &time_buff[2], NULL);
clGetEventProfilingInfo(event_list_execute[1], CL_PROFILING_COMMAND_END, sizeof (cl_ulong), &time_buff[3], NULL);
oclCopyHostToDeviceAsynch and oclCopyHostToDeviceSynch are just wrapper functions that make the code easier to read; the only difference is that the Asynch version is non-blocking while the Synch version is blocking.
I've found the error; I think it may interest someone. If you use clWaitForEvents with a list of events that belong to different contexts, an error occurs (I had forgotten to check the return value).
Thanks for coming back on this... You saved our time for sure. Secondly, it is also a great help to the community.
Yes, a context owns buffers, images (data), kernels (code), and events (synchronization)... You cannot do cross-context stuff like this. Thanks for coming back.
- Bruhaspati