Hi all,
I'm new to OpenCL development and I want to understand some of its mechanics.
I have a simple matrix multiplication kernel, and I want to see the impact of the blocking option of the clEnqueue* calls.
So I run the computation once with blocking write & read, and once non-blocking.
When I look at the profiling timestamps, the blocking version shows the expected sequential order (enqueue, submit, kernel start, kernel end), but in the non-blocking version the kernel appears to execute before it is submitted and queued.
Can someone explain this behaviour to me? Thank you very much.
A blocking call is equivalent to a clFinish() right after that call, so the command finishes executing before the call returns. Unless you use an out-of-order queue (which is not supported), you will always see execution in order.
lenjyco wrote:
kernel end), but in the non-blocking version the kernel appears to execute before it is submitted and queued.
Can someone explain this behaviour to me? Thank you very much.
Getting a negative number for (END - START) is certainly strange. You can submit a test case for that.
Also, you should notice that a non-blocking call returns very quickly (just after queueing your command), since the execution starts at some later time. In general this should result in a very high submit time (SUBMIT - QUEUED); the other intervals should be more or less comparable.
Hey,
thanks for the response.
@nou: I'm using an IN-ORDER queue right now; I will test OUT-OF-ORDER afterwards.
For example, I read the four different timestamps (queued, submitted, started, and ended) in my non-blocking version, and my output gives:
- Queue = 27095955722029
- Submit = 27095177522023
- Start = 27095177553977
- End = 27095946659941
(Sorry for the use of raw data)
You can observe that the execution starts before it is queued; it makes no sense to me, but I'm sure I'm missing something.
Regards,
OK. cl_event counters are very trustworthy. Can you attach your test case? I can forward it to the AMD engineering team.
For now, use high-precision CPU timers only.
"Can you attach you testcase" : You mean that you want my code where i get those profiling time ?
Yes, I mean: can you attach a minimal host code + kernel that gives such weird counter values? By the way, in my last reply I meant that the counters are NOT very trustworthy. Sorry for the typo.
I would suggest you use high-precision CPU timers only. But AMD would certainly want to make these counters reliable, and your test case can help us greatly.
Here is my kernel:
kernel void matrixMultiplication(__global float* A, __global float* B, __global float* C, int widthA, int widthB, int k)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    float value = 0;
    for (int tmp = 0; tmp < k; tmp++)
    {
        for (int x = 0; x < widthA; x++)
        {
            value = value + A[x + j * widthA] * B[x * widthB + i];
        }
        C[i + widthA * j] = value;
    }
}
and here is my host code; it is a little heavy, so I have only attached the enqueueing instructions:
oclCopyHostToDeviceSynch(queue_cpu, A, widthA * heightA * sizeof (float), a_in_cpu);
oclCopyHostToDeviceSynch(queue_cpu, B, widthB * heightB * sizeof (float), b_in_cpu);
//EXECUTE the kernel
ret = clEnqueueNDRangeKernel(queue_cpu, kernel_cpu, 2, NULL, gWorkSize, lWorkSize, 0, NULL, &event_list_execute[0]);
if (checkCLError(ret, __FILE__, __FUNCTION__, __LINE__) == 1)return 1;
//Wait for the end of computation; used to get the execution time
clWaitForEvents(1, event_list_execute);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_QUEUED, sizeof (cl_ulong), &time_buff[0], NULL);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_SUBMIT, sizeof (cl_ulong), &time_buff[1], NULL);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_START, sizeof (cl_ulong), &time_buff[2], NULL);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_END, sizeof (cl_ulong), &time_buff[3], NULL);
//WRITE the data
oclCopyHostToDeviceSynch(queue_gpu, A, widthA * heightA * sizeof (float), a_in_gpu);
oclCopyHostToDeviceSynch(queue_gpu, B, widthB * heightB * sizeof (float), b_in_gpu);
//EXECUTE the kernel
ret = clEnqueueNDRangeKernel(queue_gpu, kernel_gpu, 2, NULL, gWorkSize, lWorkSize, 0, NULL, &event_list_execute[0]);
if (checkCLError(ret, __FILE__, __FUNCTION__, __LINE__) == 1)return 1;
//Wait for the end of computation
clWaitForEvents(1, &event_list_execute[0]);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_QUEUED, sizeof (cl_ulong), &time_buff[0], NULL);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_SUBMIT, sizeof (cl_ulong), &time_buff[1], NULL);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_START, sizeof (cl_ulong), &time_buff[2], NULL);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_END, sizeof (cl_ulong), &time_buff[3], NULL);
oclCopyHostToDeviceAsynch(queue_cpu, A, widthA * heightA * sizeof (float), a_in_cpu, &event_list_write_cpu[0]);
oclCopyHostToDeviceAsynch(queue_cpu, B, widthB * heightB * sizeof (float), b_in_cpu, &event_list_write_cpu[1]);
oclCopyHostToDeviceAsynch(queue_gpu, A, widthA * heightA * sizeof (float), a_in_gpu, &event_list_write_gpu[0]);
oclCopyHostToDeviceAsynch(queue_gpu, B, widthB * heightB * sizeof (float), b_in_gpu, &event_list_write_gpu[1]);
//Wait for the end of writing
clWaitForEvents(2, event_list_write_gpu);
clWaitForEvents(2, event_list_write_cpu);
//EXECUTE the kernel
ret = clEnqueueNDRangeKernel(queue_gpu, kernel_gpu, 2, NULL, gWorkSize, lWorkSize, 0, NULL, &event_list_execute[0]);
if (checkCLError(ret, __FILE__, __FUNCTION__, __LINE__) == 1)return 1;
ret = clEnqueueNDRangeKernel(queue_cpu, kernel_cpu, 2, NULL, gWorkSize, lWorkSize, 0, NULL, &event_list_execute[1]);
if (checkCLError(ret, __FILE__, __FUNCTION__, __LINE__) == 1)return 1;
//Wait for the end of computation
clWaitForEvents(2, event_list_execute);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_QUEUED, sizeof (cl_ulong), &time_buff[0], NULL);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_SUBMIT, sizeof (cl_ulong), &time_buff[1], NULL);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_START, sizeof (cl_ulong), &time_buff[2], NULL);
clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_END, sizeof (cl_ulong), &time_buff[3], NULL);
for (int i = 0; i < 4; i++)
    profiling_time_tab[2][i] = time_buff[i];
clGetEventProfilingInfo(event_list_execute[1], CL_PROFILING_COMMAND_QUEUED, sizeof (cl_ulong), &time_buff[0], NULL);
clGetEventProfilingInfo(event_list_execute[1], CL_PROFILING_COMMAND_SUBMIT, sizeof (cl_ulong), &time_buff[1], NULL);
clGetEventProfilingInfo(event_list_execute[1], CL_PROFILING_COMMAND_START, sizeof (cl_ulong), &time_buff[2], NULL);
clGetEventProfilingInfo(event_list_execute[1], CL_PROFILING_COMMAND_END, sizeof (cl_ulong), &time_buff[3], NULL);
oclCopyHostToDeviceAsynch and oclCopyHostToDeviceSynch are just wrapper functions that make the code easier to read; the only difference is that the Asynch version is non-blocking while the Synch version is blocking.
I've found the error; I think it may interest someone. If you use clWaitForEvents with a list of events that belong to different contexts, an error occurs (I had forgotten to check the return value).
Thanks for coming back on this... You saved our time for sure. Secondly, it is also a great help to the community.
Yes, a context owns buffers, images (data), kernels (code), and events (synchronization)... You cannot do cross-context stuff like this. Thanks for coming back.
- Bruhaspati