cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

lenjyco
Journeyman III

Profiling time for blocking and non-blocking execution

Hi all,

I'm new to OpenCL dev and i want to understand some mechanics.

I've a simple matrix multiplication kernel, and i want to see the impact of the blocking option for the clEnqueue* instructions.

So, i compute one time with blocking write&read and the other non blocking.

When I look to the profiling times of execution, I've, for the blocking version, a sequential order for each time (enqueue, submit, kernel start, kernel end) but in non blocking i got the execution of the kernel before that it's submitted and queued.

Can someone explain me this behaviour, thank you very much.

Tags (2)
0 Likes
9 Replies
nou
Exemplar

Re: Profiling time for blocking and non-blocking execution

blocking call is equivalent of clFinish() after that call so it finish executing before return. unless you have out of order queue (which is not supported) you will always see execution in order.

0 Likes
himanshu_gautam
Grandmaster

Re: Profiling time for blocking and non-blocking execution


lenjyco wrote:



kernel end) but in non blocking i got the execution of the kernel before that it's submitted and queued.



Can someone explain me this behaviour, thank you very much.


Getting a negative number for (END-START) is certainly strange. You can submit a test case for that.

also you should notice that non-blocking call returns in very less time(just after queueing your command), as the execution will be started at some later time. This should in general result in very high submit time (submit-queued), other times should be more or less comparable.

0 Likes
lenjyco
Journeyman III

Re: Profiling time for blocking and non-blocking execution

Hey,

thanks for the response,

@nou : i'm using IN-ORDER queue right now, i will test OUT-OF-ORDER after.

@himanshu.gautam : I think you misunderstood me, it's not the difference between END-START that is ne...

For example, i get the 4 differents time (Queueing, submiting, sarting and ending) in my non-blocking version and i get from my output :

- Queue = 27095955722029

- Submit = 27095177522023
- Start =  27095177553977

- End = 27095946659941

(Sorry for the use of raw data)

You can obverse that the execution starts before it's queued, it's nonsense for me but i'm sure i'm missing something.

Regards,

0 Likes
himanshu_gautam
Grandmaster

Re: Profiling time for blocking and non-blocking execution

ok. cl_event counters are very trust worthy. Can you attach you testcase, i can forward it to AMD Engg Team.

For now,  use some high precision CPU Timers only.

0 Likes
lenjyco
Journeyman III

Re: Profiling time for blocking and non-blocking execution

"Can you attach you testcase" : You mean that you want my code where i get those profiling time ?

0 Likes
himanshu_gautam
Grandmaster

Re: Profiling time for blocking and non-blocking execution

Yes I mean can you attach a minimal host code + kernel which gives such weird counter values. BTW, on the last reply i meant , the counters are NOT very trustworthy. Sorry for the typo.

I would suggest you to use High precision CPU Timers only. But AMD would certainly want to make their counters reliable, and your test case can help us greatly.

0 Likes
lenjyco
Journeyman III

Re: Profiling time for blocking and non-blocking execution

Here is my kernel :

kernel void matrixMultiplication(__global float* A, __global float* B, __global float* C,  int widthA, int widthB, int k )
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    float value = 0;
    for (int tmp =  0; tmp < k; tmp++)
    {
        for ( int x = 0; x < widthA; x++)
        {
            value = value + A[x + j * widthA] * B[x*widthB + i];
        }
        C[i + widthA * j] = value;
    }
}

and here is my hostcode which is a little bit heavy so i just attached the queueing instructions :

oclCopyHostToDeviceSynch(queue_cpu, A, widthA * heightA * sizeof (float), a_in_cpu);

oclCopyHostToDeviceSynch(queue_cpu, B, widthB * heightB * sizeof (float), b_in_cpu);

    //EXECUTE the kernel

    ret = clEnqueueNDRangeKernel(queue_cpu, kernel_cpu, 2, NULL, gWorkSize, lWorkSize, 0, NULL, &event_list_execute[0]);

    if (checkCLError(ret, __FILE__, __FUNCTION__, __LINE__) == 1)return 1;

    //Wait the end of computation, used to get time of execution

    clWaitForEvents(1, event_list_execute);

    clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_QUEUED, sizeof (cl_ulong), &time_buff[0], NULL);

    clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_SUBMIT, sizeof (cl_ulong), &time_buff[1], NULL);

    clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_START, sizeof (cl_ulong), &time_buff[2], NULL);

    clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_END, sizeof (cl_ulong), &time_buff[3], NULL);

//WRITE the data

    oclCopyHostToDeviceSynch(queue_gpu, A, widthA * heightA * sizeof (float), a_in_gpu);

    oclCopyHostToDeviceSynch(queue_gpu, B, widthB * heightB * sizeof (float), b_in_gpu);

    //EXECUTE the kernel

    ret = clEnqueueNDRangeKernel(queue_gpu, kernel_gpu, 2, NULL, gWorkSize, lWorkSize, 0, NULL, &event_list_execute[0]);

    if (checkCLError(ret, __FILE__, __FUNCTION__, __LINE__) == 1)return 1;

    //Wait the end of computation

    clWaitForEvents(1, &event_list_execute[0]);

    clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_QUEUED, sizeof (cl_ulong), &time_buff[0], NULL);

    clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_SUBMIT, sizeof (cl_ulong), &time_buff[1], NULL);

    clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_START, sizeof (cl_ulong), &time_buff[2], NULL);

    clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_END, sizeof (cl_ulong), &time_buff[3], NULL);

oclCopyHostToDeviceAsynch(queue_cpu, A, widthA * heightA * sizeof (float), a_in_cpu, &event_list_write_cpu[0]);

    oclCopyHostToDeviceAsynch(queue_cpu, B, widthB * heightB * sizeof (float), b_in_cpu, &event_list_write_cpu[1]);

    oclCopyHostToDeviceAsynch(queue_gpu, A, widthA * heightA * sizeof (float), a_in_gpu, &event_list_write_gpu[0]);

    oclCopyHostToDeviceAsynch(queue_gpu, B, widthA * heightA * sizeof (float), b_in_gpu, &event_list_write_gpu[1]);

    //Wait for the end of writting

    clWaitForEvents(2, event_list_write_gpu);

    clWaitForEvents(2, event_list_write_cpu);

    //EXECUTE the kernel

    ret = clEnqueueNDRangeKernel(queue_gpu, kernel_gpu, 2, NULL, gWorkSize, lWorkSize, 0, NULL, &event_list_execute[0]);

    if (checkCLError(ret, __FILE__, __FUNCTION__, __LINE__) == 1)return 1;

    ret = clEnqueueNDRangeKernel(queue_cpu, kernel_cpu, 2, NULL, gWorkSize, lWorkSize, 0, NULL, &event_list_execute[1]);

    if (checkCLError(ret, __FILE__, __FUNCTION__, __LINE__) == 1)return 1;

    //Wait the end of computation

    clWaitForEvents(2, event_list_execute);

    clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_QUEUED, sizeof (cl_ulong), &time_buff[0], NULL);

    clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_SUBMIT, sizeof (cl_ulong), &time_buff[1], NULL);

    clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_START, sizeof (cl_ulong), &time_buff[2], NULL);

    clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_END, sizeof (cl_ulong), &time_buff[3], NULL);

    for (int i = 0; i < 4; i++)

        profiling_time_tab[2] = time_buff;

    clGetEventProfilingInfo(event_list_execute[1], CL_PROFILING_COMMAND_QUEUED, sizeof (cl_ulong), &time_buff[0], NULL);

    clGetEventProfilingInfo(event_list_execute[1], CL_PROFILING_COMMAND_SUBMIT, sizeof (cl_ulong), &time_buff[1], NULL);

    clGetEventProfilingInfo(event_list_execute[1], CL_PROFILING_COMMAND_START, sizeof (cl_ulong), &time_buff[2], NULL);

    clGetEventProfilingInfo(event_list_execute[1], CL_PROFILING_COMMAND_END, sizeof (cl_ulong), &time_buff[3], NULL);

The oclCopyHostToDeviceAsynch & oclCopyHostToDeviceSynch are just functions more easier to understand and the difference is that Asynch is non blocking while Synch is.

0 Likes
lenjyco
Journeyman III

Re: Profiling time for blocking and non-blocking execution

I've found the error, i think maybe it will interest someone. If you use clWaitforEvents with a list of event which belongs to differents contexts an error occur (i forgot to check the return value).

0 Likes
himanshu_gautam
Grandmaster

Re: Profiling time for blocking and non-blocking execution

Thanks for coming back on this... You save our time for sure...Secondly, it is also a great help to the community..

Yes, Context owns buffers, images (data), kernel (code) and events (synch)... You cannot do cross-context stuff like this. Thanks for coming back...

- Bruhaspati

0 Likes