10 Replies Latest reply on Jul 19, 2013 7:39 AM by lenjyco

    Profiling time for blocking and non-blocking execution

    lenjyco

      Hi all,

       

      I'm new to OpenCL dev and i want to understand some mechanics.

       

      I've a simple matrix multiplication kernel, and i want to see the impact of the blocking option for the clEnqueue* instructions.

      So, i compute one time with blocking write&read and the other non blocking.

       

      When I look to the profiling times of execution, I've, for the blocking version, a sequential order for each time (enqueue, submit, kernel start, kernel end) but in non blocking i got the execution of the kernel before that it's submitted and queued.

       

      Can someone explain me this behaviour, thank you very much.

        • Re: Profiling time for blocking and non-blocking execution
          nou

          blocking call is equivalent of clFinish() after that call so it finish executing before return. unless you have out of order queue (which is not supported) you will always see execution in order.

          • Re: Profiling time for blocking and non-blocking execution
            himanshu.gautam

            lenjyco wrote:

             

            kernel end) but in non blocking i got the execution of the kernel before that it's submitted and queued.

             

            Can someone explain me this behaviour, thank you very much.

            Getting a negative number for (END-START) is certainly strange. You can submit a test case for that.

            also you should notice that non-blocking call returns in very less time(just after queueing your command), as the execution will be started at some later time. This should in general result in very high submit time (submit-queued), other times should be more or less comparable.

            • Re: Profiling time for blocking and non-blocking execution
              lenjyco

              Hey,

               

              thanks for the response,

               

              @nou : i'm using IN-ORDER queue right now, i will test OUT-OF-ORDER after.

               

              @himanshu.gautam : I think you misunderstood me, it's not the difference between END-START that is negative .

               

              For example, i get the 4 differents time (Queueing, submiting, sarting and ending) in my non-blocking version and i get from my output :

               

              - Queue = 27095955722029

              - Submit = 27095177522023
              - Start =  27095177553977

              - End = 27095946659941

               

              (Sorry for the use of raw data)

               

              You can obverse that the execution starts before it's queued, it's nonsense for me but i'm sure i'm missing something.

               

              Regards,

                • Re: Profiling time for blocking and non-blocking execution
                  himanshu.gautam

                  ok. cl_event counters are very trust worthy. Can you attach you testcase, i can forward it to AMD Engg Team.

                  For now,  use some high precision CPU Timers only.

                    • Re: Profiling time for blocking and non-blocking execution
                      lenjyco

                      "Can you attach you testcase" : You mean that you want my code where i get those profiling time ?

                        • Re: Profiling time for blocking and non-blocking execution
                          himanshu.gautam

                          Yes I mean can you attach a minimal host code + kernel which gives such weird counter values. BTW, on the last reply i meant , the counters are NOT very trustworthy. Sorry for the typo.

                          I would suggest you to use High precision CPU Timers only. But AMD would certainly want to make their counters reliable, and your test case can help us greatly.

                            • Re: Profiling time for blocking and non-blocking execution
                              lenjyco

                              Here is my kernel :

                               

                              kernel void matrixMultiplication(__global float* A, __global float* B, __global float* C,  int widthA, int widthB, int k )
                              {
                                  int i = get_global_id(0);
                                  int j = get_global_id(1);
                                  float value = 0;
                                  for (int tmp =  0; tmp < k; tmp++)
                                  {
                                      for ( int x = 0; x < widthA; x++)
                                      {
                                          value = value + A[x + j * widthA] * B[x*widthB + i];
                                      }
                                      C[i + widthA * j] = value;
                                  }
                              }

                               

                               

                               

                               

                              and here is my hostcode which is a little bit heavy so i just attached the queueing instructions :

                               

                               

                               

                               

                              oclCopyHostToDeviceSynch(queue_cpu, A, widthA * heightA * sizeof (float), a_in_cpu);

                              oclCopyHostToDeviceSynch(queue_cpu, B, widthB * heightB * sizeof (float), b_in_cpu);

                               

                               

                                  //EXECUTE the kernel

                                  ret = clEnqueueNDRangeKernel(queue_cpu, kernel_cpu, 2, NULL, gWorkSize, lWorkSize, 0, NULL, &event_list_execute[0]);

                                  if (checkCLError(ret, __FILE__, __FUNCTION__, __LINE__) == 1)return 1;

                               

                                  //Wait the end of computation, used to get time of execution

                                  clWaitForEvents(1, event_list_execute);

                               

                                  clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_QUEUED, sizeof (cl_ulong), &time_buff[0], NULL);

                                  clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_SUBMIT, sizeof (cl_ulong), &time_buff[1], NULL);

                                  clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_START, sizeof (cl_ulong), &time_buff[2], NULL);

                                  clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_END, sizeof (cl_ulong), &time_buff[3], NULL);

                               

                              //WRITE the data

                                  oclCopyHostToDeviceSynch(queue_gpu, A, widthA * heightA * sizeof (float), a_in_gpu);

                                  oclCopyHostToDeviceSynch(queue_gpu, B, widthB * heightB * sizeof (float), b_in_gpu);

                               

                               

                                  //EXECUTE the kernel

                                  ret = clEnqueueNDRangeKernel(queue_gpu, kernel_gpu, 2, NULL, gWorkSize, lWorkSize, 0, NULL, &event_list_execute[0]);

                                  if (checkCLError(ret, __FILE__, __FUNCTION__, __LINE__) == 1)return 1;

                               

                                  //Wait the end of computation

                                  clWaitForEvents(1, &event_list_execute[0]);

                               

                                  clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_QUEUED, sizeof (cl_ulong), &time_buff[0], NULL);

                                  clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_SUBMIT, sizeof (cl_ulong), &time_buff[1], NULL);

                                  clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_START, sizeof (cl_ulong), &time_buff[2], NULL);

                                  clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_END, sizeof (cl_ulong), &time_buff[3], NULL);

                               

                              oclCopyHostToDeviceAsynch(queue_cpu, A, widthA * heightA * sizeof (float), a_in_cpu, &event_list_write_cpu[0]);

                                  oclCopyHostToDeviceAsynch(queue_cpu, B, widthB * heightB * sizeof (float), b_in_cpu, &event_list_write_cpu[1]);

                               

                                  oclCopyHostToDeviceAsynch(queue_gpu, A, widthA * heightA * sizeof (float), a_in_gpu, &event_list_write_gpu[0]);

                                  oclCopyHostToDeviceAsynch(queue_gpu, B, widthA * heightA * sizeof (float), b_in_gpu, &event_list_write_gpu[1]);

                               

                               

                                  //Wait for the end of writting

                                  clWaitForEvents(2, event_list_write_gpu);

                                  clWaitForEvents(2, event_list_write_cpu);

                               

                               

                                  //EXECUTE the kernel

                                  ret = clEnqueueNDRangeKernel(queue_gpu, kernel_gpu, 2, NULL, gWorkSize, lWorkSize, 0, NULL, &event_list_execute[0]);

                                  if (checkCLError(ret, __FILE__, __FUNCTION__, __LINE__) == 1)return 1;

                                  ret = clEnqueueNDRangeKernel(queue_cpu, kernel_cpu, 2, NULL, gWorkSize, lWorkSize, 0, NULL, &event_list_execute[1]);

                                  if (checkCLError(ret, __FILE__, __FUNCTION__, __LINE__) == 1)return 1;

                               

                               

                               

                                  //Wait the end of computation

                                  clWaitForEvents(2, event_list_execute);

                               

                                  clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_QUEUED, sizeof (cl_ulong), &time_buff[0], NULL);

                                  clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_SUBMIT, sizeof (cl_ulong), &time_buff[1], NULL);

                                  clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_START, sizeof (cl_ulong), &time_buff[2], NULL);

                                  clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_END, sizeof (cl_ulong), &time_buff[3], NULL);

                                  for (int i = 0; i < 4; i++)

                                      profiling_time_tab[2][i] = time_buff[i];

                               

                                  clGetEventProfilingInfo(event_list_execute[1], CL_PROFILING_COMMAND_QUEUED, sizeof (cl_ulong), &time_buff[0], NULL);

                                  clGetEventProfilingInfo(event_list_execute[1], CL_PROFILING_COMMAND_SUBMIT, sizeof (cl_ulong), &time_buff[1], NULL);

                                  clGetEventProfilingInfo(event_list_execute[1], CL_PROFILING_COMMAND_START, sizeof (cl_ulong), &time_buff[2], NULL);

                                  clGetEventProfilingInfo(event_list_execute[1], CL_PROFILING_COMMAND_END, sizeof (cl_ulong), &time_buff[3], NULL);

                               

                               

                               

                              The oclCopyHostToDeviceAsynch & oclCopyHostToDeviceSynch are just functions more easier to understand and the difference is that Asynch is non blocking while Synch is.

                      • Re: Profiling time for blocking and non-blocking execution
                        lenjyco

                        I've found the error, i think maybe it will interest someone. If you use clWaitforEvents with a list of event which belongs to differents contexts an error occur (i forgot to check the return value).