16 Replies Latest reply on Jun 29, 2011 2:10 PM by pinzo

    kernel submit time is too long

    pinzo
      submit time >> enqueue time and execution time

      hi,

      I have done a host code that call 1 kernel and i use the profiling to see my kernel execution time and the launch time. I read in AMD programming guide that launch time (CL_PROFILING_COMMAND_START - CL_PROFILING_COMMAND_QUEUE) is about 225 microsecond and the profiling adds about 40  microsecond.

      Instead my kernel launch time (start-queue) takes about 2.64 millisecond while the esecution time (end-start) takes 0.43 millisecond. The most part of this 2.64 ms come from submit time (start-submit)=2.61 ms while (submit-queue)=0.03ms.

      why is the submit time so long? Is it normal? what can I do?

      thanks very much,

      bye

      errcode=clSetKernelArg(kernel,0,sizeof(cl_mem), (void*) &hits_dbuff); if(errcode != CL_SUCCESS) printf("failed arg 0: %d\n",errcode); errcode=clSetKernelArg(kernel,1,sizeof(cl_mem), (void*) &candy_dbuff); if(errcode != CL_SUCCESS) printf("failed arg 1: %d\n",errcode); errcode=clSetKernelArg(kernel,2,sizeof(cl_mem), (void*) &nn_evt_dbuff); if(errcode != CL_SUCCESS) printf("failed arg 2: %d\n",errcode); size_t local_work_size=256; size_t global_work_size =local_work_size*GRID_DIM; size_t group_work=global_work_size/local_work_size; cl_ulong start, end, queued, submit; errcode=clEnqueueNDRangeKernel(queue,kernel,1,NULL,&global_work_size,&local_work_size,0,NULL,&event); clWaitForEvents(1,&event); if(errcode != CL_SUCCESS) printf("failed NDrange: %d\n",errcode); clEnqueueBarrier(queue); clFlush(queue); clFinish(queue); clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_QUEUED,sizeof(cl_ulong), &queued, NULL); clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_SUBMIT,sizeof(cl_ulong), &submit, NULL); clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,sizeof(cl_ulong), &start, NULL); clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL); clReleaseEvent(event);

        • kernel submit time is too long
          maximmoroz

          Lazy buffer allocation.

          Try profiling the 2nd kernel run instead of the 1st one.

            • kernel submit time is too long
              pinzo

              I try with a "for cycle" external to all the host program but all the repetition have the same submit time.

              what I should do?

                • kernel submit time is too long
                  maximmoroz

                  pinzo, try running the following code:

                   

                  errcode=clSetKernelArg(kernel,0,sizeof(cl_mem), (void*) &hits_dbuff); if(errcode != CL_SUCCESS) printf("failed arg 0: %d\n",errcode); errcode=clSetKernelArg(kernel,1,sizeof(cl_mem), (void*) &candy_dbuff); if(errcode != CL_SUCCESS) printf("failed arg 1: %d\n",errcode); errcode=clSetKernelArg(kernel,2,sizeof(cl_mem), (void*) &nn_evt_dbuff); if(errcode != CL_SUCCESS) printf("failed arg 2: %d\n",errcode); size_t local_work_size=256; size_t global_work_size =local_work_size*GRID_DIM; size_t group_work=global_work_size/local_work_size; cl_ulong start, end, queued, submit; errcode=clEnqueueNDRangeKernel(queue,kernel,1,NULL,&global_work_size,&local_work_size,0,NULL,&event); clWaitForEvents(1,&event); errcode=clEnqueueNDRangeKernel(queue,kernel,1,NULL,&global_work_size,&local_work_size,0,NULL,&event); clWaitForEvents(1,&event); if(errcode != CL_SUCCESS) printf("failed NDrange: %d\n",errcode); clEnqueueBarrier(queue); clFlush(queue); clFinish(queue); clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_QUEUED,sizeof(cl_ulong), &queued, NULL); clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_SUBMIT,sizeof(cl_ulong), &submit, NULL); clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,sizeof(cl_ulong), &start, NULL); clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL); clReleaseEvent(event);

                  • kernel submit time is too long
                    pinzo

                    I have just tried and the submit time very little shorter than before, now takes 2.43 ms.

                    Some idea?

                    thank you

                      • kernel submit time is too long
                        maximmoroz

                        What are OS, driver and AMD APP SDK versions?

                          • kernel submit time is too long
                            pinzo

                            i use  Scientific Linux 5.5, ati-stream-sdk-v2.3-lnx64, and ATI Catalyst 10.12

                              • kernel submit time is too long
                                maximmoroz

                                Well, I have an obvious suggestion: Try with newer APP SDK and driver. I remember they said that they improved kernel launch time in some of recent releases.

                                P.S. But don't use Catalyst 11.6, it has several flaws. Use 11.5 instead.

                                http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_APP_SDK_Release_Notes_Developer.pdf : Improved kernel launch times

                                  • kernel submit time is too long
                                    pinzo

                                    I install Catalyst 11.5 and APP SDK 2.4, the submit time is improved and it take only 446 microsecond, but now there is an other problem: with the new APP SDK version the esecution time take 2.77 ms instead of 0.43 ms.

                                    why?

                                    thank you very much for your answer

                                      • kernel submit time is too long
                                        maximmoroz

                                        Pinzo, show me the kernel's source code.

                                          • kernel submit time is too long
                                            pinzo

                                            ok,

                                            thanks

                                            #define CALC_THREAD 256 #define PM_X_THREAD 4 #define H_MIN 0 //cm #define H_MAX 1024 //cm #define NUM_BINS 16 #define BINS_RIDOTTI NUM_BINS/4 #define BIN_WIDTH (H_MAX-H_MIN)/NUM_BINS #define NUM_HITS_THR 7 #define PRIMO_PUNTO_X -30 #define PRIMO_PUNTO_Y -30 #define PASSI_X 5 #define PASSI_Y 5 #define PASSO (2*(-PRIMO_PUNTO_X)/PASSI_X)+1 #define LIMITE 170 #pragma OPENCL EXTENSION cl_amd_printf : enable #pragma OPENCL EXTENSION cl_amd_fp64 : enable __kernel void domh(__global float *hits_d,__global int3 *candy_d,__global int *nn_evt_d) //hits_d era un float2 { int nnevt = nn_evt_d[get_group_id(0)]; __local int histos[256]; __local int hh_step2[256]; __local int hh_step3[64]; __local int hh_step4[8]; __local int pre_histox[BINS_RIDOTTI][256]; __local float2 hits_s[32]; //era 2 anche in domh() int indice=0; if(get_local_id(0) < 32) { hits_s[get_local_id(0)].x = -28.8 + 0.9*((int) hits_d[(get_group_id(0)<<5)+get_local_id(0)] % 65); hits_s[get_local_id(0)].y = -28.0592 + 1.55885*((int)hits_d[(get_group_id(0)<<5)+get_local_id(0)] % 2) +3.1177*(floor(native_divide(hits_d[(get_group_id(0)<<5)+get_local_id(0)],65))); //printf("hits_s[%ld].x=%f\n",get_local_id(0),hits_s[get_local_id(0)].x); //printf("hits_s[%ld].y=%f\n",get_local_id(0),hits_s[get_local_id(0)].y); } //inizializza shared memory histos[2 * get_local_id(0)] = 0; histos[(2 * get_local_id(0))+1] = 0; barrier(CLK_LOCAL_MEM_FENCE); //corpo centrale float primo_punto_x = PRIMO_PUNTO_X; float primo_punto_y = PRIMO_PUNTO_Y; float passi_x = PASSI_X; float passi_y = PASSI_Y; int passo = PASSO; int Idx_centerpmx1 = get_local_id(0); int Idx_centerpmx2 = get_local_id(0)+256; int Idx_centerpmx3 = get_local_id(0)+512; int Idx_centerpmx4 = get_local_id(0)+768; float xcenterx1 = primo_punto_x + passi_x * (Idx_centerpmx1 % passo); float ycenterx1 = primo_punto_y + passi_y * floor(Idx_centerpmx1/passo); float xcenterx2 = primo_punto_x + passi_x * (Idx_centerpmx2 % passo); float ycenterx2 = primo_punto_y + passi_y * floor(Idx_centerpmx2/passo); float xcenterx3 = primo_punto_x + passi_x * (Idx_centerpmx3 % passo); float ycenterx3 = primo_punto_y + passi_y * floor(Idx_centerpmx3/passo); float xcenterx4 = primo_punto_x + passi_x * (Idx_centerpmx4 % passo); float ycenterx4 = primo_punto_y + passi_y * floor(Idx_centerpmx4/passo); //printf("x1 = %f y1 = %f x2 = %f y2 = %f x3 = %f y3 = %f\n",xcenterx1, ycenterx1,xcenterx2,ycenterx2,xcenterx3, ycenterx3 ); #pragma unroll for (int i=0;i<BINS_RIDOTTI;i++) { pre_histox[i][get_local_id(0)]=0; } int maxhx1 = 0; int maxnx1 = 0; if(get_local_id(0)<LIMITE) { // primo istogramma #pragma unroll for (int idx=0;idx<nnevt;idx++) { float xhit = hits_s[idx].x; float yhit = hits_s[idx].y; float disx = (xhit-xcenterx1)*(xhit-xcenterx1)+(yhit-ycenterx1)*(yhit-ycenterx1); int ndisx = floor(native_divide(disx,BIN_WIDTH)); indice = floor(native_divide(ndisx,4)); pre_histox[indice][get_local_id(0)] += (ndisx%4 == 0 && ndisx<NUM_BINS) ? ((1<<24) & 0XFF000000) : 0; pre_histox[indice][get_local_id(0)] += (ndisx%4 == 1 && ndisx<NUM_BINS) ? ((1<<16) & 0X00FF0000) : 0; pre_histox[indice][get_local_id(0)] += (ndisx%4 == 2 && ndisx<NUM_BINS) ? ((1<<8) & 0X0000FF00) : 0; pre_histox[indice][get_local_id(0)] += (ndisx%4 == 3 && ndisx<NUM_BINS) ? (1 & 0X000000FF) : 0; }//creo il primo istogramma #pragma unroll for (int i=0;i<BINS_RIDOTTI;i++) { maxhx1 = (((pre_histox[i][get_local_id(0)] & 0XFF000000)>>24)>maxhx1) ? (pre_histox[i][get_local_id(0)] & 0XFF000000)>>24 : maxhx1; maxnx1 = (((pre_histox[i][get_local_id(0)] & 0XFF000000)>>24)>maxhx1) ? 4*i : maxnx1; maxhx1 = (((pre_histox[i][get_local_id(0)] & 0X00FF0000)>>16)>maxhx1) ? (pre_histox[i][get_local_id(0)] & 0X00FF0000)>>16 : maxhx1; maxnx1 = (((pre_histox[i][get_local_id(0)] & 0X00FF0000)>>16)>maxhx1) ? 4*i+1 : maxnx1; maxhx1 = (((pre_histox[i][get_local_id(0)] & 0X0000FF00)>>8)>maxhx1) ? (pre_histox[i][get_local_id(0)] & 0X0000FF00)>>8 : maxhx1; maxnx1 = (((pre_histox[i][get_local_id(0)] & 0X0000FF00)>>8)>maxhx1) ? 4*i+2 : maxnx1; maxhx1 = ((pre_histox[i][get_local_id(0)] & 0X000000FF)>maxhx1) ? pre_histox[i][get_local_id(0)] & 0X000000FF : maxhx1; maxnx1 = ((pre_histox[i][get_local_id(0)] & 0X000000FF)>maxhx1) ? 4*i+3 : maxnx1; } //ho trovato la migliore distanza del primo istogramma } histos[get_local_id(0)] =((maxnx1<<24) & 0XFF000000) | ((maxhx1<<16) & 0X00FF0000); //ho messo la migliore distanza di ogni PMT con il numero di conteggi nel vettore histos[], due per ogni indice di histos, i bytes maggiori contengono i primi 512 PMT i bytes minori contengono i successivi 512 PMT barrier(CLK_LOCAL_MEM_FENCE); // in questo step uso 256 work-item che confrontano i 1024 histogrammi 4 l'uno int hh_cont_max = 0; int hh_addr_max = 0; int hh_cont =0; hh_cont = (histos[get_local_id(0)] & 0x00FF0000)>>16; hh_cont_max = hh_cont; hh_addr_max = get_local_id(0); hh_step2[get_local_id(0)] = hh_addr_max; barrier(CLK_LOCAL_MEM_FENCE); if(get_local_id(0)<64) { hh_addr_max = 0; hh_cont_max = 0; int idx_step3_1 = hh_step2[get_local_id(0)]; int idx_step3_2 = hh_step2[get_local_id(0) + 64]; int idx_step3_3 = hh_step2[get_local_id(0) + 128]; int idx_step3_4 = hh_step2[get_local_id(0) + 192]; int hh_cont3 = (histos[idx_step3_1] & 0x00FF0000)>>16; if (hh_cont3 > hh_cont_max) { hh_cont_max = hh_cont3; hh_addr_max = idx_step3_1; } hh_cont3 = (histos[idx_step3_2] & 0x00FF0000)>>16; if (hh_cont3 > hh_cont_max) { hh_cont_max = hh_cont3; hh_addr_max = idx_step3_2; } hh_cont3 = (histos[idx_step3_3] & 0x00FF0000)>>16; if (hh_cont3 > hh_cont_max) { hh_cont_max = hh_cont3; hh_addr_max = idx_step3_3; } hh_cont3 = (histos[idx_step3_4] & 0x00FF0000)>>16; if (hh_cont3 > hh_cont_max) { hh_cont_max = hh_cont3; hh_addr_max = idx_step3_4; } hh_step3[get_local_id(0)] = hh_addr_max; } barrier(CLK_LOCAL_MEM_FENCE); if(get_local_id(0)>=64 && get_local_id(0)<72) { hh_addr_max = 0; hh_cont_max = 0; for(int l=0; l<2; l++) //potrei srotolarlo in tutti ed 8 invece che in 4, pero con consumo maggiore di registri { // int idx_step4_1 = hh_step3[4*(get_local_id(0)-64) + 32*l]; // int idx_step4_2 = hh_step3[4*(get_local_id(0)-64) + 32*l+1]; // int idx_step4_3 = hh_step3[4*(get_local_id(0)-64) + 32*l+2]; // int idx_step4_4 = hh_step3[4*(get_local_id(0)-64) + 32*l+3]; int idx_step4_1 = hh_step3[(get_local_id(0)-64) + 32*l]; int idx_step4_2 = hh_step3[(get_local_id(0)-64) + 32*l+8]; int idx_step4_3 = hh_step3[(get_local_id(0)-64) + 32*l+16]; int idx_step4_4 = hh_step3[(get_local_id(0)-64) + 32*l+24]; int hh_cont4 = (histos[idx_step4_1] & 0x00FF0000)>>16; if (hh_cont4 > hh_cont_max) { hh_cont_max = hh_cont4; hh_addr_max = idx_step4_1; } hh_cont4 = (histos[idx_step4_2] & 0x00FF0000)>>16; if (hh_cont4 > hh_cont_max) { hh_cont_max = hh_cont4; hh_addr_max = idx_step4_2; } hh_cont4 = (histos[idx_step4_3] & 0x00FF0000)>>16; if (hh_cont4 > hh_cont_max) { hh_cont_max = hh_cont4; hh_addr_max = idx_step4_3; } hh_cont4 = (histos[idx_step4_4] & 0x00FF0000)>>16; if (hh_cont4 > hh_cont_max) { hh_cont_max = hh_cont4; hh_addr_max = idx_step4_4; } } hh_step4[get_local_id(0)-64] = hh_addr_max; } barrier(CLK_LOCAL_MEM_FENCE); if(get_local_id(0) == 72) { hh_addr_max = 0; hh_cont_max = 0; #pragma unroll for (int i=0;i<8;i+=4) { int idx_step5_1 = hh_step4[i]; int idx_step5_2 = hh_step4[i+1]; int idx_step5_3 = hh_step4[i+2]; int idx_step5_4 = hh_step4[i+3]; int hh_cont5 = (histos[idx_step5_1] & 0x00FF0000)>>16; if (hh_cont5 > hh_cont_max) { hh_cont_max = hh_cont5; hh_addr_max = idx_step5_1; } hh_cont5 = (histos[idx_step5_2] & 0x00FF0000)>>16; if (hh_cont5 > hh_cont_max) { hh_cont_max = hh_cont5; hh_addr_max = idx_step5_2; } hh_cont5 = (histos[idx_step5_3] & 0x00FF0000)>>16; if (hh_cont5 > hh_cont_max) { hh_cont_max = hh_cont5; hh_addr_max = idx_step5_3; } hh_cont5 = (histos[idx_step5_4] & 0x00FF0000)>>16; if (hh_cont5 > hh_cont_max) { hh_cont_max = hh_cont5; hh_addr_max = idx_step5_4; } } candy_d[get_group_id(0)].x = (histos[hh_addr_max] & 0x00FF0000)>>16; candy_d[get_group_id(0)].y= hh_addr_max; candy_d[get_group_id(0)].z = (histos[hh_addr_max] & 0xFF000000)>>24; //printf("candy_d[%d].y(center)=%d\n",get_group_id(0),candy_d[get_group_id(0)].y); //printf("candy_d[%d].x(max)=%d\n",get_group_id(0),candy_d[get_group_id(0)].x); //printf("candy_d[%d].z(radius)=%d\n",get_group_id(0),candy_d[get_group_id(0)].z); } barrier(CLK_LOCAL_MEM_FENCE); return; }

                                              • kernel submit time is too long
                                                pinzo

                                                with the new APP SDK version even the EnqueueReadBuffer and EnqueueWriteBuffer time have increased of about 10 times. why?

                                                • kernel submit time is too long
                                                  maximmoroz

                                                  The problem is that building your kernel crashes with Catalyst 11.6, which I am unable to uninstall.

                                                  Still, some points and suggestions:

                                                  1) make parameters hits_d and nn_evt_d 'const' as far as you don't modify the buffers (and add 'restrict' to enable caching):

                                                  __kernel void domh(const __global float * restrict hits_d,__global int3 *candy_d, const __global int * restrict nn_evt_d)

                                                  2) Possible out-of-range buffer access:

                                                   histos[2 * get_local_id(0)] = 0;
                                                   histos[(2 * get_local_id(0))+1] = 0;

                                                  3) If you don't modify private variable make it const. For example:

                                                  const int passo = PASSO;

                                                  This will enable the compiler to treat 'passo' as constant and optimize subsequent / and % operations.

                                                  4) Use mul24 and mad24 whenever possible.

                                                  float xcenterx4 = primo_punto_x + passi_x * (Idx_centerpmx4 % passo);

                                                  to be replaced (if possible) with:

                                                  float xcenterx4 = mad24(passi_x, Idx_centerpmx4 % passo, primo_punto_x);

                                                  5) While native_divide is fast it is not as fast as multiplication:

                                                  native_divide(disx,BIN_WIDTH)

                                                  to be replcaed with:

                                                  disx * (1.0 / BIN_WIDTH)

                                                  The compilator will do (1.0 / BIN_WIDTH) itself.

                                                  The same with native_divide(ndisx,4)

                                                  6) The whole cycle prefixed with comment "// primo istogramma". Each work-item updates just "NUM_BINS/4" values in pre_histox array. I am sure it is much faster to update private array of size "NUM_BINS/4", initialized with 0, and then update corresponding indices in the local array.

                                                  Hey, you might even write these values to local array after you processed them in the next cycle, which ends with "//ho trovato la migliore distanza del primo istogramma".

                                                  7) Then comes some reduction code which I don't understand, sorry. But it looks very suspicous as there is a lot of code which works for several or even just single work item.

                                                  8) Still I suggest you to replace all small 'if's with predicates:

                                                  For example, the following code:

                                                     if (hh_cont4 > hh_cont_max)
                                                     {
                                                      hh_cont_max = hh_cont4;
                                                      hh_addr_max = idx_step4_2;
                                                     }

                                                  better replace it with:

                                                  const bool b4 = hh_cont4 > hh_cont_max;
                                                  hh_cont_max = b4 ? hh_cont4 : hh_cont_max;
                                                  hh_addr_max = b4 ? idx_step4_2 : hh_addr_max;

                                                  Thus you will have less operations and, what is more important, clauses.

                                                  9) Write to candy_d: why are you doing it 3 times?

                                                  candy_d[get_group_id(0)] = (int3){(histos[hh_addr_max] & 0x00FF0000)>>16, hh_addr_max, (histos[hh_addr_max] & 0xFF000000)>>24};

                                                  10) Final barrier(CLK_LOCAL_MEM_FENCE) is useless. It just slows your kernel

                                                  And I have a question: What is global_work_size you run your kernel for? And the device?

                                                  I also suggest you to read AMD APP OpenCL Programming Guide. It is clear and very useful document.