
    performance difference between CUDA and OpenCL

    shingoxlf

      Hi all, I have a heart simulation program implemented in both CUDA and OpenCL.

      I profiled the kernels and the time spent on memcpy, as in the attached code.

      I did the same for the OpenCL code; however, the OpenCL profiling shows that 99.99% of the time is spent on memory copy, while CUDA spends only about 1% of the time on memory copy.

       

      Here is the profiling for OpenCL:

      gpu time = total time - mem copy time

       

      total         time is 176.00000000

      Kernel stim   time is 0.00000000

      Kernel cur    time is 0.01428032

      Kernel gate   time is 0.01075721

      Kernel bcs    time is 0.01370287

      memory copy   time is 175.97290492

      GPU           time is 0.02709508


      and here is the profiling for CUDA:

       

      total         time is 101.22078705

      Kernel stim   time is 0.00289774

      Kernel cur    time is 67.28576040

      Kernel gate   time is 20.97523451

      Kernel bcs    time is 10.90470386

      memory copy   time is 1.55556965

      GPU           time is 99.66521740

      Can anybody tell me what caused this difference?

       

      while (derivarr[0] <= tfinal && step <= Nsteps + 1) {
          // loop from (1 to Nx) instead of (0 to Nx+1): do not loop through ghost points

          // GPU kernel execution
          time_temp = rtclock();
          if (stimnum > 0)
              d_stimulate_kernel<<<dimGrid,dimBlock>>>(stimnum, d_datarr, d_stimarr, d_derivarr,
                                                       varnum, step, Istimamp, Ny, stimint);
          cudaThreadSynchronize();
          stim_time += (double)(rtclock() - time_temp);

          time_temp = rtclock();
          if (blocktimenum > 0)
              d_blockonoff_kernel<<<dimGrid,dimBlock>>>(blocktimenum, d_derivarr, d_blocktimes,
                                                        d_block, Nx, Ny);
          cudaThreadSynchronize();
          block_time += (double)(rtclock() - time_temp);

          time_temp = rtclock();
          d_brgates_kernel<<<dimGrid,dimBlock>>>(varnum, d_datarr, d_derivarr, d_constarr, step, Ny);
          cudaThreadSynchronize();
          gate_time += (double)(rtclock() - time_temp);

          time_temp = rtclock();
          d_brcurrents_kernel<<<dimGrid,dimBlock>>>(stimnum, d_datarr, d_derivarr, step, Istimamp,
                                                    Ny, varnum, d_constarr, d_Afield, d_block,
                                                    d_Dp, dt);
          cudaThreadSynchronize();
          cur_time += (double)(rtclock() - time_temp);

          time_temp = rtclock();
          dim3 dimGrid1(1,1,1);
          dim3 dimBlock1(1,1,1);
          kernel_call_device_bcs<<<dimGrid1, dimBlock1>>>(dx, dy, d_D, BC, step, Nx, Ny, varnum,
                                                          d_Dp, d_datarr, d_derivarr, dt);
          cudaThreadSynchronize();
          cutilCheckMsg("CUDA Kernel");
          bcs_time += (double)(rtclock() - time_temp);

          time_temp = rtclock();
          if (step % rpN == 0) {
              // Coalescing cudaMemcpy: copy device memory back to the host in one transfer
              cutilSafeCall(cudaMemcpy(linear_datarr, d_datarr,
                                       (Nx+2)*(Ny+2)*varnum*2*sizeof(double),
                                       cudaMemcpyDeviceToHost));
              for (i = 0; i < (Nx+2); i++) {
                  for (j = 0; j < (Ny+2); j++) {
                      for (k = 0; k < varnum; k++) {
                          for (int l = 0; l < 2; l++) {
                              datarr[i][j][k][l] = *(linear_datarr + i*(Ny+2)*varnum*2
                                                                   + j*varnum*2 + k*2 + l);
                          }
                      }
                  }
              }
              output();
              printf("%4.4e msec, Vm(%d,%d): %3.2f mV GPU\n",
                     derivarr[0], mNx, mNy, datarr[mNx][mNy][0][step%2]);
          }
          mem_time += (double)(rtclock() - time_temp);

          step++;
          derivarr[0] += dt;
          deriv3darr[0][0][0] += dt;  // update time (msec)
      }

      double gpu_end = rtclock();
      printf("total         time is %.8lf\n", (double)(gpu_end - gpu_start));
      printf("Kernel stim   time is %.8lf\n", stim_time);
      printf("Kernel cur    time is %.8lf\n", cur_time);
      printf("Kernel gate   time is %.8lf\n", gate_time);
      printf("Kernel bcs    time is %.8lf\n", bcs_time);
      printf("memory copy   time is %.8lf\n", mem_time);
      printf("GPU           time is %.8lf\n", (double)(gpu_end - gpu_start) - mem_time);

        • performance difference between CUDA and OpenCL
          stgatilov

          I think OpenCL's timing is much stranger than CUDA's.

          Seeing the OpenCL source would be much more helpful. =)

          Perhaps you missed a synchronization in the OpenCL code...

            • performance difference between CUDA and OpenCL
              antzrhere

              Your timings must be incorrect. Apparently Kernel cur is nearly 471200% faster in OpenCL than in CUDA. Unless you're emulating the CUDA code on an Amiga 500, I find that difficult to believe.

                • performance difference between CUDA and OpenCL
                  antzrhere

                  And why are you posting your Nvidia CUDA code on an AMD OpenCL forum... surely you should be posting your OpenCL code?

                  Make sure all your kernel calls are blocking (or call clFinish() before measuring timings); otherwise you won't be measuring execution time (which will massively inflate your calculated transfer time).
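
                  In outline, the pattern looks like this (a rough sketch with generic names, not your actual code):

                      double t0 = rtclock();
                      err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                                   globalWorkSize, localWorkSize, 0, NULL, NULL);
                      clFinish(queue);                  // block until the kernel has actually completed
                      kernel_time += rtclock() - t0;    // delta now covers execution, not just the enqueue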

                    • performance difference between CUDA and OpenCL
                      shingoxlf

                      Hi, the timing function I used is the same in CUDA and OpenCL.

                      I call clFlush after each OpenCL kernel. Here is the code for one kernel; the other kernels are similar:

                       

                      time_temp = rtclock();
                      errcode  = clSetKernelArg(clKernel_brcurrents,  0, sizeof(int),    (void *)&stimnum);
                      errcode |= clSetKernelArg(clKernel_brcurrents,  1, sizeof(cl_mem), (void *)&d_datarr);
                      errcode |= clSetKernelArg(clKernel_brcurrents,  2, sizeof(cl_mem), (void *)&d_derivarr);
                      errcode |= clSetKernelArg(clKernel_brcurrents,  3, sizeof(int),    (void *)&step);
                      errcode |= clSetKernelArg(clKernel_brcurrents,  4, sizeof(double), (void *)&Istimamp);
                      errcode |= clSetKernelArg(clKernel_brcurrents,  5, sizeof(int),    (void *)&Ny);
                      errcode |= clSetKernelArg(clKernel_brcurrents,  6, sizeof(int),    (void *)&varnum);
                      errcode |= clSetKernelArg(clKernel_brcurrents,  7, sizeof(cl_mem), (void *)&d_constarr);
                      errcode |= clSetKernelArg(clKernel_brcurrents,  8, sizeof(cl_mem), (void *)&d_Afield);
                      errcode |= clSetKernelArg(clKernel_brcurrents,  9, sizeof(cl_mem), (void *)&d_block);
                      errcode |= clSetKernelArg(clKernel_brcurrents, 10, sizeof(cl_mem), (void *)&d_Dp);
                      errcode |= clSetKernelArg(clKernel_brcurrents, 11, sizeof(double), (void *)&dt);
                      errcode |= clSetKernelArg(clKernel_brcurrents, 12, sizeof(cl_mem), (void *)&vm_gpu);
                      if (errcode != CL_SUCCESS)
                          printf("Error in setting arguments for kernel brcurrents\n");

                      // Execute the OpenCL kernel
                      errcode = clEnqueueNDRangeKernel(clCommandQue, clKernel_brcurrents, 2, NULL,
                                                       globalWorkSize, localWorkSize, 0, NULL, NULL);
                      if (errcode != CL_SUCCESS)
                          printf("Error in launching kernel\n");
                      errcode = clFlush(clCommandQue);
                      cur_time += (double)(rtclock() - time_temp);

                        • performance difference between CUDA and OpenCL
                          shingoxlf

                          And here is the timing function:

                          double rtclock()
                          {
                              struct timezone Tzp;
                              struct timeval  Tp;
                              int stat;

                              stat = gettimeofday(&Tp, &Tzp);
                              if (stat != 0)
                                  printf("Error return from gettimeofday: %d", stat);
                              return (Tp.tv_sec + Tp.tv_usec * 1.0e-6);
                          }
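
                          so each measurement just brackets the work with two calls, like this:

                              double t0 = rtclock();
                              /* ... work being timed ... */
                              double elapsed = rtclock() - t0;   /* wall-clock seconds, measured on the host */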

                            • performance difference between CUDA and OpenCL
                              adm271828

                              According to OpenCL "clFlush only guarantees that all queued commands to command_queue get issued to the appropriate device. There is no guarantee that they will be complete after clFlush returns." ...

                              Which means your timing measurement yields an unspecified result.

                              You should use clFinish instead (or clWaitForEvents).
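
                              Applied to the snippet you posted, it's a one-line change (sketch):

                                  errcode = clEnqueueNDRangeKernel(clCommandQue, clKernel_brcurrents, 2, NULL,
                                                                   globalWorkSize, localWorkSize, 0, NULL, NULL);
                                  clFinish(clCommandQue);   /* returns only once all queued commands have completed */
                                  cur_time += (double)(rtclock() - time_temp);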

                                • performance difference between CUDA and OpenCL
                                  antzrhere

                                  Yes, as I expected, what adm271828 said is correct. You need to call clFinish() before each timer read; otherwise you're only measuring the time to dispatch the command to the queue. And because you're probably performing a blocking read, that's why clEnqueueReadBuffer is taking so long - it has to wait for the kernel execution to finish before the copy even starts. clFinish(), clWaitForEvents(), or passing an event_wait_list to clEnqueueNDRangeKernel() and making all reads/writes blocking will do the job.
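
                                  Alternatively, you can let the runtime time the kernel itself with event profiling. A sketch, assuming your command queue was created with CL_QUEUE_PROFILING_ENABLE:

                                      cl_event ev;
                                      cl_ulong t_start, t_end;
                                      errcode = clEnqueueNDRangeKernel(clCommandQue, clKernel_brcurrents, 2, NULL,
                                                                       globalWorkSize, localWorkSize, 0, NULL, &ev);
                                      clWaitForEvents(1, &ev);   /* wait for the kernel to complete */
                                      clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &t_start, NULL);
                                      clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof(cl_ulong), &t_end,   NULL);
                                      cur_time += (double)(t_end - t_start) * 1.0e-9;   /* device time, ns -> s */
                                      clReleaseEvent(ev);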

                                  • performance difference between CUDA and OpenCL
                                    shingoxlf

                                    Thanks, I replaced clFlush with clFinish, and now it is working:

                                     

                                    total         time is 187.09144616

                                    Kernel stim   time is 0.00000000

                                    Kernel cur    time is 32.88452005

                                    Kernel gate   time is 121.27778316

                                    Kernel atomic time is 15.18993449

                                    Kernel bcs    time is 14.02507806

                                    memory copy   time is 2.76019478

                                    GPU           time is 184.23980522



                                      • performance difference between CUDA and OpenCL
                                        antzrhere

                                        Yes, those look better. Interestingly, it does show that Nvidia's OpenCL SDK is quite a bit behind its CUDA counterpart in terms of performance. I suppose that's to be expected given it's less mature, but the difference is quite staggering. Even the memory copy times are very different, more than you could put down to the different behaviour of clFinish() on the timings.