shingoxlf

performance difference between CUDA and OpenCL

Discussion created by shingoxlf on Oct 15, 2011
Latest reply on Oct 15, 2011 by antzrhere

Hi all, I have a heart simulation program implemented in both CUDA and OpenCL.

I profiled the kernels and the time spent on memcpy, as in the attached code.

I did the same thing for the OpenCL code; however, the OpenCL profiling shows that 99.99% of the time is spent on memory copy, while CUDA spends only about 1% on memory copy.

 

Here is the profiling for OpenCL:

(GPU time = total time - memory copy time)

 

total         time is 176.00000000

Kernel stim   time is 0.00000000

Kernel cur    time is 0.01428032

Kernel gate   time is 0.01075721

Kernel bcs    time is 0.01370287

memory copy   time is 175.97290492

GPU           time is 0.02709508

And here is the profiling for CUDA:

 

total         time is 101.22078705

Kernel stim   time is 0.00289774

Kernel cur    time is 67.28576040

Kernel gate   time is 20.97523451

Kernel bcs    time is 10.90470386

memory copy   time is 1.55556965

GPU           time is 99.66521740

Can anybody tell me what caused this difference?

 

while (derivarr[0] <= tfinal && step <= Nsteps + 1) {
    // from (1 to Nx) instead of (0 to Nx+1); do not loop through ghost points

    // GPU kernel execution
    time_temp = rtclock();
    if (stimnum > 0)
        d_stimulate_kernel<<<dimGrid,dimBlock>>>(stimnum, d_datarr, d_stimarr, d_derivarr, varnum, step, Istimamp, Ny, stimint);
    cudaThreadSynchronize();
    stim_time += (double)(rtclock() - time_temp);

    time_temp = rtclock();
    if (blocktimenum > 0)
        d_blockonoff_kernel<<<dimGrid,dimBlock>>>(blocktimenum, d_derivarr, d_blocktimes, d_block, Nx, Ny);
    cudaThreadSynchronize();
    block_time += (double)(rtclock() - time_temp);

    time_temp = rtclock();
    d_brgates_kernel<<<dimGrid,dimBlock>>>(varnum, d_datarr, d_derivarr, d_constarr, step, Ny);
    cudaThreadSynchronize();
    gate_time += (double)(rtclock() - time_temp);

    time_temp = rtclock();
    d_brcurrents_kernel<<<dimGrid,dimBlock>>>(stimnum, d_datarr, d_derivarr, step, Istimamp, Ny, varnum, d_constarr, d_Afield, d_block, d_Dp, dt);
    cudaThreadSynchronize();
    cur_time += (double)(rtclock() - time_temp);

    time_temp = rtclock();
    dim3 dimGrid1(1,1,1);
    dim3 dimBlock1(1,1,1);
    kernel_call_device_bcs<<<dimGrid1, dimBlock1>>>(dx, dy, d_D, BC, step, Nx, Ny, varnum, d_Dp, d_datarr, d_derivarr, dt);
    cudaThreadSynchronize();
    cutilCheckMsg("CUDA Kernel");
    bcs_time += (double)(rtclock() - time_temp);

    time_temp = rtclock();
    if (step % rpN == 0) {
        // Coalescing cudaMemcpy: copy device memory back to host in one transfer
        cutilSafeCall(cudaMemcpy(linear_datarr, d_datarr,
                                 (Nx+2)*(Ny+2)*varnum*2*sizeof(double),
                                 cudaMemcpyDeviceToHost));
        for (i = 0; i < (Nx+2); i++) {
            for (j = 0; j < (Ny+2); j++) {
                for (k = 0; k < varnum; k++) {
                    for (int l = 0; l < 2; l++) {
                        datarr[i][j][k][l] = *(linear_datarr + i*(Ny+2)*varnum*2 + j*varnum*2 + k*2 + l);
                    }
                }
            }
        }
        output();
        printf("%4.4e msec, Vm(%d,%d): %3.2f mV GPU\n", derivarr[0], mNx, mNy, datarr[mNx][mNy][0][step%2]);
    }
    mem_time += (double)(rtclock() - time_temp);

    step++;
    derivarr[0] += dt;
    deriv3darr[0][0][0] += dt;  // update time (msec)
}

double gpu_end = rtclock();
printf("total         time is %.8lf\n", (double)(gpu_end - gpu_start));
printf("Kernel stim   time is %.8lf\n", stim_time);
printf("Kernel cur    time is %.8lf\n", cur_time);
printf("Kernel gate   time is %.8lf\n", gate_time);
printf("Kernel bcs    time is %.8lf\n", bcs_time);
printf("memory copy   time is %.8lf\n", mem_time);
printf("GPU           time is %.8lf\n", (double)(gpu_end - gpu_start) - mem_time);
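Since the OpenCL host code is not attached, here is a minimal sketch of how the same per-kernel and read-back timing would typically be structured on the OpenCL side. All names (queue, cur_kernel, d_datarr_buf, the helper function itself) are placeholders, not the actual identifiers from the program. The thing to check is that clFinish() plays the role of cudaThreadSynchronize(): clEnqueueNDRangeKernel returns immediately, so if the queue is not drained before the timer is read, the kernel time ends up charged to the next blocking call, usually the blocking clEnqueueReadBuffer in the memory-copy bucket.

#include <CL/cl.h>
#include <stddef.h>

double rtclock(void);  /* same wall-clock timer used in the CUDA version */

/* Sketch only: Nx, Ny, varnum and linear_datarr come from the original
   program; the queue, kernel and buffer arguments are assumed placeholders. */
void timed_step(cl_command_queue queue, cl_kernel cur_kernel,
                const size_t *global, const size_t *local,
                cl_mem d_datarr_buf, double *linear_datarr,
                int Nx, int Ny, int varnum,
                double *cur_time, double *mem_time)
{
    double time_temp;

    /* Per-kernel timing: without the clFinish() (the analogue of
       cudaThreadSynchronize()), only the enqueue overhead is measured. */
    time_temp = rtclock();
    clEnqueueNDRangeKernel(queue, cur_kernel, 2, NULL, global, local, 0, NULL, NULL);
    clFinish(queue);
    *cur_time += rtclock() - time_temp;

    /* Read-back timing: a blocking read (CL_TRUE) waits for everything still
       queued, so any unsynchronized kernel work gets charged to this bucket. */
    time_temp = rtclock();
    clEnqueueReadBuffer(queue, d_datarr_buf, CL_TRUE, 0,
                        (size_t)(Nx + 2) * (Ny + 2) * varnum * 2 * sizeof(double),
                        linear_datarr, 0, NULL, NULL);
    *mem_time += rtclock() - time_temp;
}

If the OpenCL port times the kernels without a clFinish() (or without using event profiling), numbers like the ones above, where almost the whole run shows up as "memory copy", are what such a measurement would produce, even when the kernels themselves dominate the runtime.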
