Hi all, I have a heart simulation program implemented in both CUDA and OpenCL.

I profiled the kernels and the time spent on memcpy, as shown in the attached code.

I did a similar thing for the OpenCL code. However, the OpenCL profile shows 99.99% of the time being spent on memory copies, while the CUDA version spends about 1% on memory copies.

Here is the profiling for OpenCL:

gpu time = total time - mem copy time

total time is 176.00000000

Kernel stim time is 0.00000000

Kernel cur time is 0.01428032

Kernel gate time is 0.01075721

Kernel bcs time is 0.01370287

*memory copy time is 175.97290492*

*GPU time is 0.02709508*

And here is the profiling for CUDA:

total time is 101.22078705

Kernel stim time is 0.00289774

Kernel cur time is 67.28576040

Kernel gate time is 20.97523451

Kernel bcs time is 10.90470386

*memory copy time is 1.55556965*

*GPU time is 99.66521740*

    while (derivarr[0] <= tfinal && step <= Nsteps + 1) {
        // from (1 to Nx) instead of (0 to Nx+1)
        // do not loop through ghost points

        // GPU Kernel Execution
        time_temp = rtclock();
        if (stimnum > 0)
            d_stimulate_kernel<<<dimGrid,dimBlock>>>(stimnum, d_datarr, d_stimarr, d_derivarr, varnum, step, Istimamp, Ny, stimint);
        cudaThreadSynchronize();
        stim_time += (double)(rtclock() - time_temp);

        time_temp = rtclock();
        if (blocktimenum > 0)
            d_blockonoff_kernel<<<dimGrid,dimBlock>>>(blocktimenum, d_derivarr, d_blocktimes, d_block, Nx, Ny);
        cudaThreadSynchronize();
        block_time += (double)(rtclock() - time_temp);

        time_temp = rtclock();
        d_brgates_kernel<<<dimGrid,dimBlock>>>(varnum, d_datarr, d_derivarr, d_constarr, step, Ny);
        cudaThreadSynchronize();
        gate_time += (double)(rtclock() - time_temp);

        time_temp = rtclock();
        d_brcurrents_kernel<<<dimGrid,dimBlock>>>(stimnum, d_datarr, d_derivarr, step, Istimamp, Ny, varnum, d_constarr, d_Afield, d_block, d_Dp, dt);
        cudaThreadSynchronize();
        cur_time += (double)(rtclock() - time_temp);

        time_temp = rtclock();
        dim3 dimGrid1(1,1,1);
        dim3 dimBlock1(1,1,1);
        kernel_call_device_bcs<<< dimGrid1, dimBlock1 >>>(dx, dy, d_D, BC, step, Nx, Ny, varnum, d_Dp, d_datarr, d_derivarr, dt);
        cudaThreadSynchronize();
        cutilCheckMsg("CUDA Kernel");
        bcs_time += (double)(rtclock() - time_temp);

        time_temp = rtclock();
        if (step % rpN == 0) {
            // Coalescing cudaMemcpy
            cutilSafeCall(cudaMemcpy(linear_datarr, d_datarr, (Nx+2)*(Ny+2)*varnum*2*sizeof(double), cudaMemcpyDeviceToHost));
            // copy host memory to device
            for (i = 0; i < (Nx+2); i++) {
                for (j = 0; j < (Ny+2); j++) {
                    for (k = 0; k < varnum; k++) {
                        for (int l = 0; l < 2; l++) {
                            datarr[i][j][k][l] = *(linear_datarr + i*(Ny+2)*varnum*2 + j*varnum*2 + k*2 + l);
                        }
                    }
                }
            }
            output();
            printf("%4.4e msec, Vm(%d,%d): %3.2f mV GPU\n", derivarr[0], mNx, mNy, datarr[mNx][mNy][0][step%2]);
        }
        mem_time += (double)(rtclock() - time_temp);

        step++;
        derivarr[0] += dt;
        deriv3darr[0][0][0] += dt; // update time (msec)
    }

    double gpu_end = rtclock();
    printf("total time is %.8lf\n", (double)(gpu_end - gpu_start));
    printf("Kernel stim time is %.8lf\n", stim_time);
    printf("Kernel cur time is %.8lf\n", cur_time);
    printf("Kernel gate time is %.8lf\n", gate_time);
    printf("Kernel bcs time is %.8lf\n", bcs_time);
    printf("memory copy time is %.8lf\n", mem_time);
    printf("GPU time is %.8lf\n", (double)(gpu_end - gpu_start) - mem_time);

I think OpenCL's timing is much stranger than CUDA's.

Seeing the OpenCL source would be much more helpful =)

Perhaps you missed a synchronization call in the OpenCL code...