Hi all, I have a heart simulation program implemented in both CUDA and OpenCL.
I profiled the kernels and the time spent on memcpy, as in the attached code.
I did the same thing for the OpenCL code; however, the OpenCL profiling shows 99.99% of the time being spent on memory copy, while CUDA spends about 1% on memory copy.
Here is the profiling for OpenCL:
gpu time = total time - mem copy time
total time is 176.00000000
Kernel stim time is 0.00000000
Kernel cur time is 0.01428032
Kernel gate time is 0.01075721
Kernel bcs time is 0.01370287
memory copy time is 175.97290492
GPU time is 0.02709508
And here is the profiling for CUDA:
total time is 101.22078705
Kernel stim time is 0.00289774
Kernel cur time is 67.28576040
Kernel gate time is 20.97523451
Kernel bcs time is 10.90470386
memory copy time is 1.55556965
GPU time is 99.66521740
while (derivarr[0] <= tfinal && step <= Nsteps + 1) {
    // from (1 to Nx) instead of (0 to Nx+1): do not loop through ghost points
    // GPU kernel execution
    time_temp = rtclock();
    if (stimnum > 0)
        d_stimulate_kernel<<<dimGrid,dimBlock>>>(stimnum, d_datarr, d_stimarr, d_derivarr, varnum, step, Istimamp, Ny, stimint);
    cudaThreadSynchronize();
    stim_time += (double)(rtclock() - time_temp);

    time_temp = rtclock();
    if (blocktimenum > 0)
        d_blockonoff_kernel<<<dimGrid,dimBlock>>>(blocktimenum, d_derivarr, d_blocktimes, d_block, Nx, Ny);
    cudaThreadSynchronize();
    block_time += (double)(rtclock() - time_temp);

    time_temp = rtclock();
    d_brgates_kernel<<<dimGrid,dimBlock>>>(varnum, d_datarr, d_derivarr, d_constarr, step, Ny);
    cudaThreadSynchronize();
    gate_time += (double)(rtclock() - time_temp);

    time_temp = rtclock();
    d_brcurrents_kernel<<<dimGrid,dimBlock>>>(stimnum, d_datarr, d_derivarr, step, Istimamp, Ny, varnum, d_constarr, d_Afield, d_block, d_Dp, dt);
    cudaThreadSynchronize();
    cur_time += (double)(rtclock() - time_temp);

    time_temp = rtclock();
    dim3 dimGrid1(1,1,1);
    dim3 dimBlock1(1,1,1);
    kernel_call_device_bcs<<<dimGrid1, dimBlock1>>>(dx, dy, d_D, BC, step, Nx, Ny, varnum, d_Dp, d_datarr, d_derivarr, dt);
    cudaThreadSynchronize();
    cutilCheckMsg("CUDA Kernel");
    bcs_time += (double)(rtclock() - time_temp);

    time_temp = rtclock();
    if (step % rpN == 0) {
        // Coalescing cudaMemcpy: copy device memory back to host in one transfer
        cutilSafeCall(cudaMemcpy(linear_datarr, d_datarr, (Nx+2)*(Ny+2)*varnum*2*sizeof(double), cudaMemcpyDeviceToHost));
        for (i = 0; i < (Nx+2); i++) {
            for (j = 0; j < (Ny+2); j++) {
                for (k = 0; k < varnum; k++) {
                    for (int l = 0; l < 2; l++) {
                        datarr[i][j][k][l] = *(linear_datarr + i*(Ny+2)*varnum*2 + j*varnum*2 + k*2 + l);
                    }
                }
            }
        }
        output();
        printf("%4.4e msec, Vm(%d,%d): %3.2f mV GPU\n", derivarr[0], mNx, mNy, datarr[mNx][mNy][0][step%2]);
    }
    mem_time += (double)(rtclock() - time_temp);

    step++;
    derivarr[0] += dt;
    deriv3darr[0][0][0] += dt; // update time (msec)
}
double gpu_end = rtclock();
printf("total time is %.8lf\n", (double)(gpu_end - gpu_start));
printf("Kernel stim time is %.8lf\n", stim_time);
printf("Kernel cur time is %.8lf\n", cur_time);
printf("Kernel gate time is %.8lf\n", gate_time);
printf("Kernel bcs time is %.8lf\n", bcs_time);
printf("memory copy time is %.8lf\n", mem_time);
printf("GPU time is %.8lf\n", (double)(gpu_end - gpu_start) - mem_time);
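(Side note: an alternative to rtclock() plus cudaThreadSynchronize() is CUDA's event API, which timestamps on the device itself. A minimal sketch for one of the kernels above, with error checking omitted:)

// Sketch: timing one kernel with CUDA events instead of host clocks.
// Uses the same d_brgates_kernel launch as above.
cudaEvent_t ev_start, ev_stop;
cudaEventCreate(&ev_start);
cudaEventCreate(&ev_stop);

cudaEventRecord(ev_start, 0);
d_brgates_kernel<<<dimGrid,dimBlock>>>(varnum, d_datarr, d_derivarr, d_constarr, step, Ny);
cudaEventRecord(ev_stop, 0);
cudaEventSynchronize(ev_stop);      // wait until the stop event has been reached

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, ev_start, ev_stop);   // result in milliseconds
gate_time += elapsed_ms * 1.0e-3;   // accumulate in seconds, like rtclock()

cudaEventDestroy(ev_start);
cudaEventDestroy(ev_stop);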
I think OpenCL's timing is much stranger than CUDA's.
Seeing the OpenCL source would be much more helpful =)
Perhaps you missed synchronization in the OpenCL code...
Your timings must be incorrect. Apparently Kernel cur is nearly 471,200% faster in OpenCL than in CUDA. Unless you're emulating the CUDA code on an Amiga 500, I find that difficult to believe.
And why are you posting your Nvidia CUDA code on an AMD OpenCL forum... surely you should be posting your OpenCL code?
Make sure all your kernel calls are blocking (or call clFinish() before measuring timings), otherwise you won't be measuring execution time (which will massively bump up your calculated transfer time).
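Roughly like this (an untested sketch; the queue and kernel names are placeholders for yours):

// Sketch: bracket each launch with clFinish() so the host timer measures
// actual execution time, not just the time to enqueue the command.
time_temp = rtclock();
errcode = clEnqueueNDRangeKernel(clCommandQue, clKernel_brcurrents, 2, NULL,
                                 globalWorkSize, localWorkSize, 0, NULL, NULL);
clFinish(clCommandQue);             // block until the kernel has completed
cur_time += (double)(rtclock() - time_temp);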
Hi, the timing function I use is the same for CUDA and OpenCL.
I have a clFlush after each OpenCL kernel. Here is the code for one kernel; the other kernels are similar:
time_temp = rtclock();
errcode  = clSetKernelArg(clKernel_brcurrents, 0, sizeof(int), (void *)&stimnum);
errcode |= clSetKernelArg(clKernel_brcurrents, 1, sizeof(cl_mem), (void *)&d_datarr);
errcode |= clSetKernelArg(clKernel_brcurrents, 2, sizeof(cl_mem), (void *)&d_derivarr);
errcode |= clSetKernelArg(clKernel_brcurrents, 3, sizeof(int), (void *)&step);
errcode |= clSetKernelArg(clKernel_brcurrents, 4, sizeof(double), (void *)&Istimamp);
errcode |= clSetKernelArg(clKernel_brcurrents, 5, sizeof(int), (void *)&Ny);
errcode |= clSetKernelArg(clKernel_brcurrents, 6, sizeof(int), (void *)&varnum);
errcode |= clSetKernelArg(clKernel_brcurrents, 7, sizeof(cl_mem), (void *)&d_constarr);
errcode |= clSetKernelArg(clKernel_brcurrents, 8, sizeof(cl_mem), (void *)&d_Afield);
errcode |= clSetKernelArg(clKernel_brcurrents, 9, sizeof(cl_mem), (void *)&d_block);
errcode |= clSetKernelArg(clKernel_brcurrents, 10, sizeof(cl_mem), (void *)&d_Dp);
errcode |= clSetKernelArg(clKernel_brcurrents, 11, sizeof(double), (void *)&dt);
errcode |= clSetKernelArg(clKernel_brcurrents, 12, sizeof(cl_mem), (void *)&vm_gpu);
if (errcode != CL_SUCCESS)
    printf("Error in setting arguments for kernel brcurrents\n");

// Execute the OpenCL kernel
errcode = clEnqueueNDRangeKernel(clCommandQue, clKernel_brcurrents, 2, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
if (errcode != CL_SUCCESS)
    printf("Error in launching kernel\n");
errcode = clFlush(clCommandQue);
cur_time += (double)(rtclock() - time_temp);
And here is the timing function:
double rtclock()
{
    struct timezone Tzp;
    struct timeval Tp;
    int stat;
    stat = gettimeofday(&Tp, &Tzp);
    if (stat != 0)
        printf("Error return from gettimeofday: %d", stat);
    return (Tp.tv_sec + Tp.tv_usec * 1.0e-6);
}
According to the OpenCL spec: "clFlush only guarantees that all queued commands to command_queue get issued to the appropriate device. There is no guarantee that they will be complete after clFlush returns." ...
Which means your timing measurements yield unspecified results.
You should use clFinish instead (or clWaitForEvents).
Yes, as I expected, what adm271828 said is correct. You need to call clFinish() before each timer read, as you're only measuring the time to dispatch an execution command to the queue. Because you're probably performing a blocking read, that's why clEnqueueReadBuffer is taking so long: it has to wait for the execution to finish before it even starts. clFinish(), clWaitForEvents(), or passing an event_wait_list to clEnqueueNDRangeKernel() and making all reads/writes blocking will do the job.
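For reference, something along these lines using events and the profiling API (a rough sketch, assuming the command queue was created with the CL_QUEUE_PROFILING_ENABLE property):

// Sketch: per-kernel timing from device-side timestamps via event profiling.
cl_event ev;
cl_ulong t_start, t_end;

errcode = clEnqueueNDRangeKernel(clCommandQue, clKernel_brcurrents, 2, NULL,
                                 globalWorkSize, localWorkSize, 0, NULL, &ev);
clWaitForEvents(1, &ev);            // block until this kernel has finished

clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &t_start, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &t_end, NULL);
cur_time += (double)(t_end - t_start) * 1.0e-9;   // timestamps are in nanoseconds
clReleaseEvent(ev);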
Thanks, I replaced clFlush with clFinish, and now it is working:
total time is 187.09144616
Kernel stim time is 0.00000000
Kernel cur time is 32.88452005
Kernel gate time is 121.27778316
Kernel atomic time is 15.18993449
Kernel bcs time is 14.02507806
memory copy time is 2.76019478
GPU time is 184.23980522
Yes, they look better. Interestingly, it does show that Nvidia's OpenCL SDK is quite a bit behind its CUDA counterpart in terms of performance. I suppose that's to be expected, being less mature, but the difference is quite staggering. Even the memory copy times are very different, more than you could put down to the different effect of clFinish() on the timings.