
shingoxlf
Journeyman III

performance difference between CUDA and OpenCL

Hi all, I have a heart simulation program implemented in both CUDA and OpenCL.

I profiled the kernels and the time spent on memcpy, as in the attached code.

I did the same thing for the OpenCL code; however, the OpenCL profiling shows 99.99% of the time being spent on memory copy, while the CUDA version spends about 1% on memory copy.

 

Here is the profiling for OpenCL:

gpu time = total time - mem copy time

 

total         time is 176.00000000

Kernel stim   time is 0.00000000

Kernel cur    time is 0.01428032

Kernel gate   time is 0.01075721

Kernel bcs    time is 0.01370287

memory copy   time is 175.97290492

GPU           time is 0.02709508


And here is the profiling for CUDA:

 

total         time is 101.22078705

Kernel stim   time is 0.00289774

Kernel cur    time is 67.28576040

Kernel gate   time is 20.97523451

Kernel bcs    time is 10.90470386

memory copy   time is 1.55556965

GPU           time is 99.66521740

Can anybody tell me what caused this difference?

 

while (derivarr[0] <= tfinal && step <= Nsteps + 1) {
    // from (1 to Nx) instead of (0 to Nx+1): do not loop through ghost points
    // GPU kernel execution, each kernel timed separately
    time_temp = rtclock();
    if (stimnum > 0)
        d_stimulate_kernel<<<dimGrid,dimBlock>>>(stimnum, d_datarr, d_stimarr, d_derivarr, varnum, step, Istimamp, Ny, stimint);
    cudaThreadSynchronize();
    stim_time += (double)(rtclock() - time_temp);

    time_temp = rtclock();
    if (blocktimenum > 0)
        d_blockonoff_kernel<<<dimGrid,dimBlock>>>(blocktimenum, d_derivarr, d_blocktimes, d_block, Nx, Ny);
    cudaThreadSynchronize();
    block_time += (double)(rtclock() - time_temp);

    time_temp = rtclock();
    d_brgates_kernel<<<dimGrid,dimBlock>>>(varnum, d_datarr, d_derivarr, d_constarr, step, Ny);
    cudaThreadSynchronize();
    gate_time += (double)(rtclock() - time_temp);

    time_temp = rtclock();
    d_brcurrents_kernel<<<dimGrid,dimBlock>>>(stimnum, d_datarr, d_derivarr, step, Istimamp, Ny, varnum, d_constarr, d_Afield, d_block, d_Dp, dt);
    cudaThreadSynchronize();
    cur_time += (double)(rtclock() - time_temp);

    time_temp = rtclock();
    dim3 dimGrid1(1,1,1);
    dim3 dimBlock1(1,1,1);
    kernel_call_device_bcs<<<dimGrid1, dimBlock1>>>(dx, dy, d_D, BC, step, Nx, Ny, varnum, d_Dp, d_datarr, d_derivarr, dt);
    cudaThreadSynchronize();
    cutilCheckMsg("CUDA Kernel");
    bcs_time += (double)(rtclock() - time_temp);

    time_temp = rtclock();
    if (step % rpN == 0) {
        // Coalescing cudaMemcpy: copy device memory back to the host in one transfer
        cutilSafeCall(cudaMemcpy(linear_datarr, d_datarr,
                                 (Nx+2)*(Ny+2)*varnum*2*sizeof(double),
                                 cudaMemcpyDeviceToHost));
        // Unpack the linear buffer into the 4D host array
        for (i = 0; i < (Nx+2); i++) {
            for (j = 0; j < (Ny+2); j++) {
                for (k = 0; k < varnum; k++) {
                    for (int l = 0; l < 2; l++) {
                        datarr[i][j][k][l] = *(linear_datarr + i*(Ny+2)*varnum*2 + j*varnum*2 + k*2 + l);
                    }
                }
            }
        }
        output();
        printf("%4.4e msec, Vm(%d,%d): %3.2f mV GPU\n", derivarr[0], mNx, mNy, datarr[mNx][mNy][0][step%2]);
    }
    mem_time += (double)(rtclock() - time_temp);

    step++;
    derivarr[0] += dt;
    deriv3darr[0][0][0] += dt; // update time (msec)
}

double gpu_end = rtclock();
printf("total         time is %.8lf\n", (double)(gpu_end - gpu_start));
printf("Kernel stim   time is %.8lf\n", stim_time);
printf("Kernel cur    time is %.8lf\n", cur_time);
printf("Kernel gate   time is %.8lf\n", gate_time);
printf("Kernel bcs    time is %.8lf\n", bcs_time);
printf("memory copy   time is %.8lf\n", mem_time);
printf("GPU           time is %.8lf\n", (double)(gpu_end - gpu_start) - mem_time);

stgatilov
Journeyman III

I think the OpenCL timings look much stranger than the CUDA ones.

Seeing the OpenCL source would be much more helpful =)

Perhaps you missed a synchronization in the OpenCL code...


Your timings must be incorrect. Apparently Kernel cur is nearly 471200% faster in OpenCL than in CUDA. Unless you're emulating the CUDA code on an Amiga 500, I find that difficult to believe.


And why are you posting your Nvidia CUDA code on an AMD OpenCL forum... surely you should be posting your OpenCL code?

Make sure all your kernel calls are blocking (or call clFinish() before measuring timings); otherwise you won't be measuring execution time, which will massively bump up your calculated transfer time.
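For instance, the host-side timing pattern would need to look something like this. It's just a sketch with made-up names (queue, kernel, global_ws, local_ws); the clFinish() before the second timer read is the important part:

    // Sketch only: queue, kernel, global_ws, local_ws, errcode, time_temp,
    // cur_time and rtclock() stand in for whatever your code actually uses.
    time_temp = rtclock();
    errcode = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                     global_ws, local_ws, 0, NULL, NULL);
    if (errcode != CL_SUCCESS) printf("Error in launching kernel\n");
    clFinish(queue);   // blocks until the kernel has actually finished
    cur_time += (double)(rtclock() - time_temp);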


Hi, the timing function I used is the same for CUDA and OpenCL.

 

I have a clFlush after each OpenCL kernel, and here is the code for one kernel; the other kernels are similar:

 

time_temp = rtclock();

errcode  = clSetKernelArg(clKernel_brcurrents,  0, sizeof(int),    (void *)&stimnum);
errcode |= clSetKernelArg(clKernel_brcurrents,  1, sizeof(cl_mem), (void *)&d_datarr);
errcode |= clSetKernelArg(clKernel_brcurrents,  2, sizeof(cl_mem), (void *)&d_derivarr);
errcode |= clSetKernelArg(clKernel_brcurrents,  3, sizeof(int),    (void *)&step);
errcode |= clSetKernelArg(clKernel_brcurrents,  4, sizeof(double), (void *)&Istimamp);
errcode |= clSetKernelArg(clKernel_brcurrents,  5, sizeof(int),    (void *)&Ny);
errcode |= clSetKernelArg(clKernel_brcurrents,  6, sizeof(int),    (void *)&varnum);
errcode |= clSetKernelArg(clKernel_brcurrents,  7, sizeof(cl_mem), (void *)&d_constarr);
errcode |= clSetKernelArg(clKernel_brcurrents,  8, sizeof(cl_mem), (void *)&d_Afield);
errcode |= clSetKernelArg(clKernel_brcurrents,  9, sizeof(cl_mem), (void *)&d_block);
errcode |= clSetKernelArg(clKernel_brcurrents, 10, sizeof(cl_mem), (void *)&d_Dp);
errcode |= clSetKernelArg(clKernel_brcurrents, 11, sizeof(double), (void *)&dt);
errcode |= clSetKernelArg(clKernel_brcurrents, 12, sizeof(cl_mem), (void *)&vm_gpu);
if (errcode != CL_SUCCESS) printf("Error in setting arguments for kernel brcurrents\n");

// Execute the OpenCL kernel
errcode = clEnqueueNDRangeKernel(clCommandQue, clKernel_brcurrents, 2, NULL,
                                 globalWorkSize, localWorkSize, 0, NULL, NULL);
if (errcode != CL_SUCCESS) printf("Error in launching kernel\n");
errcode = clFlush(clCommandQue);

cur_time += (double)(rtclock() - time_temp);


And here is the timing function:

double rtclock()
{
    struct timezone Tzp;
    struct timeval  Tp;
    int stat;

    stat = gettimeofday(&Tp, &Tzp);
    if (stat != 0) printf("Error return from gettimeofday: %d", stat);
    return (Tp.tv_sec + Tp.tv_usec * 1.0e-6);
}


According to the OpenCL specification, "clFlush only guarantees that all queued commands to command_queue get issued to the appropriate device. There is no guarantee that they will be complete after clFlush returns."

Which means your timing measurement yields an unspecified result.

You should use clFinish instead (or clWaitForEvents).
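With events you can also pull the timing out of OpenCL itself instead of gettimeofday. A rough sketch against the code you posted; it assumes clCommandQue was created with CL_QUEUE_PROFILING_ENABLE, which your snippet doesn't show:

    // Sketch only: per-kernel timing via an OpenCL event.
    cl_event ev;
    cl_ulong t_start, t_end;

    errcode = clEnqueueNDRangeKernel(clCommandQue, clKernel_brcurrents, 2, NULL,
                                     globalWorkSize, localWorkSize, 0, NULL, &ev);
    clWaitForEvents(1, &ev);              // block until this kernel has finished

    // Device-side timestamps, in nanoseconds
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(cl_ulong), &t_start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(cl_ulong), &t_end, NULL);
    cur_time += (double)(t_end - t_start) * 1.0e-9;   // seconds
    clReleaseEvent(ev);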


Yes, as I expected, what adm271828 said is correct. You need to call clFinish() before each timer read, as you're currently only measuring the time to dispatch an execution command to the queue. And because you're probably performing a blocking read, that's why clEnqueueReadBuffer is taking so long: it has to wait for the kernels to finish before the copy even starts. clFinish(), clWaitForEvents(), or passing an event_wait_list to clEnqueueNDRangeKernel() and making all reads/writes blocking will do the job.
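For the transfer itself, a blocking read then measures just the copy once the queue has been drained. Sketch only; the buffer size and host pointer are lifted from your CUDA code, so adjust them to whatever your OpenCL read actually uses:

    // Sketch only: CL_TRUE makes the read blocking, and because clFinish()
    // has already drained the queue, the measured time is only the transfer.
    time_temp = rtclock();
    errcode = clEnqueueReadBuffer(clCommandQue, d_datarr, CL_TRUE, 0,
                                  (Nx+2)*(Ny+2)*varnum*2*sizeof(double),
                                  linear_datarr, 0, NULL, NULL);
    if (errcode != CL_SUCCESS) printf("Error in reading buffer\n");
    mem_time += (double)(rtclock() - time_temp);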


Thanks, I replaced clFlush with clFinish, and now it is working:

total         time is 187.09144616

Kernel stim   time is 0.00000000

Kernel cur    time is 32.88452005

Kernel gate   time is 121.27778316

Kernel atomic time is 15.18993449

Kernel bcs    time is 14.02507806

memory copy   time is 2.76019478

GPU           time is 184.23980522


Yes, those look better. Interestingly, it does show that Nvidia's OpenCL SDK is quite a bit behind its CUDA counterpart in terms of performance. I suppose that's to be expected given it's less mature, but the difference is quite staggering. Even the memory copy times are very different, more than you could put down to the different effect of clFinish() on the timings.
