
    performance difference between CUDA and OpenCL

    shingoxlf

      Hi all, I have a heart simulation program implemented in both CUDA and OpenCL.

      I profiled the kernels and the time spent on memcpy, as in the attached code.

      I did the same for the OpenCL code; however, the OpenCL profiling shows that 99.99% of the time is spent on memory copy, while CUDA spends only about 1% of the time on memory copy.

       

      Here is the profiling for OpenCL:

      gpu time = total time - mem copy time

       

      total         time is 176.00000000

      Kernel stim   time is 0.00000000

      Kernel cur    time is 0.01428032

      Kernel gate   time is 0.01075721

      Kernel bcs    time is 0.01370287

      memory copy   time is 175.97290492

      GPU           time is 0.02709508


      and here is the profiling for CUDA:

       

      total         time is 101.22078705

      Kernel stim   time is 0.00289774

      Kernel cur    time is 67.28576040

      Kernel gate   time is 20.97523451

      Kernel bcs    time is 10.90470386

      memory copy   time is 1.55556965

      GPU           time is 99.66521740

      Can anybody tell me what caused this difference?

       

      while (derivarr[0] <= tfinal && step <= Nsteps + 1) {
          // loop from (1 to Nx) instead of (0 to Nx+1): do not loop through ghost points

          // GPU kernel execution
          time_temp = rtclock();
          if (stimnum > 0)
              d_stimulate_kernel<<<dimGrid,dimBlock>>>(stimnum, d_datarr, d_stimarr, d_derivarr,
                                                       varnum, step, Istimamp, Ny, stimint);
          cudaThreadSynchronize();
          stim_time += (double)(rtclock() - time_temp);

          time_temp = rtclock();
          if (blocktimenum > 0)
              d_blockonoff_kernel<<<dimGrid,dimBlock>>>(blocktimenum, d_derivarr, d_blocktimes,
                                                        d_block, Nx, Ny);
          cudaThreadSynchronize();
          block_time += (double)(rtclock() - time_temp);

          time_temp = rtclock();
          d_brgates_kernel<<<dimGrid,dimBlock>>>(varnum, d_datarr, d_derivarr, d_constarr, step, Ny);
          cudaThreadSynchronize();
          gate_time += (double)(rtclock() - time_temp);

          time_temp = rtclock();
          d_brcurrents_kernel<<<dimGrid,dimBlock>>>(stimnum, d_datarr, d_derivarr, step, Istimamp,
                                                    Ny, varnum, d_constarr, d_Afield, d_block,
                                                    d_Dp, dt);
          cudaThreadSynchronize();
          cur_time += (double)(rtclock() - time_temp);

          time_temp = rtclock();
          dim3 dimGrid1(1,1,1);
          dim3 dimBlock1(1,1,1);
          kernel_call_device_bcs<<<dimGrid1, dimBlock1>>>(dx, dy, d_D, BC, step, Nx, Ny, varnum,
                                                          d_Dp, d_datarr, d_derivarr, dt);
          cudaThreadSynchronize();
          cutilCheckMsg("CUDA Kernel");
          bcs_time += (double)(rtclock() - time_temp);

          time_temp = rtclock();
          if (step % rpN == 0) {
              // Coalescing cudaMemcpy: copy device memory back to the host in one transfer
              cutilSafeCall(cudaMemcpy(linear_datarr, d_datarr,
                                       (Nx+2)*(Ny+2)*varnum*2*sizeof(double),
                                       cudaMemcpyDeviceToHost));
              for (i = 0; i < (Nx+2); i++) {
                  for (j = 0; j < (Ny+2); j++) {
                      for (k = 0; k < varnum; k++) {
                          for (int l = 0; l < 2; l++) {
                              datarr[i][j][k][l] = *(linear_datarr + i*(Ny+2)*varnum*2
                                                                   + j*varnum*2 + k*2 + l);
                          }
                      }
                  }
              }
              output();
              printf("%4.4e msec, Vm(%d,%d): %3.2f mV GPU\n",
                     derivarr[0], mNx, mNy, datarr[mNx][mNy][0][step%2]);
          }
          mem_time += (double)(rtclock() - time_temp);

          step++;
          derivarr[0] += dt;
          deriv3darr[0][0][0] += dt;  // update time (msec)
      }

      double gpu_end = rtclock();
      printf("total         time is %.8lf\n", (double)(gpu_end - gpu_start));
      printf("Kernel stim   time is %.8lf\n", stim_time);
      printf("Kernel cur    time is %.8lf\n", cur_time);
      printf("Kernel gate   time is %.8lf\n", gate_time);
      printf("Kernel bcs    time is %.8lf\n", bcs_time);
      printf("memory copy   time is %.8lf\n", mem_time);
      printf("GPU           time is %.8lf\n", (double)(gpu_end - gpu_start) - mem_time);

        • performance difference between CUDA and OpenCL
          stgatilov

          I think OpenCL's timing is much stranger than CUDA's.

          Seeing the OpenCL source would be much more helpful. =)

          Perhaps you missed a synchronization in the OpenCL code...

            • performance difference between CUDA and OpenCL
              antzrhere

              Your timings must be incorrect. Apparently Kernel cur is nearly 471200% faster in OpenCL than in CUDA. Unless you're emulating the CUDA code on an Amiga 500, I find that difficult to believe.

                • performance difference between CUDA and OpenCL
                  antzrhere

                  And why are you posting your Nvidia CUDA code on an AMD OpenCL forum... surely you should be posting your OpenCL code?

                  Make sure all your kernel calls are blocking (or call clFinish() before measuring timings); otherwise you won't be measuring execution time (which will massively inflate your calculated transfer time).
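
                  In outline, the pattern looks like this (a rough sketch with generic names, not your actual code):

                      double t0 = rtclock();
                      err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                                   globalWorkSize, localWorkSize, 0, NULL, NULL);
                      clFinish(queue);                  // block until the kernel has actually completed
                      kernel_time += rtclock() - t0;    // delta now covers execution, not just the enqueue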

                    • performance difference between CUDA and OpenCL
                      shingoxlf

                      Hi, the timing function I used is the same in CUDA and OpenCL.

                      I call clFlush after each OpenCL kernel. Here is the code for one kernel; the other kernels are similar:

                       

                      time_temp = rtclock();
                      errcode  = clSetKernelArg(clKernel_brcurrents,  0, sizeof(int),    (void *)&stimnum);
                      errcode |= clSetKernelArg(clKernel_brcurrents,  1, sizeof(cl_mem), (void *)&d_datarr);
                      errcode |= clSetKernelArg(clKernel_brcurrents,  2, sizeof(cl_mem), (void *)&d_derivarr);
                      errcode |= clSetKernelArg(clKernel_brcurrents,  3, sizeof(int),    (void *)&step);
                      errcode |= clSetKernelArg(clKernel_brcurrents,  4, sizeof(double), (void *)&Istimamp);
                      errcode |= clSetKernelArg(clKernel_brcurrents,  5, sizeof(int),    (void *)&Ny);
                      errcode |= clSetKernelArg(clKernel_brcurrents,  6, sizeof(int),    (void *)&varnum);
                      errcode |= clSetKernelArg(clKernel_brcurrents,  7, sizeof(cl_mem), (void *)&d_constarr);
                      errcode |= clSetKernelArg(clKernel_brcurrents,  8, sizeof(cl_mem), (void *)&d_Afield);
                      errcode |= clSetKernelArg(clKernel_brcurrents,  9, sizeof(cl_mem), (void *)&d_block);
                      errcode |= clSetKernelArg(clKernel_brcurrents, 10, sizeof(cl_mem), (void *)&d_Dp);
                      errcode |= clSetKernelArg(clKernel_brcurrents, 11, sizeof(double), (void *)&dt);
                      errcode |= clSetKernelArg(clKernel_brcurrents, 12, sizeof(cl_mem), (void *)&vm_gpu);
                      if (errcode != CL_SUCCESS)
                          printf("Error in setting arguments for kernel brcurrents\n");

                      // Execute the OpenCL kernel
                      errcode = clEnqueueNDRangeKernel(clCommandQue, clKernel_brcurrents, 2, NULL,
                                                       globalWorkSize, localWorkSize, 0, NULL, NULL);
                      if (errcode != CL_SUCCESS)
                          printf("Error in launching kernel\n");
                      errcode = clFlush(clCommandQue);
                      cur_time += (double)(rtclock() - time_temp);

                        • performance difference between CUDA and OpenCL
                          shingoxlf

                          And here is the timing function:

                          double rtclock()
                          {
                              struct timezone Tzp;
                              struct timeval  Tp;
                              int stat;

                              stat = gettimeofday(&Tp, &Tzp);
                              if (stat != 0)
                                  printf("Error return from gettimeofday: %d", stat);
                              return (Tp.tv_sec + Tp.tv_usec * 1.0e-6);
                          }
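
                          so each measurement just brackets the work with two calls, like this:

                              double t0 = rtclock();
                              /* ... work being timed ... */
                              double elapsed = rtclock() - t0;   /* wall-clock seconds, measured on the host */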

                            • performance difference between CUDA and OpenCL
                              adm271828

                              According to OpenCL "clFlush only guarantees that all queued commands to command_queue get issued to the appropriate device. There is no guarantee that they will be complete after clFlush returns." ...

                              Which means your timing measurement yields an unspecified result.

                              You should use clFinish instead (or clWaitForEvents).
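
                              Applied to the snippet you posted, it's a one-line change (sketch):

                                  errcode = clEnqueueNDRangeKernel(clCommandQue, clKernel_brcurrents, 2, NULL,
                                                                   globalWorkSize, localWorkSize, 0, NULL, NULL);
                                  clFinish(clCommandQue);   /* returns only once all queued commands have completed */
                                  cur_time += (double)(rtclock() - time_temp);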

                                • performance difference between CUDA and OpenCL
                                  antzrhere

                                  Yes, as I expected, what adm271828 said is correct. You need to call clFinish() before each timer read; otherwise you're only measuring the time to dispatch the command to the queue. And because you're probably performing a blocking read, that's why clEnqueueReadBuffer is taking so long - it has to wait for the kernel execution to finish before the copy even starts. clFinish(), clWaitForEvents(), or passing an event_wait_list to clEnqueueNDRangeKernel() and making all reads/writes blocking will do the job.
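
                                  Alternatively, you can let the runtime time the kernel itself with event profiling. A sketch, assuming your command queue was created with CL_QUEUE_PROFILING_ENABLE:

                                      cl_event ev;
                                      cl_ulong t_start, t_end;
                                      errcode = clEnqueueNDRangeKernel(clCommandQue, clKernel_brcurrents, 2, NULL,
                                                                       globalWorkSize, localWorkSize, 0, NULL, &ev);
                                      clWaitForEvents(1, &ev);   /* wait for the kernel to complete */
                                      clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &t_start, NULL);
                                      clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof(cl_ulong), &t_end,   NULL);
                                      cur_time += (double)(t_end - t_start) * 1.0e-9;   /* device time, ns -> s */
                                      clReleaseEvent(ev);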

                                  • performance difference between CUDA and OpenCL
                                    shingoxlf

                                    Thanks, I replaced clFlush with clFinish, and now it is working:

                                     

                                    total         time is 187.09144616

                                    Kernel stim   time is 0.00000000

                                    Kernel cur    time is 32.88452005

                                    Kernel gate   time is 121.27778316

                                    Kernel atomic time is 15.18993449

                                    Kernel bcs    time is 14.02507806

                                    memory copy   time is 2.76019478

                                    GPU           time is 184.23980522



                                      • performance difference between CUDA and OpenCL
                                        antzrhere

                                        Yes, those look better. Interestingly, it does show that Nvidia's OpenCL SDK is quite a bit behind its CUDA counterpart in terms of performance. I suppose that's to be expected given it's less mature, but the difference is quite staggering. Even the memory copy times are very different, more than you could put down to the different behaviour of clFinish() on the timings.