1 Reply Latest reply on Mar 18, 2010 7:52 PM by Raistmer

    Inadequate times for memory transfers

    Raistmer
      Why are they so big, and why do they differ so much?

      I transfer data from a pinned buffer in host memory to GPU memory and measure each stage of this transfer: mapping the buffer into host memory, unmapping it, and, finally, copying the updated buffer to GPU memory.

      These are the times I received (in ns); the mapped/copied region size is 4*2*32k*7 = 1792 kB:
      DataMap_ns: total=2.869e+008, N=4688, <>=6.121e+004, min=524 max=1.8e+007
      DataUnmap_ns: total=1.591e+011, N=4688, <>=3.393e+007, min=3.356e+007 max=3.557e+007
      DataCopy_ns: total=1.956e+008, N=4688, <>=4.172e+004, min=2.933e+004 max=6.052e+005

      The mapping is especially interesting: it varies from 524 ns to 18 ms!
      Why???
      And the data copy itself takes less time on average than mapping/unmapping the buffer! Something is wrong here....

      [GPU as usual, HD4870 + CPU Q9450]
      Code for these sections:


      Mapping:

      {
          Timings<T_DataMap> counter;
          cl_event ev;
          data_range = (ap_complex*)clEnqueueMapBuffer(cq, cpu_pinned_buf, CL_TRUE,
                           CL_MAP_READ | CL_MAP_WRITE, 0,
                           sizeof(ap_complex) * state.fft_len * DATA_CHUNK_UNROLL,
                           0, NULL, &ev, &err);
          if(err != CL_SUCCESS) fprintf(stderr, "ERROR: clEnqueueMapBuffer (data_range): %d\n", err);
      #if 1
          if(ev){
              cl_ulong start, end;
              err  = clWaitForEvents(1, &ev);
              err |= clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
              err |= clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof(cl_ulong), &end,   NULL);
              Counters<T_DataMap_ns, cl_ulong>::update(end - start);
              err |= clReleaseEvent(ev); ev = NULL;
              if(err != CL_SUCCESS) fprintf(stderr, "ERROR: DataMap event: %d\n", err);
          }
      #endif
      }

      Unmapping:

      {
          Timings<T_DataUnmap> counter1;
          cl_event ev;
          err = clEnqueueUnmapMemObject(cq, cpu_pinned_buf, data_range, 0, NULL, &ev);
          data_range = NULL;
      #if 1
          if(ev){
              cl_ulong start, end;
              err  = clWaitForEvents(1, &ev);
              err |= clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
              err |= clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof(cl_ulong), &end,   NULL);
              Counters<T_DataUnmap_ns, cl_ulong>::update(end - start);
              err |= clReleaseEvent(ev); ev = NULL;
              if(err != CL_SUCCESS) fprintf(stderr, "ERROR: DataUnmap event: %d\n", err);
          }
      #endif
      }

      Copying:

      {
          Timings<T_DataCopy> counter;
          cl_event ev;
          err = clEnqueueCopyBuffer(cq, cpu_pinned_buf, gpu_data, 0, 0,
                                    sizeof(ap_complex) * fft_len * DATA_CHUNK_UNROLL,
                                    0, NULL, &ev);
          if(err != CL_SUCCESS) fprintf(stderr, "ERROR: CopyBuffer(gpu_data): %d\n", err);
      #if 1
          if(ev){
              cl_ulong start, end;
              err  = clWaitForEvents(1, &ev);
              err |= clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
              err |= clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof(cl_ulong), &end,   NULL);
              Counters<T_DataCopy_ns, cl_ulong>::update(end - start);
              err |= clReleaseEvent(ev); ev = NULL;
              if(err != CL_SUCCESS) fprintf(stderr, "ERROR: DataCopy event: %d\n", err);
          }
      #endif
      }

        • Inadequate times for memory transfers
          Raistmer
          I modified the code to use the same pre-allocated host memory object, but instead of the map/update/unmap/copy sequence it now does update/write (the mapping is done once, in the startup code).

          The timings differ very much.

          class T_DataMap: total=6.93e+011, N=4688, <>=1.48e+008, min=4.83e+007, max=1.92e+008
          class T_DataMap_ns: total=3.325e+008, N=4688, <>=7.093e+004, min=524 max=2.727e+007
          class T_DataUnmap: total=4.36e+011, N=4688, <>=9.30e+007, min=8.93e+007, max=1.23e+008
          class T_DataUnmap_ns: total=1.622e+011, N=4688, <>=3.46e+007, min=3.336e+007 max=3.717e+007
          class T_DataCopy: total=6.33e+010, N=4688, <>=1.35e+007, min=6.26e+006, max=3.36e+007
          class T_DataCopy_ns: total=4.607e+008, N=4688, <>=9.828e+004, min=3.324e+004 max=1.157e+006
          vs
          class T_DataWrite: total=6.91e+010, N=4688, <>=1.47e+007, min=8.68e+006, max=3.88e+007
          class T_DataWrite_ns: total=1.084e+010, N=4688, <>=2.312e+006, min=2.249e+006 max=8.224e+006

          The *_ns values are provided by OpenCL events, in ns; the other counters are in CPU (Q9450 @ 2.66 GHz) ticks based on the RDTSC instruction.

          (Also, the application running time shortened from
          1192.575 secs elapsed
          to
          710.981 secs elapsed
          on the same test load.)
          It's clear, IMO, that the mapping/unmapping path currently used by the runtime is very inefficient.

          (BTW, the same app built for NVIDIA cards shows much better timings, even when run on a less powerful GPU, at least going by ATI's claims. Time to improve the SDK!)