
TransferOverlap not working on Firepro v4800 SDK 2.6 Linux 3.0

Question asked by thejascr on Sep 28, 2012
Latest reply on Oct 2, 2012 by thejascr

Hi,

 

I have an AMD Firepro v4800 and AMD APP SDK 2.6 on Linux kernel 3.0.

I am experimenting with the TransferOverlap sample from the APP SDK.

It is not working for me: I see the same running time with overlap on and off.

I have also run the APP Profiler, and it does not show any overlap either.

 

Here is the program output for both runs:

=======================================================================================

 

thejascr@amd-fusion:/opt/bk-AMDAPP/samples/opencl/bin/x86_64$ ./TransferOverlap -d 1 -x 1000 -k 100000 -I 3               // (with overlap)

Platform 0 : Advanced Micro Devices, Inc.

Platform found : Advanced Micro Devices, Inc.

Selected Platform Vendor : Advanced Micro Devices, Inc.

Device 0 : BeaverCreek Device ID is 0x2652660

Device 1 : Redwood Device ID is 0x2811450

Device 2 : Redwood Device ID is 0x2a573a0

Build:               DEBUG

GPU work items:        64

Buffer size:           1024

Timing loops:          50

Kernel loops:          100000

Wavefronts/SIMD:       7

memset/kernel overlap: yes

inputBuffer:           CL_MEM_READ_ONLYCL_MEM_ALLOC_HOST_PTR

 

 

AVERAGES (over loops3 - 49, use -l to show complete log)

------------------------------------------------------------------------------------------

Acquire and fill: inputBuffer1

      clWaitForEvents() + memset()  0.000004 s     0.27 GB/s

       clEnqueueUnmapMemObject()  0.154325 s     0.00 GB/s

  Loop time 0.309886 s

  Launch map: inputBuffer2

     clEnqueueMapBuffer(MAP_WRITE)  0.000025 s     0.04 GB/s

  Verify: resultBuffer2

     clEnqueueMapBuffer(MAP_WRITE)  0.000296 s     0.00 GB/s

                   CPU reduction  0.000001 s

                 verification ok

       clEnqueueUnmapMemObject()  0.000147 s     0.00 GB/s

  Launch GPU kernel: inputBuffer1

          clEnqueueNDRangeKernel()  0.000029 s

  Acquire and fill: inputBuffer2

      clWaitForEvents() + memset()  0.000004 s     0.27 GB/s

       clEnqueueUnmapMemObject()  0.154495 s     0.00 GB/s

  Launch map: inputBuffer1

     clEnqueueMapBuffer(MAP_WRITE)  0.000026 s     0.04 GB/s

  Verify: resultBuffer1

     clEnqueueMapBuffer(MAP_WRITE)  0.000301 s     0.00 GB/s

                   CPU reduction  0.000001 s

                 verification ok

       clEnqueueUnmapMemObject()  0.000145 s     0.00 GB/s

  Launch GPU kernel: inputBuffer2

          clEnqueueNDRangeKernel()  0.000028 s

 

Complete test time:15.5218 s

 

====================================================================================


thejascr@amd-fusion:/opt/bk-AMDAPP/samples/opencl/bin/x86_64$ ./TransferOverlap -d 1 -x 1000 -k 100000 -I 3 -n             // (without overlap)

Platform 0 : Advanced Micro Devices, Inc.

Platform found : Advanced Micro Devices, Inc.

Selected Platform Vendor : Advanced Micro Devices, Inc.

Device 0 : BeaverCreek Device ID is 0x1b20660

Device 1 : Redwood Device ID is 0x1bdca50

Device 2 : Redwood Device ID is 0x1bdbeb0

Build:               DEBUG

GPU work items:        64

Buffer size:           1024

Timing loops:          50

Kernel loops:          100000

Wavefronts/SIMD:       7

memset/kernel overlap: no

inputBuffer:           CL_MEM_READ_ONLYCL_MEM_ALLOC_HOST_PTR

 

AVERAGES (over loops3 - 49, use -l to show complete log)

--------------------------------------------------------------------------------------

Acquire and fill: inputBuffer1

      clWaitForEvents() + memset()  0.000004 s     0.27 GB/s

       clEnqueueUnmapMemObject()  0.000282 s     0.00 GB/s

  Loop time 0.310473 s

  Launch map: inputBuffer2

     clEnqueueMapBuffer(MAP_WRITE)  0.000022 s     0.05 GB/s

  Verify: resultBuffer2

     clEnqueueMapBuffer(MAP_WRITE)  0.000278 s     0.00 GB/s

                   CPU reduction  0.000001 s

                 verification ok

       clEnqueueUnmapMemObject()  0.000140 s     0.00 GB/s

  Launch GPU kernel: inputBuffer1

         clEnqueueNDRangeKernel()  0.154493 s

  Acquire and fill: inputBuffer2

      clWaitForEvents() + memset()  0.000004 s     0.25 GB/s

       clEnqueueUnmapMemObject()  0.000286 s     0.00 GB/s

  Launch map: inputBuffer1

     clEnqueueMapBuffer(MAP_WRITE)  0.000023 s     0.05 GB/s

  Verify: resultBuffer1

     clEnqueueMapBuffer(MAP_WRITE)  0.000284 s     0.00 GB/s

                   CPU reduction  0.000001 s

                 verification ok

       clEnqueueUnmapMemObject()  0.000140 s     0.00 GB/s

  Launch GPU kernel: inputBuffer2

         clEnqueueNDRangeKernel()  0.154432 s

    Complete test time:15.5523 s
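
To make the comparison concrete, here is a quick sanity check on the averages printed above. It is only a sketch: the timings are copied by hand from the two logs, and the dictionary keys and the `summarize` helper are names I made up for illustration, not part of the SDK sample. It adds up the time spent in the two calls that block (clEnqueueUnmapMemObject() on the input buffers and clEnqueueNDRangeKernel()) and compares it with the reported loop time.

```python
# Averages (seconds) copied from the two TransferOverlap runs above.
# Keys and helper names are hypothetical, chosen only for this check.
TIMING_LOOPS = 50  # "Timing loops: 50" in both logs

overlap = {
    "loop_time": 0.309886,
    "unmap_input": (0.154325, 0.154495),    # clEnqueueUnmapMemObject(), inputBuffer1/2
    "launch_kernel": (0.000029, 0.000028),  # clEnqueueNDRangeKernel(), inputBuffer1/2
    "reported_total": 15.5218,
}

no_overlap = {
    "loop_time": 0.310473,
    "unmap_input": (0.000282, 0.000286),
    "launch_kernel": (0.154493, 0.154432),
    "reported_total": 15.5523,
}


def summarize(run):
    """Sum the two blocking calls and relate them to the loop time."""
    blocking = sum(run["unmap_input"]) + sum(run["launch_kernel"])
    return {
        "blocking": blocking,
        "blocking_share": blocking / run["loop_time"],
        "predicted_total": run["loop_time"] * TIMING_LOOPS,
    }


with_overlap = summarize(overlap)
without_overlap = summarize(no_overlap)

for name, result in (("overlap", with_overlap), ("no overlap", without_overlap)):
    print(
        f"{name:>10}: blocking {result['blocking']:.6f} s, "
        f"share of loop {result['blocking_share']:.1%}, "
        f"loop time x {TIMING_LOOPS} = {result['predicted_total']:.2f} s"
    )
```

In both runs almost the whole loop is spent in two waits of about 0.154 s each. With overlap enabled the wait shows up in clEnqueueUnmapMemObject(), and with -n it shows up in clEnqueueNDRangeKernel(), but the loop time stays at about 0.31 s either way.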

