
Not seeing high zero-copy throughput on the APU

Question asked by thejascr on Mar 24, 2012
Latest reply on Mar 24, 2012 by thejascr

Hi,

 

I have an AMD A8-3850 Fusion APU. I have installed the AMD APP SDK 2.6 and am trying out the BufferBandwidth application to check the maximum bandwidth that I can get on the discrete and the on-die GPU.

I have no issues with the discrete GPU. But for the on-die GPU, this blog post

http://blogs.amd.com/developer/2011/08/01/cpu-to-gpu-data-transfers-exceed-15gbs-using-apu-zero-copy-path/

as well as the AMD APP programmer's guide describe a zero-copy path that can reach up to 15 GB/s.

 

First, I would like to clarify whether the API referred to in the blog and the APP guide is "clMapBuffer" or really "clEnqueueMapBuffer". There is no API called clMapBuffer in the library, so I am assuming it is clEnqueueMapBuffer.

If that is true, I am running the BufferBandwidth application that came with the SDK with various options to see if I can get up to 15 GB/s.

So far I have only reached up to 8 GB/s.

As suggested in a comment on the blog post above, I also tried -nwk 15, but still see no improvement.

Here is my command line for a buffer size of 128 MB:

 

./BufferBandwidth -t 2 -d 0 -nwk 15 -nl 20 -nr 1 -nk 20 -nb 134217728 -nw 7 -s 2 -if 0 -of 1 -cf 5 -cf 2

 

And here is the output for the integrated GPU:

 

=======================================================

Device 0            BeaverCreek

Build:               DEBUG

GPU work items:      8192

Buffer size:         134217728

CPU workers:         1

Timing loops:        20

Repeats:             1

Kernel loops:        20

inputBuffer:         CL_MEM_READ_ONLY

outputBuffer:        CL_MEM_WRITE_ONLY

copyBuffer:          CL_MEM_READ_WRITE CL_MEM_ALLOC_HOST_PTR

Host baseline (single thread, naive):

Timer resolution  256.225 ns

Page fault  2047.7

CPU read   6.16153 GB/s

memcpy()   3.31839 GB/s

memset(,1,)   9.00288 GB/s

memset(,0,)   8.81274 GB/s

AVERAGES (over loops 2 - 19, use -l for complete log)

--------

  1. Host mapped write to copyBuffer

      clEnqueueMapBuffer(WRITE):            0.000016 s [  8285.50 GB/s ]

      memset():  0.014970 s       8.97 GB/s

      clEnqueueUnmapMemObject():  0.000084 s [  1604.42 GB/s ]

  2. CL copy of copyBuffer to inputBuffer

      clEnqueueCopyBuffer:  0.042168 s       3.18 GB/s

  3. GPU kernel read of inputBuffer

      clEnqueueNDRangeKernel():  0.471207 s       5.70 GB/s

     verification ok

  4. GPU kernel write to outputBuffer

      clEnqueueNDRangeKernel():  0.665229 s       4.04 GB/s

  5. CL copy of outputBuffer to copyBuffer

      clEnqueueCopyBuffer:  0.041343 s       3.25 GB/s

  6. Host mapped read of copyBuffer

      clEnqueueMapBuffer(READ):  0.000017 s [  7710.12 GB/s ]

      CPU read:  0.023021 s       5.83 GB/s

      verification ok

      clEnqueueUnmapMemObject():  0.000089 s [  1506.57 GB/s ]

  Passed!
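
To make the figures in the log easier to read, here is a minimal sketch that recomputes the reported GB/s values from the averaged timings. It rests on two assumptions inferred from the numbers rather than from the SDK source: that 1 GB is counted as 10^9 bytes, and that each kernel timing covers all 20 kernel loops (the -nk value). The helper name gb_per_s and the labels are made up for illustration.

```python
BUFFER_BYTES = 134217728  # -nb value from the command line (128 MiB)
KERNEL_LOOPS = 20         # -nk value from the command line


def gb_per_s(num_bytes, seconds, passes=1):
    """Bandwidth in GB/s, assuming 1 GB = 1e9 bytes (an assumption)."""
    return num_bytes * passes / seconds / 1e9


# (label, averaged time in seconds copied from the log, passes over the buffer)
timings = [
    ("memset() into mapped copyBuffer", 0.014970, 1),
    ("clEnqueueCopyBuffer to inputBuffer", 0.042168, 1),
    ("GPU kernel read of inputBuffer", 0.471207, KERNEL_LOOPS),
    ("GPU kernel write to outputBuffer", 0.665229, KERNEL_LOOPS),
    ("clEnqueueCopyBuffer to copyBuffer", 0.041343, 1),
    ("CPU read of mapped copyBuffer", 0.023021, 1),
]

for label, seconds, passes in timings:
    rate = gb_per_s(BUFFER_BYTES, seconds, passes)
    print(f"{label:36s} {rate:5.2f} GB/s")
```

Under those assumptions the recomputed values match the unbracketed figures in the log. The bracketed figures for clEnqueueMapBuffer and clEnqueueUnmapMemObject appear to be the buffer size divided by the call time, so they are probably not sustained transfer rates. The fastest path that actually moves data in this run is the memset() into the mapped buffer at 8.97 GB/s, close to the host baseline memset of about 9 GB/s.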

 

Is there anything else that I need to enable in order to get higher bandwidths?

 

Thanks

-Thejas
