    Not seeing high zero-copy throughput on the APU




      I have an AMD A8-3850 Fusion APU. I have installed the AMD APP SDK 2.6 and am trying out the BufferBandwidth application to check the maximum

      bandwidths that I can get on the discrete and the on-die GPU.

      I have no issues with the discrete GPU. For the on-die GPU, however, this blog


      as well as the AMD APP programming guide describe a zero-copy path that can reach up to 15 GB/s.


      First, I would like to clarify whether the API that the blog and the APP guide call "clMapBuffer" is really "clEnqueueMapBuffer".

      There is no API called clMapBuffer in the library, so I am assuming it is clEnqueueMapBuffer.

      If that is true, I have been running the BufferBandwidth application that came with the SDK with various options to see if I can reach 15 GB/s.

      So far I have only reached about 8 GB/s.

      As suggested in a comment on the above blog, I even tried -nwk 15, but I still do not see any improvement.

      Here is my command line for a buffer size of 128 MB:


      ./BufferBandwidth -t 2 -d 0 -nwk 15 -nl 20 -nr 1 -nk 20 -nb 134217728 -nw 7 -s 2 -if 0 -of 1 -cf 5 -cf 2


      And here is the output for the integrated GPU:



      Device 0            BeaverCreek

      Build:               DEBUG

      GPU work items:      8192

      Buffer size:         134217728

      CPU workers:         1

      Timing loops:        20

      Repeats:             1

      Kernel loops:        20

      inputBuffer:         CL_MEM_READ_ONLY

      outputBuffer:        CL_MEM_WRITE_ONLY

      copyBuffer:          CL_MEM_READ_WRITE CL_MEM_ALLOC_HOST_PTR

      Host baseline (single thread, naive):

      Timer resolution  256.225 ns

      Page fault  2047.7

      CPU read   6.16153 GB/s

      memcpy()   3.31839 GB/s

      memset(,1,)   9.00288 GB/s

      memset(,0,)   8.81274 GB/s

      AVERAGES (over loops 2 - 19, use -l for complete log)


        1. Host mapped write to copyBuffer

            clEnqueueMapBuffer(WRITE):            0.000016 s [  8285.50 GB/s ]

            memset():  0.014970 s       8.97 GB/s

            clEnqueueUnmapMemObject():  0.000084 s [  1604.42 GB/s ]

        2. CL copy of copyBuffer to inputBuffer

            clEnqueueCopyBuffer:  0.042168 s       3.18 GB/s

        3. GPU kernel read of inputBuffer

            clEnqueueNDRangeKernel():  0.471207 s       5.70 GB/s

           verification ok

        4. GPU kernel write to outputBuffer

            clEnqueueNDRangeKernel():  0.665229 s       4.04 GB/s

        5. CL copy of outputBuffer to copyBuffer

            clEnqueueCopyBuffer:  0.041343 s       3.25 GB/s

        6. Host mapped read of copyBuffer

            clEnqueueMapBuffer(READ):  0.000017 s [  7710.12 GB/s ]

            CPU read:  0.023021 s       5.83 GB/s

            verification ok

            clEnqueueUnmapMemObject():  0.000089 s [  1506.57 GB/s ]
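
      As a sanity check on how to read the figures in the log, here is a minimal C sketch that recomputes bandwidth from the buffer size and the reported times. It rests on two assumptions inferred from the numbers rather than taken from the BufferBandwidth source: that 1 GB means 1e9 bytes, and that the kernel steps (3 and 4) move the full buffer once per kernel loop (-nk 20) while the other steps move it once. The names bandwidth_gbps and print_recomputed are hypothetical helpers, not part of the SDK.

```c
#include <assert.h>
#include <stdio.h>

/* Bandwidth in GB/s, assuming 1 GB = 1e9 bytes.
 * bytes   : buffer size in bytes (the -nb value)
 * passes  : how many times the buffer is traversed in the timed region
 * seconds : time reported in the log for that step */
static double bandwidth_gbps(double bytes, double passes, double seconds)
{
    assert(seconds > 0.0);
    return bytes * passes / seconds / 1e9;
}

/* Recompute the per-step figures from the times printed in the log. */
static void print_recomputed(void)
{
    const double nb = 134217728.0; /* -nb 134217728 (128 MB) */
    const double nk = 20.0;        /* -nk 20, assumed passes per kernel step */

    printf("1. memset:            %.2f GB/s\n", bandwidth_gbps(nb, 1.0, 0.014970));
    printf("2. copy to input:     %.2f GB/s\n", bandwidth_gbps(nb, 1.0, 0.042168));
    printf("3. GPU kernel read:   %.2f GB/s\n", bandwidth_gbps(nb, nk, 0.471207));
    printf("4. GPU kernel write:  %.2f GB/s\n", bandwidth_gbps(nb, nk, 0.665229));
    printf("5. copy from output:  %.2f GB/s\n", bandwidth_gbps(nb, 1.0, 0.041343));
    printf("6. CPU read:          %.2f GB/s\n", bandwidth_gbps(nb, 1.0, 0.023021));
}
```

      Calling print_recomputed() from a main() should give figures close to the ones the tool printed, which suggests the bracketed values for the map and unmap calls are just the buffer size divided by a very short call time, not real data movement.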



      Is there anything else that I need to enable in order to get higher bandwidths?