2 Replies Latest reply on Apr 23, 2014 8:17 AM by ekondis

    Using CL_MEM_ALLOC_HOST_PTR on buffer for writing output (BufferBandwidth SDK sample)

    ekondis

      Hello,

      I'm experimenting with host-allocated memory buffers for reading and writing data. Specifically, I'm running the BufferBandwidth sample provided with the SDK under 64-bit Linux on an HD 7750 GPU. With small enough buffers the program behaves as expected. With a somewhat larger array size, however, the output buffer no longer seems to be host allocated.

       

      Here is the output for array size 130000000, where the results look normal:

      ./BufferBandwidth -if 0 -if 5 -of 1 -of 5 -db -nb 130000000
      Platform found : Advanced Micro Devices, Inc.
      
      
      
      
      Device  0            Capeverde
      Build:               release
      GPU work items:      290176
      Buffer size:         129998848
      CPU workers:         1
      Timing loops:        20
      Repeats:             1
      Kernel loops:        20
      inputBuffer:         CL_MEM_READ_ONLY CL_MEM_ALLOC_HOST_PTR 
      outputBuffer:        CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR 
      
      
      
      
      AVERAGES (over loops 2 - 19, use -l for complete log)
      --------
      
      
      
      
      1. Host mapped write to inputBuffer
       ---------------------------------------|---------------
       clEnqueueMapBuffer -- WRITE (GBPS)     | 7.27e+03
       ---------------------------------------|---------------
       memset() (GBPS)                        | 4.29
       ---------------------------------------|---------------
       clEnqueueUnmapMemObject() (GBPS)       | 6.94e+03
      
      
      
      
      2. GPU kernel read of inputBuffer
       ---------------------------------------|---------------
       clEnqueueNDRangeKernel() (GBPS)        | 3.03
      
      
       Verification Passed!
      
      
      
      
      3. GPU kernel write to outputBuffer
       ---------------------------------------|---------------
       clEnqueueNDRangeKernel() (GBPS)        | 1.77
      
      
      
      
      4. Host mapped read of outputBuffer
       ---------------------------------------|---------------
       clEnqueueMapBuffer -- READ (GBPS)      | 7.29e+03
       ---------------------------------------|---------------
       CPU read (GBPS)                        | 4.03
       ---------------------------------------|---------------
       clEnqueueUnmapMemObject() (GBPS)       | 5.97e+03
      
      
       Verification Passed!
      

       

      And here is the output for array size 134000000, where the output buffer seems to be allocated in device memory (high bandwidth when the kernel writes to it, slow when the CPU maps it):

      ./BufferBandwidth -if 0 -if 5 -of 1 -of 5 -db -nb 134000000
      Platform found : Advanced Micro Devices, Inc.
      
      
      
      
      Device  0            Capeverde
      Build:               release
      GPU work items:      11648
      Buffer size:         133998592
      CPU workers:         1
      Timing loops:        20
      Repeats:             1
      Kernel loops:        20
      inputBuffer:         CL_MEM_READ_ONLY CL_MEM_ALLOC_HOST_PTR 
      outputBuffer:        CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR 
      
      
      
      
      AVERAGES (over loops 2 - 19, use -l for complete log)
      --------
      
      
      
      
      1. Host mapped write to inputBuffer
       ---------------------------------------|---------------
       clEnqueueMapBuffer -- WRITE (GBPS)     | 6.25e+03
       ---------------------------------------|---------------
       memset() (GBPS)                        | 4.29
       ---------------------------------------|---------------
       clEnqueueUnmapMemObject() (GBPS)       | 7.13e+03
      
      
      
      
      2. GPU kernel read of inputBuffer
       ---------------------------------------|---------------
       clEnqueueNDRangeKernel() (GBPS)        | 3.02
      
      
       Verification Passed!
      
      
      
      
      3. GPU kernel write to outputBuffer
       ---------------------------------------|---------------
       clEnqueueNDRangeKernel() (GBPS)        | 31.5
      
      
      
      
      4. Host mapped read of outputBuffer
       ---------------------------------------|---------------
       clEnqueueMapBuffer -- READ (GBPS)      | 2.46
       ---------------------------------------|---------------
       CPU read (GBPS)                        | 4.03
       ---------------------------------------|---------------
       clEnqueueUnmapMemObject() (GBPS)       | 5.39e+03
      
      
       Verification Passed!
      
      
      
      
      Passed!
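To put numbers on the difference between the two runs, here is a small sketch that compares the output-buffer figures copied from the logs above. The dictionary layout and the function name are only illustrative. A map call reporting thousands of GB/s is far beyond what any bus could deliver, which suggests nothing is copied on map; 2.46 GB/s looks like a real transfer instead.

```python
# Output-buffer figures copied from the two logs above (GB/s).
runs = {
    130000000: {"kernel_write": 1.77, "map_read": 7.29e3},
    134000000: {"kernel_write": 31.5, "map_read": 2.46},
}

def ratios(small, large):
    """Return (kernel-write speed-up, map-read slow-down) going from small to large."""
    return (
        large["kernel_write"] / small["kernel_write"],
        small["map_read"] / large["map_read"],
    )

write_gain, map_loss = ratios(runs[130000000], runs[134000000])
print(f"kernel write: {write_gain:.1f}x faster, map for read: {map_loss:.0f}x slower")
```

The kernel write gets over an order of magnitude faster while the map for read gets roughly three orders of magnitude slower, which is the pattern I would expect if the larger buffer ended up in device memory.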
      
        • Re: Using CL_MEM_ALLOC_HOST_PTR on buffer for writing output (BufferBandwidth SDK sample)
          sudarshan

          Hi,

          I have gone through AMD's OpenCL memory management documentation as well as the BufferBandwidth code, and I found no apparent reason why you should be getting such results.

           

          I ran the code on my side (Radeon 7800 series) with your input and could not reproduce the behaviour you are seeing. I am attaching the output of both runs (test_small.txt is for buffer size 130000000 and test_big.txt is for buffer size 134000000).

           

          Can you rerun it with the latest drivers? Also, can you gradually increase the buffer size and find where the fault line is?
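Rather than stepping through every size, the fault line can be located by bisection between a size known to behave normally and one known not to. A minimal sketch, assuming the behaviour flips only once in the range: `find_fault_line` and `demo_is_faulty` are hypothetical names, and the threshold inside the stand-in predicate is an arbitrary placeholder, not a measured value. In practice the predicate would run ./BufferBandwidth with -nb set to the candidate size and inspect the reported bandwidths.

```python
def find_fault_line(lo, hi, is_faulty):
    """Return the smallest size in (lo, hi] for which is_faulty(size) is True.

    Assumes is_faulty(lo) is False, is_faulty(hi) is True, and that the
    behaviour flips exactly once between them.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_faulty(mid):
            hi = mid   # fault already present at mid: look lower
        else:
            lo = mid   # still normal at mid: look higher
    return hi

# Stand-in predicate for demonstration only (placeholder threshold).
def demo_is_faulty(size, threshold=132000000):
    return size >= threshold

# 130000000 behaved normally and 134000000 did not, per the first post.
fault_line = find_fault_line(130000000, 134000000, demo_is_faulty)
print(fault_line)
```

Each step halves the interval, so the 4,000,000-wide range above needs about 22 runs of the benchmark.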

            • Re: Re: Using CL_MEM_ALLOC_HOST_PTR on buffer for writing output (BufferBandwidth SDK sample)
              ekondis

              OK, I had to install a fresh Linux distribution (for other reasons), so I installed the latest Catalyst 14.4 RC, and the problem persists. I performed a binary search and found that the problem appears at buffer size 131135488 but not at 131135487 (which the sample reports as a buffer size of 131133440). Here are the results:

               

              ./BufferBandwidth -if 0 -if 5 -of 1 -of 5 -db -nb 131135487
              Platform found : Advanced Micro Devices, Inc.
              
              
              
              
              Device  0            Capeverde
              Build:               release
              GPU work items:      12160
              Buffer size:         131133440
              CPU workers:         1
              Timing loops:        20
              Repeats:             1
              Kernel loops:        20
              inputBuffer:         CL_MEM_READ_ONLY CL_MEM_ALLOC_HOST_PTR
              outputBuffer:        CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR
              
              
              
              
              AVERAGES (over loops 2 - 19, use -l for complete log)
              --------
              
              
              
              
              1. Host mapped write to inputBuffer
              ---------------------------------------|---------------
              clEnqueueMapBuffer -- WRITE (GBPS)     | 1.18e+04
              ---------------------------------------|---------------
              memset() (GBPS)                        | 4.3
              ---------------------------------------|---------------
              clEnqueueUnmapMemObject() (GBPS)       | 6.91e+03
              
              
              
              
              2. GPU kernel read of inputBuffer
              ---------------------------------------|---------------
              clEnqueueNDRangeKernel() (GBPS)        | 3.03
              
              
              Verification Passed!
              
              
              
              
              3. GPU kernel write to outputBuffer
              ---------------------------------------|---------------
              clEnqueueNDRangeKernel() (GBPS)        | 1.76
              
              
              
              
              4. Host mapped read of outputBuffer
              ---------------------------------------|---------------
              clEnqueueMapBuffer -- READ (GBPS)      | 1.28e+04
              ---------------------------------------|---------------
              CPU read (GBPS)                        | 4.04
              ---------------------------------------|---------------
              clEnqueueUnmapMemObject() (GBPS)       | 6.69e+03
              
              
              Verification Passed!
              
              
              
              
              Passed!
              ./BufferBandwidth -if 0 -if 5 -of 1 -of 5 -db -nb 131135488
              Platform found : Advanced Micro Devices, Inc.
              
              
              
              
              Device  0            Capeverde
              Build:               release
              GPU work items:      745088
              Buffer size:         131135488
              CPU workers:         1
              Timing loops:        20
              Repeats:             1
              Kernel loops:        20
              inputBuffer:         CL_MEM_READ_ONLY CL_MEM_ALLOC_HOST_PTR
              outputBuffer:        CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR
              
              
              
              
              AVERAGES (over loops 2 - 19, use -l for complete log)
              --------
              
              
              
              
              1. Host mapped write to inputBuffer
              ---------------------------------------|---------------
              clEnqueueMapBuffer -- WRITE (GBPS)     | 1.39e+04
              ---------------------------------------|---------------
              memset() (GBPS)                        | 4.3
              ---------------------------------------|---------------
              clEnqueueUnmapMemObject() (GBPS)       | 6.67e+03
              
              
              
              
              2. GPU kernel read of inputBuffer
              ---------------------------------------|---------------
              clEnqueueNDRangeKernel() (GBPS)        | 3.03
              
              
              Verification Passed!
              
              
              
              
              3. GPU kernel write to outputBuffer
              ---------------------------------------|---------------
              clEnqueueNDRangeKernel() (GBPS)        | 32.1
              
              
              
              
              4. Host mapped read of outputBuffer
              ---------------------------------------|---------------
              clEnqueueMapBuffer -- READ (GBPS)      | 2.32
              ---------------------------------------|---------------
              CPU read (GBPS)                        | 4.03
              ---------------------------------------|---------------
              clEnqueueUnmapMemObject() (GBPS)       | 5.47e+03
              
              
              Verification Passed!
              
              
              
              
              Passed!
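One detail worth noting when reading these numbers: the "Buffer size" the sample reports is not always the value passed to -nb. The sketch below checks the four pairs from the logs in this thread against a simple hypothesis, namely that the request is rounded down to a multiple of 2048 bytes. That granule is only inferred from these four data points, not taken from the SDK source.

```python
# (requested -nb value, "Buffer size" reported by the sample), from the logs above.
observed = [
    (130000000, 129998848),
    (134000000, 133998592),
    (131135487, 131133440),
    (131135488, 131135488),
]

GRANULE = 2048  # assumption inferred from the data points, not from the SDK source

def round_down(n, granule=GRANULE):
    """Round n down to the nearest multiple of granule."""
    return n - n % granule

# Every observed pair is consistent with the rounding hypothesis.
for requested, reported in observed:
    assert round_down(requested) == reported

# Gap between the last size that behaved normally and the first that did not.
print(131135488 - round_down(131135487))
```

If the hypothesis holds, the two sizes found by the binary search are adjacent steps in the sizes the sample can actually allocate, so the switch happens somewhere between a 131133440-byte and a 131135488-byte buffer.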