Hello,
I'm experimenting with host-allocated memory buffers for reading and writing data. Specifically, I'm running the BufferBandwidth sample provided with the SDK under 64-bit Linux on an HD7750 GPU. With small enough buffers the program runs as expected. With somewhat larger buffers, however, the output buffer no longer seems to be host allocated.
Here is the output for array size 130000000, where the results look normal:
./BufferBandwidth -if 0 -if 5 -of 1 -of 5 -db -nb 130000000
Platform found : Advanced Micro Devices, Inc.
Device 0 Capeverde
Build: release
GPU work items: 290176
Buffer size: 129998848
CPU workers: 1
Timing loops: 20
Repeats: 1
Kernel loops: 20
inputBuffer: CL_MEM_READ_ONLY CL_MEM_ALLOC_HOST_PTR
outputBuffer: CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR
AVERAGES (over loops 2 - 19, use -l for complete log)
--------
1. Host mapped write to inputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- WRITE (GBPS) | 7.27e+03
---------------------------------------|---------------
memset() (GBPS) | 4.29
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 6.94e+03
2. GPU kernel read of inputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 3.03
Verification Passed!
3. GPU kernel write to outputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 1.77
4. Host mapped read of outputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- READ (GBPS) | 7.29e+03
---------------------------------------|---------------
CPU read (GBPS) | 4.03
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 5.97e+03
Verification Passed!
And here is the output for array size 134000000, where the output buffer seems to be allocated in device memory (the kernel write bandwidth jumps from 1.77 to 31.5 GBPS, while clEnqueueMapBuffer for reading drops from 7.29e+03 to 2.46 GBPS):
./BufferBandwidth -if 0 -if 5 -of 1 -of 5 -db -nb 134000000
Platform found : Advanced Micro Devices, Inc.
Device 0 Capeverde
Build: release
GPU work items: 11648
Buffer size: 133998592
CPU workers: 1
Timing loops: 20
Repeats: 1
Kernel loops: 20
inputBuffer: CL_MEM_READ_ONLY CL_MEM_ALLOC_HOST_PTR
outputBuffer: CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR
AVERAGES (over loops 2 - 19, use -l for complete log)
--------
1. Host mapped write to inputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- WRITE (GBPS) | 6.25e+03
---------------------------------------|---------------
memset() (GBPS) | 4.29
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 7.13e+03
2. GPU kernel read of inputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 3.02
Verification Passed!
3. GPU kernel write to outputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 31.5
4. Host mapped read of outputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- READ (GBPS) | 2.46
---------------------------------------|---------------
CPU read (GBPS) | 4.03
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 5.39e+03
Verification Passed!
Passed!
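To compare runs like the two above without reading every table by eye, the result rows can be pulled out of the log with a short script. This is only a sketch: the function names `parse_bandwidth` and `map_read_looks_like_copy` are made up for illustration, the row format is assumed to match the output pasted above, and the heuristic (a mapped read slower than the kernel write suggests the map involved a copy from device memory) is an assumption, not something the sample itself reports.

```python
import re

# Result rows look like "clEnqueueNDRangeKernel() (GBPS) | 31.5".
ROW = re.compile(r"^(?P<name>.+?)\s*\(GBPS\)\s*\|\s*(?P<value>[0-9.eE+-]+)\s*$")
# Section headers look like "3. GPU kernel write to outputBuffer".
SECTION = re.compile(r"^(?P<num>\d+)\.\s+(?P<title>.+)$")

# Excerpt of the 134000000 run above (sections 3 and 4 only).
SAMPLE_LOG = """
3. GPU kernel write to outputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 31.5
4. Host mapped read of outputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- READ (GBPS) | 2.46
---------------------------------------|---------------
CPU read (GBPS) | 4.03
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 5.39e+03
"""


def parse_bandwidth(log_text):
    """Return {(section_number, row_name): GBPS} for every result row."""
    results = {}
    section = None
    for line in log_text.splitlines():
        line = line.strip()
        header = SECTION.match(line)
        if header:
            section = int(header.group("num"))
            continue
        row = ROW.match(line)
        if row and section is not None:
            key = (section, row.group("name").strip())
            results[key] = float(row.group("value"))
    return results


def map_read_looks_like_copy(results):
    """Assumed heuristic: mapping outputBuffer is slower than writing it."""
    kernel_write = results[(3, "clEnqueueNDRangeKernel()")]
    map_read = results[(4, "clEnqueueMapBuffer -- READ")]
    return map_read < kernel_write


if __name__ == "__main__":
    parsed = parse_bandwidth(SAMPLE_LOG)
    print(map_read_looks_like_copy(parsed))  # prints True for this excerpt
```

For the 130000000 run the same check gives False, since the mapped read (7.29e+03) is far faster than the kernel write (1.77).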
Hi,
I have gone through AMD's OpenCL memory management documentation as well as the BufferBandwidth code, and I found no apparent reason why you should be getting such results.
I ran the code on my side (Radeon 7800 series) with your input, and I am not able to reproduce the behavior you are seeing. I am attaching snapshots of the experiment (test_small.txt is for buffer size 130000000 and test_big.txt is for buffer size 134000000).
Can you rerun it with the latest drivers? Also, can you gradually increase the buffer size and find the point where the behavior changes?
OK, I had to install a fresh Linux distribution (for other reasons), so I then installed the latest Catalyst 14.4 RC, and the problem persists. I performed a binary search and found that the problem appears at buffer size 131135488 but not at 131135487 (which the sample reports as buffer size 131133440). Here are the results:
./BufferBandwidth -if 0 -if 5 -of 1 -of 5 -db -nb 131135487
Platform found : Advanced Micro Devices, Inc.
Device 0 Capeverde
Build: release
GPU work items: 12160
Buffer size: 131133440
CPU workers: 1
Timing loops: 20
Repeats: 1
Kernel loops: 20
inputBuffer: CL_MEM_READ_ONLY CL_MEM_ALLOC_HOST_PTR
outputBuffer: CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR
AVERAGES (over loops 2 - 19, use -l for complete log)
--------
1. Host mapped write to inputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- WRITE (GBPS) | 1.18e+04
---------------------------------------|---------------
memset() (GBPS) | 4.3
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 6.91e+03
2. GPU kernel read of inputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 3.03
Verification Passed!
3. GPU kernel write to outputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 1.76
4. Host mapped read of outputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- READ (GBPS) | 1.28e+04
---------------------------------------|---------------
CPU read (GBPS) | 4.04
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 6.69e+03
Verification Passed!
Passed!
./BufferBandwidth -if 0 -if 5 -of 1 -of 5 -db -nb 131135488
Platform found : Advanced Micro Devices, Inc.
Device 0 Capeverde
Build: release
GPU work items: 745088
Buffer size: 131135488
CPU workers: 1
Timing loops: 20
Repeats: 1
Kernel loops: 20
inputBuffer: CL_MEM_READ_ONLY CL_MEM_ALLOC_HOST_PTR
outputBuffer: CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR
AVERAGES (over loops 2 - 19, use -l for complete log)
--------
1. Host mapped write to inputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- WRITE (GBPS) | 1.39e+04
---------------------------------------|---------------
memset() (GBPS) | 4.3
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 6.67e+03
2. GPU kernel read of inputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 3.03
Verification Passed!
3. GPU kernel write to outputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 32.1
4. Host mapped read of outputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- READ (GBPS) | 2.32
---------------------------------------|---------------
CPU read (GBPS) | 4.03
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 5.47e+03
Verification Passed!
Passed!
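The binary search over `-nb` mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: `first_affected_size` and `reported_size` are hypothetical helper names, and the predicate here is a stand-in that hard-codes the boundary found above; in a real search it would run BufferBandwidth with the given size and inspect the output. The 2048-byte granularity is inferred only from the four "Buffer size" lines in the logs of this thread, not from the sample's source.

```python
def reported_size(requested, granularity=2048):
    """Round a requested -nb value down to a multiple of `granularity`.

    Assumption: this matches the "Buffer size" lines in the logs above.
    """
    return requested - requested % granularity


def first_affected_size(is_affected, lo, hi):
    """Return the smallest size in (lo, hi] for which is_affected is True.

    Requires is_affected(lo) to be False, is_affected(hi) to be True,
    and the behavior to flip exactly once between them.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_affected(mid):
            hi = mid  # boundary is at mid or below
        else:
            lo = mid  # boundary is above mid
    return hi


if __name__ == "__main__":
    # Stand-in predicate: a size is "affected" once its reported buffer
    # size reaches the boundary observed in this thread.
    def is_affected(size):
        return reported_size(size) >= 131135488

    print(first_affected_size(is_affected, 130000000, 134000000))  # prints 131135488
    print(reported_size(131135487))  # prints 131133440
```

Because the reported size only changes every 2048 bytes under this assumption, the two sizes that actually bracket the change are 131133440 and 131135488.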