Archives Discussions

ekondis · 04-19-2014

Hello,

I'm experimenting with host-allocated memory buffers for reading and writing data. Specifically, I am using the BufferBandwidth sample provided with the SDK on 64-bit Linux with an HD 7750 GPU. With small enough buffers the program runs as expected. However, with somewhat larger buffers the output buffer no longer seems to be host allocated.

Here is the output for array size 130000000, where the output seems normal:

./BufferBandwidth -if 0 -if 5 -of 1 -of 5 -db -nb 130000000

Platform found : Advanced Micro Devices, Inc.

Device 0 Capeverde

Build: release

GPU work items: 290176

Buffer size: 129998848

CPU workers: 1

Timing loops: 20

Repeats: 1

Kernel loops: 20

inputBuffer: CL_MEM_READ_ONLY CL_MEM_ALLOC_HOST_PTR

outputBuffer: CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR

AVERAGES (over loops 2 - 19, use -l for complete log)

--------

1. Host mapped write to inputBuffer

---------------------------------------|---------------

clEnqueueMapBuffer -- WRITE (GBPS) | 7.27e+03

---------------------------------------|---------------

memset() (GBPS) | 4.29

---------------------------------------|---------------

clEnqueueUnmapMemObject() (GBPS) | 6.94e+03

2. GPU kernel read of inputBuffer

---------------------------------------|---------------

clEnqueueNDRangeKernel() (GBPS) | 3.03

Verification Passed!

3. GPU kernel write to outputBuffer

---------------------------------------|---------------

clEnqueueNDRangeKernel() (GBPS) | 1.77

4. Host mapped read of outputBuffer

---------------------------------------|---------------

clEnqueueMapBuffer -- READ (GBPS) | 7.29e+03

---------------------------------------|---------------

CPU read (GBPS) | 4.03

---------------------------------------|---------------

clEnqueueUnmapMemObject() (GBPS) | 5.97e+03

Verification Passed!

And here is the output for array size 134000000, where the output buffer seems to be allocated in device memory (high device write bandwidth, slow when mapped by the CPU):

./BufferBandwidth -if 0 -if 5 -of 1 -of 5 -db -nb 134000000

Platform found : Advanced Micro Devices, Inc.

Device 0 Capeverde

Build: release

GPU work items: 11648

Buffer size: 133998592

CPU workers: 1

Timing loops: 20

Repeats: 1

Kernel loops: 20

inputBuffer: CL_MEM_READ_ONLY CL_MEM_ALLOC_HOST_PTR

outputBuffer: CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR

AVERAGES (over loops 2 - 19, use -l for complete log)

--------

1. Host mapped write to inputBuffer

---------------------------------------|---------------

clEnqueueMapBuffer -- WRITE (GBPS) | 6.25e+03

---------------------------------------|---------------

memset() (GBPS) | 4.29

---------------------------------------|---------------

clEnqueueUnmapMemObject() (GBPS) | 7.13e+03

2. GPU kernel read of inputBuffer

---------------------------------------|---------------

clEnqueueNDRangeKernel() (GBPS) | 3.02

Verification Passed!

3. GPU kernel write to outputBuffer

---------------------------------------|---------------

clEnqueueNDRangeKernel() (GBPS) | 31.5

4. Host mapped read of outputBuffer

---------------------------------------|---------------

clEnqueueMapBuffer -- READ (GBPS) | 2.46

---------------------------------------|---------------

CPU read (GBPS) | 4.03

---------------------------------------|---------------

clEnqueueUnmapMemObject() (GBPS) | 5.39e+03

Verification Passed!

Passed!
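A rough way to read the figures above: the map "bandwidths" in the thousands of GB/s are not physical transfer rates. They are what you get when the buffer size is divided by a near-instant map call. The sketch below converts a reported figure back into an implied time and applies a simple comparison to the two runs. It assumes the sample reports GB/s as bytes / seconds / 1e9 (not checked against the sample source), and both helper names are illustrative only. The zero-copy test is a heuristic for these logs, not an AMD rule.

```python
def implied_seconds(buffer_bytes, reported_gbps):
    """Time implied by a reported bandwidth, assuming 1 GB = 1e9 bytes."""
    return buffer_bytes / (reported_gbps * 1e9)


def looks_zero_copy(map_gbps, kernel_write_gbps):
    """Heuristic (an assumption): if mapping is 'faster' than the kernel
    write, the map probably moved no data, so the buffer likely lives in
    host memory and the GPU wrote to it across the bus."""
    return map_gbps > kernel_write_gbps


# Figures copied from the two runs above (map READ of outputBuffer).
small_map = implied_seconds(129998848, 7.29e3)   # -nb 130000000
large_map = implied_seconds(133998592, 2.46)     # -nb 134000000

print(f"small run map time: {small_map * 1e6:.1f} us")
print(f"large run map time: {large_map * 1e3:.1f} ms")
print("small run zero-copy?", looks_zero_copy(7.29e3, 1.77))
print("large run zero-copy?", looks_zero_copy(2.46, 31.5))
```

Under these assumptions the small run maps in tens of microseconds, while the large run takes tens of milliseconds, which is what a copy back from device memory at map time would look like.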

sudarshan · 04-22-2014

Hi,

I have gone through AMD's OpenCL memory management documentation as well as the BufferBandwidth code, and I found no apparent reason why you should be getting such results.

I ran the code on my side (Radeon 7800 series) with your input, and I am not able to reproduce the behavior you are seeing. I am attaching the logs of the experiment (test_small.txt is for buffer size 130000000 and test_big.txt is for buffer size 134000000).

Can you rerun it with the latest drivers? Also, can you gradually increase the buffer size and find where the fault line is?

ekondis · 04-23-2014

OK, after reinstalling a fresh Linux distribution (for other reasons), I installed the latest Catalyst 14.4 RC and the problem persists. I performed a binary search and found that the problem appears for buffer size 131135488, whereas for 131135487 it does not. Here are the results:

./BufferBandwidth -if 0 -if 5 -of 1 -of 5 -db -nb 131135487

Platform found : Advanced Micro Devices, Inc.

Device 0 Capeverde

Build: release

GPU work items: 12160

Buffer size: 131133440

CPU workers: 1

Timing loops: 20

Repeats: 1

Kernel loops: 20

inputBuffer: CL_MEM_READ_ONLY CL_MEM_ALLOC_HOST_PTR

outputBuffer: CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR

AVERAGES (over loops 2 - 19, use -l for complete log)

--------

1. Host mapped write to inputBuffer

---------------------------------------|---------------

clEnqueueMapBuffer -- WRITE (GBPS) | 1.18e+04

---------------------------------------|---------------

memset() (GBPS) | 4.3

---------------------------------------|---------------

clEnqueueUnmapMemObject() (GBPS) | 6.91e+03

2. GPU kernel read of inputBuffer

---------------------------------------|---------------

clEnqueueNDRangeKernel() (GBPS) | 3.03

Verification Passed!

3. GPU kernel write to outputBuffer

---------------------------------------|---------------

clEnqueueNDRangeKernel() (GBPS) | 1.76

4. Host mapped read of outputBuffer

---------------------------------------|---------------

clEnqueueMapBuffer -- READ (GBPS) | 1.28e+04

---------------------------------------|---------------

CPU read (GBPS) | 4.04

---------------------------------------|---------------

clEnqueueUnmapMemObject() (GBPS) | 6.69e+03

Verification Passed!

Passed!

./BufferBandwidth -if 0 -if 5 -of 1 -of 5 -db -nb 131135488

Platform found : Advanced Micro Devices, Inc.

Device 0 Capeverde

Build: release

GPU work items: 745088

Buffer size: 131135488

CPU workers: 1

Timing loops: 20

Repeats: 1

Kernel loops: 20

inputBuffer: CL_MEM_READ_ONLY CL_MEM_ALLOC_HOST_PTR

outputBuffer: CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR

AVERAGES (over loops 2 - 19, use -l for complete log)

--------

1. Host mapped write to inputBuffer

---------------------------------------|---------------

clEnqueueMapBuffer -- WRITE (GBPS) | 1.39e+04

---------------------------------------|---------------

memset() (GBPS) | 4.3

---------------------------------------|---------------

clEnqueueUnmapMemObject() (GBPS) | 6.67e+03

2. GPU kernel read of inputBuffer

---------------------------------------|---------------

clEnqueueNDRangeKernel() (GBPS) | 3.03

Verification Passed!

3. GPU kernel write to outputBuffer

---------------------------------------|---------------

clEnqueueNDRangeKernel() (GBPS) | 32.1

4. Host mapped read of outputBuffer

---------------------------------------|---------------

clEnqueueMapBuffer -- READ (GBPS) | 2.32

---------------------------------------|---------------

CPU read (GBPS) | 4.03

---------------------------------------|---------------

clEnqueueUnmapMemObject() (GBPS) | 5.47e+03

Verification Passed!

Passed!
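The search described in this post can be sketched as a bisection over the `-nb` value. In the sketch below, `falls_back` is a hypothetical stand-in for running BufferBandwidth and judging the output-buffer figures by hand; it is hard-coded to the boundary reported here. The rounding down to a multiple of 2048 bytes is inferred only from the four "Buffer size" lines in this thread, all of which are consistent with it. It is not taken from the sample source.

```python
GRANULE = 2048  # inferred from the logs in this thread; an assumption


def effective_size(requested):
    """Round the requested -nb value down to a multiple of GRANULE."""
    return (requested // GRANULE) * GRANULE


def falls_back(requested):
    """Hypothetical predicate: True when the output buffer appears to end
    up in device memory. Hard-coded to the boundary reported in this post;
    in practice this would be one benchmark run."""
    return effective_size(requested) >= 131135488


def first_bad(good, bad, is_bad):
    """Smallest n in (good, bad] for which is_bad(n) holds, assuming the
    behavior switches exactly once between good and bad."""
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid
        else:
            good = mid
    return bad


threshold = first_bad(130000000, 134000000, falls_back)
print(threshold)
```

Starting from the two sizes in the first post, about 22 runs are enough to narrow the range down to a single value. Because of the rounding, a requested size of 131135487 actually tests a 131133440-byte buffer, so the switch happens somewhere between those two effective sizes.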

Using CL_MEM_ALLOC_HOST_PTR on buffer for writing output (BufferBandwidth SDK sample)