I ran the BufferBandwidth example from the AMD APP SDK 3.0 on an A6-1450 APU. The results follow:
Platform 0 : Advanced Micro Devices, Inc.
Platform found : Advanced Micro Devices, Inc.
Selected Platform Vendor : Advanced Micro Devices, Inc.
Device 0 : Kalindi Device ID is 0x232d320
Build Options are : -cl-std=CL2.0
Build: release
GPU work items: 2048
Buffer size: 33554432
CPU workers: 1
Timing loops: 20
Repeats: 1
Kernel loops: 20
inputBuffer: CL_MEM_READ_ONLY
outputBuffer: CL_MEM_WRITE_ONLY
inputSVMBuffer: CL_MEM_READ_ONLY
outputSVMBuffer: CL_MEM_WRITE_ONLY
Host baseline (naive):
Timer resolution 1004.72 ns
Page fault 3564.09 ns
CPU read 2.83 GB/s
memcpy() 1.44 GB/s
memset(,1,) 1.30 GB/s
memset(,0,) 1.29 GB/s
AVERAGES (over loops 2 - 19, use -l for complete log)
--------
1. Host mapped write to inputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- WRITE (GBPS) | 1024.390
---------------------------------------|---------------
memset() (GBPS) | 2.401
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 2.967
2. GPU kernel read of inputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (secs) | 0.092
Verification Passed!
3. GPU kernel write to outputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (secs) | 0.146
4. Host mapped read of outputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- READ (GBPS) | 2.555
---------------------------------------|---------------
CPU read (GBPS) | 3.290
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 461.033
Verification Passed!
5. Host mapped write to inputSVMBuffer
---------------------------------------|---------------
clEnqueueSVMMap -- WRITE (GBPS) | 1935.760
---------------------------------------|---------------
memset() (GBPS) | 2.435
---------------------------------------|---------------
clEnqueueSVMUnmap() (GBPS) | 2.967
6. GPU kernel execution using inputSVMBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (secs) | 0.092
Verification Passed!
7. GPU kernel write to outputSVMBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (secs) | 0.146
8. Host mapped read of outputSVMBuffer
---------------------------------------|---------------
clEnqueueSVMMap -- READ (GBPS) | 2.547
---------------------------------------|---------------
CPU read (GBPS) | 3.427
---------------------------------------|---------------
clEnqueueSVMUnmap() (GBPS) | 479.160
Verification Passed!
Passed!
I have a question regarding SVM. I noticed that the memory-mapping operations tend to copy data from device to host memory, or vice versa, when a buffer is mapped for reading or unmapped after writing, respectively. The GB/s figures for the map and unmap operations above lead me to this conclusion.

While this may be reasonable, I find it inappropriate for a buffer that is supposed to be shared virtual memory (SVM). It meets the standard, but when one uses an SVM buffer one expects the memory to physically reside either in device or in host memory, not to be copied back and forth. I really don't like that mapping and unmapping SVM costs so much.

For instance, consider performing a binary search on a tree stored in an SVM buffer. The map operation would copy the entire buffer even though only a tiny part of it (a few nodes) is actually needed, and you cannot know in advance which addresses will be accessed. This would be very inefficient.
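To make the access pattern concrete, here is a minimal sketch of host access to a coarse-grained SVM buffer, which must be bracketed by map/unmap calls. It assumes an already-created OpenCL 2.0 context and command queue (`ctx`, `queue`), and error checking is omitted for brevity:

```c
#include <CL/cl.h>
#include <string.h>

/* Sketch: host access to a coarse-grained SVM buffer.
   Assumes 'ctx' and 'queue' are a valid OpenCL 2.0 context and
   command queue on an SVM-capable device; no error checking. */
void host_touch_svm(cl_context ctx, cl_command_queue queue)
{
    size_t size = 32 * 1024 * 1024;  /* 32 MiB, as in the run above */
    void *svm = clSVMAlloc(ctx, CL_MEM_READ_WRITE, size, 0);

    /* Coarse-grained SVM: the runtime is allowed to copy the whole
       buffer here, even if the host only reads a few bytes of it. */
    clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE,
                    svm, size, 0, NULL, NULL);
    memset(svm, 1, size);            /* host-side access */
    clEnqueueSVMUnmap(queue, svm, 0, NULL, NULL);
    clFinish(queue);

    clSVMFree(ctx, svm);
}
```

The whole-buffer map is exactly what hurts the pointer-chasing (tree search) case: the map granularity is the buffer, not the nodes actually touched.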
Hi Ekondis,
With Kaveri and Carrizo on Linux we support true physical sharing.
Tzachi
1) Are coarse-grained SVM buffers physically shared on Kaveri & Carrizo, or is a fine-grained SVM buffer required?
2) Can CL_MEM_USE_PERSISTENT_MEM_AMD be combined with clSVMAlloc in order to get a zero-copy, device-resident SVM buffer?
You don't have to specify 'CL_MEM_SVM_FINE_GRAIN_BUFFER' on clSVMAlloc to get true physical sharing on supported platforms.
'CL_MEM_USE_PERSISTENT_MEM_AMD' is not supported with 'clSVMAlloc'.
However, bear in mind that on APUs like Carrizo the term "local GPU memory" is moot since there is only one type of memory: DDR3 system memory and it is equally accessible to CPU and GPU alike.
On Kaveri the picture is a little bit more complicated, there are two buses to DDR3, named Onion and Garlic. Onion is CPU cache coherent and a little slower than Garlic. More on this architecture can be found here.
With Carrizo, the fine circuit engineers of AMD solved this issue and merged both buses to a single bus which is CPU and GPU cache coherent and can fully saturate DDR3.
Thank you for your comprehensive reply.