
ekondis
Adept II

SVM physically shared or just virtually shared?

I ran the BufferBandwidth example from the AMD APP SDK 3.0 on an A6-1450 APU. The results follow:

Platform 0 : Advanced Micro Devices, Inc.
Platform found : Advanced Micro Devices, Inc.
Selected Platform Vendor : Advanced Micro Devices, Inc.
Device 0 : Kalindi Device ID is 0x232d320
Build Options are : -cl-std=CL2.0

Build:               release
GPU work items:      2048
Buffer size:         33554432
CPU workers:         1
Timing loops:        20
Repeats:             1
Kernel loops:        20
inputBuffer:         CL_MEM_READ_ONLY
outputBuffer:        CL_MEM_WRITE_ONLY
inputSVMBuffer:      CL_MEM_READ_ONLY
outputSVMBuffer:     CL_MEM_WRITE_ONLY

Host baseline (naive):
Timer resolution     1004.72 ns
Page fault           3564.09 ns
CPU read             2.83 GB/s
memcpy()             1.44 GB/s
memset(,1,)          1.30 GB/s
memset(,0,)          1.29 GB/s

AVERAGES (over loops 2 - 19, use -l for complete log)
--------

1. Host mapped write to inputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- WRITE (GBPS)     | 1024.390
---------------------------------------|---------------
memset() (GBPS)                        | 2.401
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS)       | 2.967

2. GPU kernel read of inputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (secs)        | 0.092
Verification Passed!

3. GPU kernel write to outputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (secs)        | 0.146

4. Host mapped read of outputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- READ (GBPS)      | 2.555
---------------------------------------|---------------
CPU read (GBPS)                        | 3.290
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS)       | 461.033
Verification Passed!

5. Host mapped write to inputSVMBuffer
---------------------------------------|---------------
clEnqueueSVMMap -- WRITE (GBPS)        | 1935.760
---------------------------------------|---------------
memset() (GBPS)                        | 2.435
---------------------------------------|---------------
clEnqueueSVMUnmap() (GBPS)             | 2.967

6. GPU kernel execution using inputSVMBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (secs)        | 0.092
Verification Passed!

7. GPU kernel write to outputSVMBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (secs)        | 0.146

8. Host mapped read of outputSVMBuffer
---------------------------------------|---------------
clEnqueueSVMMap -- READ (GBPS)         | 2.547
---------------------------------------|---------------
CPU read (GBPS)                        | 3.427
---------------------------------------|---------------
clEnqueueSVMUnmap() (GBPS)             | 479.160
Verification Passed!

Passed!

I have a question regarding SVM. I noticed that the memory mapping operations appear to copy data from device to host memory, or vice versa, when a buffer is mapped for reading or unmapped after writing, respectively; the GB/s figures for the map and unmap operations lead to this conclusion. While this is allowed by the standard, it seems inappropriate for a buffer that is considered shared virtual memory (SVM): when one uses an SVM buffer, one expects the memory to physically reside in a single place, either in device or in host memory, rather than being copied back and forth. I really don't like that mapping or unmapping an SVM buffer costs so much. For instance, consider performing a binary search on a tree stored in an SVM buffer. The map operation would copy the entire buffer even though only a tiny part of it (a handful of nodes) would actually be needed, and you cannot know in advance which addresses will be accessed. This would be very inefficient.
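To make the scenario concrete, here is a minimal host-side sketch of the coarse-grained SVM pattern I mean (my own illustration, not code from the SDK sample): clEnqueueSVMMap has to make the whole allocation host-accessible before a single node can be read, no matter how few nodes the search actually touches. The node_t layout, the (size_t)-1 null-child sentinel, and the lookup helper are assumptions for illustration only; 'tree' is assumed to come from an earlier clSVMAlloc(context, CL_MEM_READ_WRITE, bytes, 0) call.

#include <CL/cl.h>
#include <stddef.h>

/* Illustrative node layout; the real tree structure is an assumption. */
typedef struct { int key; int value; size_t left; size_t right; } node_t;

/* Binary search over a tree stored in a coarse-grained SVM buffer. */
int svm_tree_lookup(cl_command_queue queue, node_t *tree, size_t bytes, int key)
{
    /* With coarse-grained SVM the whole allocation must be mapped before
     * the host touches it -- even if the search only visits a few nodes.
     * On a platform without true physical sharing this map may copy the
     * entire 'bytes' range, which is exactly the cost questioned above. */
    cl_int err = clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_READ,
                                 tree, bytes, 0, NULL, NULL);
    if (err != CL_SUCCESS) return -1;

    int result = -1;
    size_t i = 0;                          /* root assumed to be node 0 */
    while (i != (size_t)-1) {              /* (size_t)-1 marks a null child */
        if (key == tree[i].key) { result = tree[i].value; break; }
        i = (key < tree[i].key) ? tree[i].left : tree[i].right;
    }

    /* Unmapping hands the buffer back to the device; for a read-only map
     * no data should need to be written back. */
    clEnqueueSVMUnmap(queue, tree, 0, NULL, NULL);
    clFinish(queue);
    return result;
}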


4 Replies

Hi Ekondis,

With Kaveri and Carrizo on Linux we support true physical sharing.

Tzachi


1) Are coarse-grained SVM buffers physically shared on Kaveri & Carrizo, or is a fine-grained SVM buffer required?

2) Can CL_MEM_USE_PERSISTENT_MEM_AMD be combined with clSVMAlloc in order to get a zero-copy, device-resident SVM buffer?


You don't have to specify 'CL_MEM_SVM_FINE_GRAIN_BUFFER' on clSVMAlloc to have true physical sharing on supported platforms.

'CL_MEM_USE_PERSISTENT_MEM_AMD' is not supported with 'clSVMAlloc'.

However, bear in mind that on APUs like Carrizo the term "local GPU memory" is moot, since there is only one type of memory: DDR3 system memory, which is equally accessible to the CPU and the GPU.

On Kaveri the picture is a little more complicated: there are two buses to DDR3, named Onion and Garlic. Onion is CPU cache coherent and a little slower than Garlic. More on this architecture can be found here.

With Carrizo, the fine circuit engineers of AMD solved this issue and merged both buses into a single bus that is CPU and GPU cache coherent and can fully saturate DDR3.
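As a rough sketch of what the above means in code (my own example, not from the SDK; it assumes a valid context 'ctx' and device 'dev' are already set up), one can query the device's SVM capabilities and allocate a coarse-grained SVM buffer without any fine-grain flag:

#include <CL/cl.h>
#include <stdio.h>

/* Query SVM support and allocate a coarse-grained SVM buffer. */
void *alloc_coarse_svm(cl_context ctx, cl_device_id dev, size_t bytes)
{
    cl_device_svm_capabilities caps = 0;
    clGetDeviceInfo(dev, CL_DEVICE_SVM_CAPABILITIES, sizeof(caps), &caps, NULL);

    if (caps & CL_DEVICE_SVM_FINE_GRAIN_BUFFER)
        printf("Fine-grained SVM buffers supported\n");
    if (!(caps & CL_DEVICE_SVM_COARSE_GRAIN_BUFFER))
        return NULL;   /* no SVM buffer support on this device */

    /* Coarse-grained allocation: only read/write flags, no
     * CL_MEM_SVM_FINE_GRAIN_BUFFER required. Whether this ends up
     * physically shared (zero-copy) depends on the platform, e.g.
     * Kaveri/Carrizo on Linux per the reply above. */
    return clSVMAlloc(ctx, CL_MEM_READ_WRITE, bytes, 0);
}

A pointer returned this way is then passed to a kernel with clSetKernelArgSVMPointer() rather than clSetKernelArg(), and released with clSVMFree().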

Thank you for your comprehensive reply.
