
Fusion: LDS vs. Global memory performance

Question asked by rspringer on Mar 27, 2012
Latest reply on Mar 28, 2012 by pesh

I've been seeing odd results in my programs, so I finally ran the benchmarks provided with the AMD APP package, and they show similar behavior: reading from global memory appears to be faster (higher bandwidth) than reading from LDS. I would like to understand why.

 

Here are the results from running the test on our test machine (A8-3850 APU):

Global memory test:

Platform 0 : Advanced Micro Devices, Inc.

Platform found : Advanced Micro Devices, Inc.

Selected Platform Vendor : Advanced Micro Devices, Inc.

Device 0 : BeaverCreek Device ID is 0x14788f0

Build Options are : -D DATATYPE=float4 -D OFFSET=16384

 

Global Memory Read

AccessType          : single

VectorElements          : 4

Bandwidth          : 377.067 GB/s

 

Global Memory Read

AccessType          : linear

VectorElements          : 4

Bandwidth          : 218.901 GB/s

 

Global Memory Read

AccessType          : linear(uncached)

VectorElements          : 4

Bandwidth          : 27.6182 GB/s

 

Global Memory Write

AccessType          : linear

VectorElements          : 4

Bandwidth          : 63.4831 GB/s

 

 

Local memory test:

Platform 0 : Advanced Micro Devices, Inc.

Platform found : Advanced Micro Devices, Inc.

 

Selected Platform Vendor : Advanced Micro Devices, Inc.

Device 0 : BeaverCreek Device ID is 0xa7e8f0

Build Options are : -D DATATYPE=float2

 

AccessType          : single

VectorElements          : 2

Bandwidth          : 283.341 GB/s

 

AccessType          : linear

VectorElements          : 2

Bandwidth          : 189.003 GB/s

 

Can someone explain why this is the case, or why these results make sense? I was under the impression that LDS offered more than 10x the bandwidth of global memory. Does that figure only hold when summed across all CUs? Even then it wouldn't add up: 283 * 5 = 1415 GB/s is still well short of 377 * 10 = 3770 GB/s.
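To make the arithmetic concrete, here is a minimal Python sketch using only the "single" access figures from the output above. The CU count of 5 is the value I used in the question and the 10x factor is just my impression of what LDS should deliver; both are assumptions, not measured or documented values, and the function name is made up for illustration.

```python
# Bandwidth figures (GB/s) copied from the benchmark output in this post.
GLOBAL_SINGLE_GBPS = 377.067   # Global Memory Read, AccessType single
LDS_SINGLE_GBPS = 283.341      # Local memory test, AccessType single

ASSUMED_CU_COUNT = 5           # assumption: the "5" used in the question
EXPECTED_LDS_ADVANTAGE = 10    # assumption: the ">10x" impression from the question


def lds_to_global_ratio(lds_gbps, global_gbps, cu_count=1):
    """Ratio of LDS to global bandwidth, optionally scaling LDS by a CU count."""
    return lds_gbps * cu_count / global_gbps


# Ratio taking both reported numbers at face value.
as_reported = lds_to_global_ratio(LDS_SINGLE_GBPS, GLOBAL_SINGLE_GBPS)

# Ratio if the LDS number were per-CU and had to be multiplied by the CU count.
scaled_by_cus = lds_to_global_ratio(
    LDS_SINGLE_GBPS, GLOBAL_SINGLE_GBPS, ASSUMED_CU_COUNT
)

print(f"LDS / global, as reported:      {as_reported:.2f}x")
print(f"LDS / global, LDS scaled by CUs: {scaled_by_cus:.2f}x")
print(f"Reaches the expected {EXPECTED_LDS_ADVANTAGE}x: "
      f"{scaled_by_cus > EXPECTED_LDS_ADVANTAGE}")
```

Either way the ratio stays far below 10, which is the discrepancy I am asking about.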

 

Thanks for any help!
