
rspringer
Journeyman III

Fusion: LDS vs. Global memory performance

I've been seeing odd results in my programs, and I finally got around to running the benchmarks provided with the AMD APP package, and they have similar behavior. It appears that reading from global memory is faster (greater bandwidth) than reading from LDS, and I would like to understand why.

Here are the results from running the test on our test machine (A8-3850 APU):

Global memory test:

Platform 0 : Advanced Micro Devices, Inc.

Platform found : Advanced Micro Devices, Inc.

Selected Platform Vendor : Advanced Micro Devices, Inc.

Device 0 : BeaverCreek Device ID is 0x14788f0

Build Options are : -D DATATYPE=float4 -D OFFSET=16384

 

Global Memory Read

AccessType          : single

VectorElements          : 4

Bandwidth          : 377.067 GB/s

 

Global Memory Read

AccessType          : linear

VectorElements          : 4

Bandwidth          : 218.901 GB/s

Global Memory Read

AccessType          : linear(uncached)

VectorElements          : 4

Bandwidth          : 27.6182 GB/s

Global Memory Write

AccessType          : linear

VectorElements          : 4

Bandwidth          : 63.4831 GB/s

Local memory test:

Platform 0 : Advanced Micro Devices, Inc.

Platform found : Advanced Micro Devices, Inc.

Selected Platform Vendor : Advanced Micro Devices, Inc.

Device 0 : BeaverCreek Device ID is 0xa7e8f0

Build Options are : -D DATATYPE=float2

 

AccessType          : single

VectorElements          : 2

Bandwidth          : 283.341 GB/s

AccessType          : linear

VectorElements          : 2

Bandwidth          : 189.003 GB/s

Can someone explain why this is the case / why these results make sense? I was under the impression that LDS offered more than 10x the bandwidth of global memory. Is that figure only when aggregated across all CUs? (But even that wouldn't add up - 283 * 5 still isn't > 377 * 10.)

Thanks for any help!

9 Replies
rspringer
Journeyman III

Thinking about the results, the reported bandwidth for global memory must be using L1/L2 somehow, because that figure is unrealistically large. Is the uncached value (27.6 GB/s) the one to use for comparisons?


It would be a comparison to uncached bandwidth. L1 bandwidth is about half of peak LDS bandwidth, if I recall correctly. You have to carefully construct a program to hit either L1 or LDS peak, though.

Cool - that makes a bunch more sense.

If L1 bandwidth is about 1/2 of LDS bandwidth, do you know why we see global memory read bandwidth at roughly 3/2 that of LDS (in the samples/opencl/benchmark/{GlobalMemoryBandwidth,LDSBandwidth} examples)?

On the face of it, with a measured LDS read bandwidth of ~280 GB/s, I'd expect a global read bandwidth of ~140 GB/s, not ~370. I'm sure there's just one more thing I'm missing...

I really appreciate the help so far!


At least on the discrete GPUs, LDS uses 32-bit lines while cache uses 128-bit, IIRC. Try rerunning the benchmarks using -c 1.

Here are the results of doing so on Tahiti:

4:

LDS:

AccessType    : single

VectorElements    : 4

Bandwidth    : 3381.26 GB/s

AccessType    : linear

VectorElements    : 4

Bandwidth    : 933.717 GB/s

Global

Global Memory Read

AccessType    : single

VectorElements    : 4

Bandwidth    : 1775.8 GB/s

Global Memory Read

AccessType    : linear

VectorElements    : 4

Bandwidth    : 1696.36 GB/s

Global Memory Read

AccessType    : linear(uncached)

VectorElements    : 4

Bandwidth    : 228.799 GB/s

Global Memory Write

AccessType    : linear

VectorElements    : 4

Bandwidth    : 501.616 GB/s

 

1:

Local:

AccessType    : single

VectorElements    : 1

Bandwidth    : 2831.88 GB/s

AccessType    : linear

VectorElements    : 1

Bandwidth    : 3603.81 GB/s

Global

Global Memory Read

AccessType    : single

VectorElements    : 1

Bandwidth    : 847.404 GB/s

Global Memory Read

AccessType    : linear

VectorElements    : 1

Bandwidth    : 828.229 GB/s

Global Memory Read

AccessType    : linear(uncached)

VectorElements    : 1

Bandwidth    : 1284.95 GB/s <<<<-----------clearly a bug in the benchmark. I think you need to make the buffer bigger?

Global Memory Write

AccessType    : linear

VectorElements    : 1

Bandwidth    : 461.21 GB/s

As you can see, you get more cache bandwidth with 4 components, but more LDS bandwidth with 1 component.

This is fascinating to me - thanks a ton for posting your results.

I had forgotten about the different widths of LDS vs. cache - thanks for that (it's hard to keep all the relevant performance parameters in mind).

Focusing solely on single-element read times, it's strange to me how different (in relative speeds, not in absolute values) the results are between your Tahiti and my Llano/BeaverCreek:

For LDS, with the following floatN types, here are my measured bandwidths:

1: 312 GB/s

2: 283 GB/s

4: 333 GB/s

Whereas for global/L1, here's what I see:

1: 339 GB/s

2: 370 GB/s

4: 377 GB/s

At this point, I think I understand what to expect on most devices (and why), but I am still curious why L1/global dominates LDS in all [tested] cases on this device. Are the Fusion APUs (Llano) internally arranged (WRT L1/LDS) differently than other devices?

Again, thanks all for your help so far! I hope this line of questioning isn't getting tiresome - I'm honestly just curious about these results at this point.

pesh
Adept I

Hi, rspringer.

I think the first two tests (the single and linear access types) are not actually global memory bandwidth benchmarks; it would be more accurate to call them L1 cache bandwidth tests. If you look at the kernel code, it has very strong cache locality: sequential work-items read from practically the same memory region, so after the first fetch the compute unit never needs to go back to global memory and just serves the reads from cache. That's why the measured fetch speeds are so large. Another big difference between L1 cache (or LDS) and global memory is what happens when work-items executing the same read instruction access the same address: with global memory the accesses conflict and serialize, whereas with L1 cache (or LDS) a single access is broadcast to all the work-items. That's why the single access pattern outperforms even the linear one - if this were really global memory, the single pattern should have terrible results because of serialized access to the same location.

Besides, there is no GDDR that can provide 300 GB/s of bandwidth, let alone the 1500 GB/s that rick.weber measured - it's physically impossible. Modern GDDR delivers about 150 GB/s. Even using the cache, you can't read from memory faster than that; cache only helps you utilize memory bandwidth and reuse data more efficiently. Moreover, AMD APUs use a unified memory model, where the CPU and GPU share the same DDR, so APU owners are in any case limited by DDR3 bandwidth - at most about 34 GB/s with the fastest dual-channel DDR3.


I don't think there's any question that the benchmark is using the cache.

However - it has been stated and is documented that the LDS has *twice* the bandwidth of the L1 cache.

As such I find these results curious - of course it is difficult to create micro-benchmarks that do exactly what you think they do - but the result does seem odd. I'm not sure the all-single or all-float4 test is measuring the bandwidth correctly either. L1 (per the programming guide) is optimal at float4 size, but LDS at float2 (afaict). That's the problem with theoretical bandwidth testing: it depends very much on data sizes and access patterns.

However, for example, the programming guide states that for Cypress the LDS is limited to 1 TB/s even though it's capable of 2 TB/s - but that info may be out of date(?); that section appears dated (it talks about SDK 2.2, etc.).

...

However: I do think that whilst the discussion is interesting, in practice it isn't that useful. In either case one always has to go to global memory at some point, and it is impossible to get a 100% cache hit rate. And if your algorithm has a known (and compatible) data-access pattern, using LDS will win simply because you have more of it and it is local to the work-group (so you have ~guaranteed access performance, no cache thrashing, etc.).


L1 on Evergreen could do 1280 bytes/clock. 64 bytes/clock/core. 4 bytes/clock/active work item

LDS on Evergreen could do 2560 bytes/clock. 128 bytes/clock/core. 8 bytes/clock/active work item

Obviously that is peak, and relied on the compiler generating 64-bit reads from LDS in the right layout. 64-bit reads is the way to achieve peak LDS bandwidth (not 128-bit, that's a fast route to bank conflicts). Cache may be more forgiving, though I imagine the cache interface also has bank conflicts on reading from a cache line.

Cayman is, I think, pretty similar if you scale up by 20% for the extra four cores. Tahiti will scale again but I haven't looked into the LDS or L1 interface changes resulting from the restructuring of the SIMD core.


Sorry, I just wanted to say that the GlobalMemoryBandwidth benchmark isn't a fair global memory benchmark; they should call it something like L1CacheMemoryBandwidth. A simple kernel where one work-item reads one data element from global memory, with no broadcast and no data reuse (even if the cache is used, for example to prefetch the next access), would demonstrate global memory performance more clearly.

Now more important thing. SDK 2.6 documentation says:

"The theoretical LDS peak bandwidth is 2 TB/s, compared to L1 at 1 TB/sec. Currently, OpenCL is limited to 1 TB/sec LDS bandwidth."

I think that explains the curious results.
