
cadorino
Journeyman III

Strange global memory bandwidth for HD 7970

Hi to all,

I ran the GlobalMemoryBandwidth sample on an AMD HD 7970 and the result is the following:

Platform 0 : Advanced Micro Devices, Inc.

Platform found : Advanced Micro Devices, Inc.

Selected Platform Vendor : Advanced Micro Devices, Inc.

Device 0 : Tahiti Device ID is 005EF838

Build Options are : -D DATATYPE=float4 -D OFFSET=16384

Global Memory Read

AccessType      : single

VectorElements  : 4

Bandwidth       : 1672.16 GB/s

Global Memory Read

AccessType      : linear

VectorElements  : 4

Bandwidth       : 1756.49 GB/s

Global Memory Read

AccessType      : linear(uncached)

VectorElements  : 4

Bandwidth       : 220.769 GB/s

Global Memory Write

AccessType      : linear

VectorElements  : 4

Bandwidth       : 501.668 GB/s

The question is: how can the linear read bandwidth be over 1 terabyte per second if the max theoretical bandwidth is about 260 GB/s?

The question that naturally follows is: are global buffer reads cached?

Thank you very much!

10 Replies
nou
Exemplar

Yes, it is cached.


Is Tahiti the first architecture that caches global memory in addition to texture memory?


On Tahiti, the L1 caches both read and write data for global memory. Before Tahiti, the L1 could not cache write data.


Another question
How is the L1 cache partitioned among the wavefronts/work items scheduled on a CU?


There is one L1 per CU, so one wavefront (64 work-items) shares one L1 cache. When different work-items fetch data from global memory, I think the hardware coalesces the reads/writes, bringing in data for 16 work-items at a time; that data can then be cached in L1.


Well, but is the entire L1 shared between the work-groups scheduled on a CU, or is it partitioned among them?
Thank you!


It is shared: there is cache contention between the different wavefronts scheduled on a single CU.


So if two work-items in two different wavefronts access the same buffer element, two different pages are transferred into the L1 cache. Right? (Or rather, the same page is transferred twice?)


No, that's a cache hit.

A typical cache contention situation would be the following:

- wavefront 1 reads data from global memory and brings it into the cache

- subsequently, wavefront 2 reads data from global memory, brings it into the cache, and evicts wavefront 1's data from the cache

- wavefront 1 then has to read from global memory again (its data is no longer in the cache).

If only one wavefront is running on a compute unit, the second read by wavefront 1 would be cached.

If you anticipate that cache contention like this will limit performance in your code, you can restrict the number of wavefronts executing on any CU by allocating so much local (shared) memory that only one work-group fits on a CU.

On the other hand, when lots of wavefronts access the same data you can get huge effective bandwidth gains by having many wavefronts execute on the same CU. Many of their reads will then be cached.

registerme
Journeyman III

Per my understanding, you should compare against the L1 cache bandwidth rather than the global memory bandwidth. When the data is cached, access is much faster.
