Hi all,
I ran the GlobalMemoryBandwidth sample on an AMD 7970, and the results are the following:
Platform 0 : Advanced Micro Devices, Inc.
Platform found : Advanced Micro Devices, Inc.
Selected Platform Vendor : Advanced Micro Devices, Inc.
Device 0 : Tahiti Device ID is 005EF838
Build Options are : -D DATATYPE=float4 -D OFFSET=16384
Global Memory Read
AccessType : single
VectorElements : 4
Bandwidth : 1672.16 GB/s
Global Memory Read
AccessType : linear
VectorElements : 4
Bandwidth : 1756.49 GB/s
Global Memory Read
AccessType : linear(uncached)
VectorElements : 4
Bandwidth : 220.769 GB/s
Global Memory Write
AccessType : linear
VectorElements : 4
Bandwidth : 501.668 GB/s
The question is: how can the linear read bandwidth be over 1 terabyte per second if the max theoretical bandwidth is about 260 GB/s?
The question that naturally follows is: are global buffer reads cached?
Thank you very much!
Yes, it is cached.
Is Tahiti the first architecture that caches global memory in addition to texture memory?
On Tahiti, the L1 caches both read data and write data for global memory. Before Tahiti, the L1 could not cache write data.
Another question
How is the L1 cache partitioned among the wavefronts/work items scheduled on a CU?
There is one L1 cache per CU, so one wavefront (64 work-items) shares one L1 cache. When different work-items fetch data from global memory, I think it uses coalesced reads/writes: each time, data is brought in for 16 work-items, and that data can be cached in L1.
Well, but is the entire L1 shared between the work-groups scheduled on a CU, or is it partitioned among them?
Thank you!
Yes, there is cache contention between different wavefronts scheduled on a single CU.
So if two work-items in two different wavefronts access the same buffer element, two different pages are transferred into the L1 cache. Right? (Or rather, the same page is transferred twice?)
No, that's a cache hit.
A typical cache contention situation would be the following:
- Wavefront 1 reads data from global memory and brings it into the cache.
- Subsequently, wavefront 2 reads data from global memory, brings it into the cache, and evicts wavefront 1's data.
- Wavefront 1 then has to read from global memory again (its data is no longer in the cache).
If only one wavefront is running on a compute unit, the second read by wavefront 1 would be cached.
If you anticipate that cache contention like this is a performance limitation in your code, you can limit the number of wavefronts executing on any CU by allocating so much shared memory that only one wavefront fits on a CU.
On the other hand, when lots of wavefronts access the same data, you can get huge effective bandwidth gains by having many wavefronts execute on the same CU. Many of their reads will then be cached.
Per my understanding, you should compare against the L1 cache bandwidth instead of the global memory bandwidth. When the data is cached, access is much faster.