The GCN docs indicate the L1 data cache bandwidth is 64 bytes/clock, but don't provide any details as to how this is arbitrated between the 4 SIMD units in each CU. Is it banked so each SIMD can get 16 bytes per clock, or does one SIMD at a time get a full cache line by some arbitration mechanism?
Hi Ralph,
Here is the suggested response from the relevant team:
“One wavefront is serviced at a time (over some number of clocks), so it’s best if wavefronts fetch one or more entire cachelines to get peak L1$ bandwidth.”
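A rough back-of-the-envelope sketch of why whole-cacheline fetches matter, assuming the figures discussed here (64-byte L1 cache lines and 64 bytes/clock of L1 bandwidth are assumptions, not confirmed numbers):

```python
# Minimum clocks to service one wavefront's load at the L1, counting
# distinct cache lines touched. Assumes 64-byte lines and 64 B/clock.
CACHELINE_BYTES = 64
L1_BYTES_PER_CLOCK = 64

def clocks_to_service(addresses):
    """Each distinct cache line moves CACHELINE_BYTES in one clock,
    so the service time scales with the number of lines touched."""
    lines = {addr // CACHELINE_BYTES for addr in addresses}
    return len(lines) * CACHELINE_BYTES // L1_BYTES_PER_CLOCK

# 64 lanes loading consecutive dwords: 64 * 4 = 256 bytes = 4 lines.
coalesced = [lane * 4 for lane in range(64)]
print(clocks_to_service(coalesced))  # 4
```

Under these assumptions a fully coalesced 64-lane dword load is serviced in 4 clocks at full L1 bandwidth.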
Regards,
Does "some number of clocks" mean a variable number, or is it a fixed number (say, 4) and the team just isn't being specific?
If it's variable, does that mean all (up to 64) memory reads that have returned from the L2 are serviced before the next wavefront gets its turn?
In other words, if a wavefront has executed FLAT_LOAD_DWORD and each of the 64 threads is loading from a different random address in memory, will it take 64 consecutive cycles to service that wavefront?
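The scattered case above can be sketched numerically; this is a hypothetical worst case assuming a 64-byte L1 line size and one line per clock, not confirmed hardware behavior:

```python
LANES = 64
BYTES_PER_LANE = 4       # FLAT_LOAD_DWORD loads one dword per lane
CACHELINE_BYTES = 64     # assumed GCN L1 line size

# Worst case: every lane's random address falls in a different line,
# so the L1 must move a full 64-byte line for each 4-byte request.
lines_touched = LANES                       # 64 distinct lines
clocks = lines_touched                      # one line serviced per clock
useful = LANES * BYTES_PER_LANE             # 256 bytes actually requested
moved = lines_touched * CACHELINE_BYTES     # 4096 bytes pulled through L1
print(clocks, useful / moved)  # 64 clocks, only 6.25% of the bandwidth useful
```

If these assumptions hold, a fully scattered wavefront takes 16x longer to service than a coalesced one, which is consistent with the team's advice to fetch entire cachelines.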
The number of clocks may vary depending on the locality of the requested data.
Given that there is very little control over placement or scheduling, could you let us know how this information would help you? Understanding your requirements may help the team provide the information you need.
Regards,
I'm writing code that is limited by memory bandwidth. I can reach around 90% of the GDDR5 memory bus bandwidth, but I believe I should be able to reach around 95% (100% is impossible because refresh consumes some bandwidth). I've written a memory bandwidth test program that demonstrates this effect: