Hi,
I've been trying to find the theoretical peak bandwidth of the texture units on Radeon HD 7970 GPUs, but I haven't been able to find anything. Could somebody point me to a spec sheet that has that number?
The reason I ask is that in the past it was often recommended to use the texture units for reading from global memory in bandwidth-limited kernels (e.g. on a 5870 the aggregate texture bandwidth is ~1 TB/s, about 5 times the bandwidth of global memory). Is this still a recommended optimization for GCN GPUs?
Thanks,
Dominic
It is not recommended to use images unless you require the actual image functionality. On Evergreen and later hardware, global memory can achieve the same performance as long as certain constraints are observed (for example, don't use atomics or byte/short operations on the same pointers that are the memory bottleneck).
MicahVillmow wrote:
It is not recommended to use images unless you require the actual image functionality. On Evergreen and later hardware, global memory can achieve the same performance as long as certain constraints are observed (for example, don't use atomics or byte/short operations on the same pointers that are the memory bottleneck).
That's very interesting. Do you have a reference for this somewhere in the documentation? I would have thought that texture memory (hardware-based memory optimized for 2D/image accesses, correct?) would be faster in most or all situations.
Cached global memory and images both go through the same hardware. So as long as your access pattern doesn't cause bank conflicts and your pointers get cached, there should be no performance delta.
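To make the point about access patterns concrete, here is a toy model of how a strided access pattern can pile onto a single memory channel. It is only a sketch: the channel count (8) and interleave size (256 bytes) are assumed illustrative values, not figures from this thread, and `channel_of` and `channel_histogram` are hypothetical helper names. Real hardware also has banks within each channel, which this model ignores.

```python
from collections import Counter

# Assumed, illustrative values (not taken from this thread).
NUM_CHANNELS = 8        # number of memory channels
INTERLEAVE_BYTES = 256  # consecutive bytes served by one channel before switching

def channel_of(byte_address):
    """Map a byte address to a channel under the assumed interleaving."""
    return (byte_address // INTERLEAVE_BYTES) % NUM_CHANNELS

def channel_histogram(num_work_items, stride_bytes):
    """Count how many work-items hit each channel when item i reads address i * stride."""
    return Counter(channel_of(i * stride_bytes) for i in range(num_work_items))

# Consecutive 32-bit words: accesses spread evenly over all channels.
unit_stride = channel_histogram(4096, 4)

# Stride equal to NUM_CHANNELS * INTERLEAVE_BYTES: every access lands on channel 0.
bad_stride = channel_histogram(4096, NUM_CHANNELS * INTERLEAVE_BYTES)

print(sorted(unit_stride.items()))
print(sorted(bad_stride.items()))
```

Under these assumptions the unit-stride pattern puts 512 accesses on each of the 8 channels, while the large power-of-two stride sends all 4096 accesses to one channel. That kind of pattern limits bandwidth whether the data is read through a buffer or an image.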
Did the memory subsystem change with GCN, or has this been around since Evergreen? I thought global memory was never cached, unless you use constant/texture memory.
Global memory caching support was added in SDK 2.3 and has been improved since then. We now have caching optimizations that are enabled by default, based on the kernel, without user intervention. On GCN, unlike Evergreen, hardware caching is the default, so it requires even less support to get higher bandwidth.
Dear Micah,
Thanks a lot for your answer. That makes sense.
Let me ask a follow-up question. My original question was with the following paper in mind:
http://www.dcs.warwick.ac.uk/~sdh/pmbs10/pmbs10/Workshop_Programme_files/fastgemm.pdf
In that paper, the author constructs a kernel for dense matrix-matrix multiplication. In the discussion of the algorithm, the author states that on the Cypress architecture each texture unit has a bandwidth of about 50 GB/s, leading to an aggregate bandwidth of about 1 TB/s for a 5870. Is this bandwidth only achieved if the texture units can exploit a lot of data reuse (which is possible in the dense matrix-matrix kernel they discuss)? Can the same caching be achieved on more recent hardware with "normal" accesses to global memory?
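As a rough sanity check of the figures quoted above, the arithmetic can be sketched as follows. The per-unit and aggregate numbers are the approximate ones from this post. The off-chip memory bandwidth of 153.6 GB/s for the 5870 is an assumption added here for comparison; it does not appear in the thread.

```python
# Approximate figures quoted in this thread.
PER_UNIT_GBPS = 50.0     # per texture unit on Cypress, as quoted from the paper
AGGREGATE_GBPS = 1000.0  # ~1 TB/s aggregate on a 5870, as quoted

# Assumption (not stated in the thread): peak off-chip memory bandwidth of the 5870.
ASSUMED_DRAM_GBPS = 153.6

# Number of units implied by the two quoted figures.
implied_units = AGGREGATE_GBPS / PER_UNIT_GBPS

# How far the aggregate figure exceeds what off-chip memory can deliver.
ratio = AGGREGATE_GBPS / ASSUMED_DRAM_GBPS

print(f"implied number of units: {implied_units:.0f}")        # 20
print(f"aggregate vs. off-chip bandwidth: {ratio:.1f}x")      # 6.5x
```

Since the aggregate figure is several times larger than the assumed off-chip bandwidth, it can presumably only be sustained when most reads are served from cache, i.e. when there is substantial data reuse.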
Cheers,
Dominic
Edit: This was already answered in the previous post by Micah while I typed. Thanks again.