dmeiser
bandwidth of texture units in gcn

Hi,

I've been trying to find the theoretical peak bandwidth of the texture units on Radeon HD 7970 GPUs but haven't been able to find anything. Could somebody point me to a spec sheet that has that number?

The reason I ask is that in the past it was often recommended to use the texture units for reading from global memory in bandwidth-limited kernels (e.g. on a 5870 the aggregate texture bandwidth is ~1 TB/s, about 5 times the bandwidth from global memory). Is this still a recommended optimization for GCN GPUs?

Thanks,

Dominic

It is not recommended to use images unless you require the actual image functionality. On Evergreen and later hardware, global memory can achieve the same performance as long as certain constraints are observed (e.g. don't use atomics or byte/short operations on the same pointers that are the memory bottleneck).

MicahVillmow wrote:

It is not recommended to use images unless you require the actual image functionality. On Evergreen and later hardware, global memory can achieve the same performance as long as certain constraints are observed (e.g. don't use atomics or byte/short operations on the same pointers that are the memory bottleneck).

That's very interesting. Do you have a reference for this somewhere in the documentation? I would have thought that texture memory (hardware-based memory optimized for 2D/image accesses, correct?) would be faster in most or all situations.

Cached global memory and images both go through the same hardware. So as long as your access pattern doesn't cause bank conflicts and your pointers get cached, there should be no performance difference.
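To make the point about access patterns concrete, here is a toy Python model of how the byte stride between consecutive work-groups decides how many memory channels are kept busy. It is only a sketch: `CHANNEL_BYTES`, `NUM_CHANNELS`, and the helper `channels_hit` are hypothetical names, and the values are illustrative assumptions rather than figures from this thread or from a spec sheet. Real hardware maps addresses to channels and banks in a more involved way.

```python
# Toy model of memory-channel conflicts. All constants are assumptions
# chosen for illustration, not confirmed hardware parameters.
CHANNEL_BYTES = 256  # assumed interleave granularity between channels
NUM_CHANNELS = 8     # assumed number of memory channels


def channels_hit(group_stride_bytes, num_groups=16, base=0):
    """Return the set of channels touched when work-group g starts
    reading at address base + g * group_stride_bytes."""
    return {
        ((base + g * group_stride_bytes) // CHANNEL_BYTES) % NUM_CHANNELS
        for g in range(num_groups)
    }


if __name__ == "__main__":
    # 256: groups spread over channels; 2048: every group lands on the
    # same channel; 2304: 2048 plus 256 bytes of padding spreads them again.
    for stride in (256, 2048, 2304):
        print(f"stride {stride:5d} bytes -> "
              f"{len(channels_hit(stride))} channel(s) in use")
```

In this model a stride that is a multiple of `CHANNEL_BYTES * NUM_CHANNELS` serializes all work-groups on one channel, while padding the stride by one channel width spreads them out again. This is the kind of pattern to watch for whether the data is read through a pointer or through an image.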

Did the memory subsystem change with GCN, or has this been around since Evergreen? I thought global memory was never cached, unless you use constant/texture memory.

Global memory caching support was added starting in SDK 2.3 and has been improved since then. Now we have caching optimizations that are enabled by default based on the kernel, without user intervention. On GCN, unlike Evergreen, hardware caching is the default, so it requires even less support to get higher bandwidth.

Dear Micah,

Thanks a lot for your answer. That makes sense.

Let me ask a follow-up question. My original question was with the following paper in mind:

http://www.dcs.warwick.ac.uk/~sdh/pmbs10/pmbs10/Workshop_Programme_files/fastgemm.pdf

In that paper they construct a kernel for dense matrix-matrix multiplication. In the discussion of the algorithm the author states that on the Cypress architecture each texture unit has a bandwidth of about 50 GB/s, leading to an aggregate bandwidth of about 1 TB/s for a 5870. Is this bandwidth only achieved if the texture units can exploit a lot of data reuse (which is possible in the dense matrix-matrix kernel they discuss)? Can the same caching be done on more recent hardware by doing "normal" accesses to global memory?
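As a back-of-envelope check of those figures, the arithmetic can be sketched in a few lines of Python. The per-unit number is the one quoted above. The unit count (20, one per SIMD engine) and the global memory parameters (256-bit bus at 4.8 Gbit/s per pin) are assumptions about the 5870 that are not stated in this thread, and `aggregate_bandwidth` is just a hypothetical helper name.

```python
# Rough arithmetic behind the quoted bandwidth figures.
PER_UNIT_GBPS = 50.0  # per-texture-unit bandwidth as quoted from the paper
UNITS = 20            # assumption: one texture unit per SIMD engine on a 5870
BUS_WIDTH_BITS = 256  # assumption: memory bus width of a 5870
PIN_RATE_GBIT = 4.8   # assumption: effective data rate per pin

# Peak global memory bandwidth in GB/s: bytes per transfer times rate.
GLOBAL_GBPS = BUS_WIDTH_BITS / 8 * PIN_RATE_GBIT


def aggregate_bandwidth(per_unit_gbps, units):
    """Aggregate bandwidth if every unit sustains its peak at the same time."""
    return per_unit_gbps * units


agg = aggregate_bandwidth(PER_UNIT_GBPS, UNITS)
ratio = agg / GLOBAL_GBPS

if __name__ == "__main__":
    print(f"aggregate texture bandwidth: {agg:.0f} GB/s")
    print(f"peak global memory bandwidth: {GLOBAL_GBPS:.1f} GB/s")
    print(f"ratio: {ratio:.1f}x")
```

Under these assumptions the aggregate works out to the ~1 TB/s quoted above, and the ratio to global memory is of the same order as the "about 5 times" mentioned in the original question. The aggregate can only be sustained when most reads hit in cache, which is why data reuse matters for that figure.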

Cheers,

Dominic

Edit: This was already answered in the previous post by Micah while I typed. Thanks again.
