6 Replies Latest reply on Feb 23, 2012 11:37 AM by dmeiser

    bandwidth of texture units in gcn

    dmeiser

      Hi,

       

      I've been trying to find the theoretical peak bandwidth of the texture units on Radeon HD 7970 GPUs but haven't been able to find a number. Could somebody point me to a spec sheet that lists it?

       

      The reason I ask is that in the past it was often recommended to use the texture units to read from global memory in bandwidth-limited kernels. On a 5870, for example, the aggregate texture bandwidth is ~1 TB/s, about 5 times the bandwidth from global memory. Is this still a recommended optimization for GCN GPUs?

       

      Thanks,

      Dominic

        • Re: bandwidth of texture units in gcn
          MicahVillmow

          It is not recommended to use images unless you require the actual image functionality. On Evergreen and later hardware, global memory can achieve the same performance as long as certain constraints are observed (for example, don't use atomic, byte, or short operations on the same pointers that are the memory bottleneck).

            • Re: bandwidth of texture units in gcn
              notyou

              MicahVillmow wrote:

               

              It is not recommended to use images unless you require the actual image functionality. On Evergreen and later hardware, global memory can achieve the same performance as long as certain constraints are observed (for example, don't use atomic, byte, or short operations on the same pointers that are the memory bottleneck).

              That's very interesting. Do you have a reference for this somewhere in the documentation? I would have thought that texture memory (hardware-based memory optimized for 2D/image accesses, correct?) would be faster in most or all situations.

              • Re: bandwidth of texture units in gcn
                dmeiser

                Dear Micah,

                 

                Thanks a lot for your answer. That makes sense.

                 

                Let me ask a follow-up question. My original question was asked with the following paper in mind:

                 

                http://www.dcs.warwick.ac.uk/~sdh/pmbs10/pmbs10/Workshop_Programme_files/fastgemm.pdf

                 

                In that paper the author constructs a kernel for dense matrix-matrix multiplication. In discussing the algorithm, the author states that on the Cypress architecture each texture unit has a bandwidth of about 50 GB/s, leading to an aggregate bandwidth of about 1 TB/s for a 5870. Is this bandwidth only achieved if the texture units can exploit a lot of data reuse (which is possible in the dense matrix-matrix kernel the paper discusses)? Can the same caching be achieved on more recent hardware with "normal" accesses to global memory?
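As a rough sanity check on the figures quoted above, the arithmetic can be sketched in a few lines of Python. The per-unit (~50 GB/s) and aggregate (~1 TB/s) numbers are the ones cited from the paper. The 153.6 GB/s global memory bandwidth for the 5870 is an assumption taken from the card's published specification, not something stated in this thread, and the function names are purely illustrative.

```python
# Back-of-the-envelope check of the texture bandwidth figures quoted above.

def implied_unit_count(aggregate_gbps: float, per_unit_gbps: float) -> float:
    """Number of texture units implied by the aggregate and per-unit figures."""
    return aggregate_gbps / per_unit_gbps


def bandwidth_ratio(texture_gbps: float, global_gbps: float) -> float:
    """How many times larger the texture bandwidth is than global memory bandwidth."""
    return texture_gbps / global_gbps


PER_UNIT_GBPS = 50.0         # per texture unit, as quoted from the paper
AGGREGATE_GBPS = 1000.0      # ~1 TB/s aggregate, as quoted from the paper
ASSUMED_GLOBAL_GBPS = 153.6  # ASSUMPTION: HD 5870 spec-sheet memory bandwidth

units = implied_unit_count(AGGREGATE_GBPS, PER_UNIT_GBPS)
ratio = bandwidth_ratio(AGGREGATE_GBPS, ASSUMED_GLOBAL_GBPS)

print(f"implied texture units: {units:.0f}")
print(f"texture / global bandwidth: {ratio:.1f}x")
```

Under that assumption the ratio comes out somewhat above the "about 5 times" mentioned earlier in the thread. Either way, these are peak cache-side numbers, so they only matter for kernels with enough data reuse to hit in the texture cache.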

                 

                Cheers,

                Dominic

                 

                Edit: This was already answered in the previous post by Micah while I typed. Thanks again.