5 Replies Latest reply on May 23, 2012 2:23 PM by viscocoa

    Question regarding memory access

    viscocoa

      In many circumstances, the bandwidth of global memory is a performance bottleneck.

       

      If two or more threads in the same work group read the same data from global memory, at the same time or almost the same time, will the GPU read the data only once, and broadcast the data to all threads requiring it?

       

      What if the threads in DIFFERENT work groups read the same data from global memory?

       

      Thank you in advance!

        • Question regarding memory access
          viscocoa

          This is probably a stupid question, but I would like to know the answer. Any suggestion will be deeply appreciated.

          • Re: Question regarding memory access
            realhet

            Hi,

             

            The second read from the same (or a nearby) location will be a cache hit, so it will be much faster than the first, possibly uncached, read. This is somewhat similar to the behaviour you describe, but it happens indirectly rather than as an explicit broadcast.

             

            But if there is a way to eliminate this redundant memory I/O in your program, you should of course do it on the software side.
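
            One common software-side fix is the "stage once, share many times" pattern: one pass copies a tile of global memory into fast __local memory, a barrier synchronizes the work-group, and then every work-item reads the shared copy instead of re-fetching from global memory. The sketch below simulates that pattern on the CPU in plain C (the work-group is just a loop, and the `group_sum_tile` name and tile/group sizes are illustrative choices, not anything from OpenCL itself):

```c
#include <stddef.h>

/* CPU sketch of the OpenCL local-memory staging pattern:
 * in a real kernel, work-items would cooperatively copy a tile of
 * global memory into __local memory, call barrier(CLK_LOCAL_MEM_FENCE),
 * and then each work-item would read the shared local copy instead of
 * issuing its own global read. */

#define GROUP_SIZE 8
#define TILE_SIZE  8

/* Each simulated "work-item" sums the whole tile, but the tile is
 * fetched from "global" memory exactly once. */
static void group_sum_tile(const int *global_data, size_t tile_offset,
                           int results[GROUP_SIZE])
{
    int local_tile[TILE_SIZE];   /* stands in for __local memory */

    /* Stage: copy the tile from global to local memory once. */
    for (int i = 0; i < TILE_SIZE; i++)
        local_tile[i] = global_data[tile_offset + i];

    /* In OpenCL, barrier(CLK_LOCAL_MEM_FENCE) would go here. */

    /* Every work-item now reads only the shared local copy. */
    for (int lid = 0; lid < GROUP_SIZE; lid++) {
        int sum = 0;
        for (int i = 0; i < TILE_SIZE; i++)
            sum += local_tile[i];
        results[lid] = sum;
    }
}
```

            With this structure the redundant global reads are gone regardless of whether the hardware broadcasts coincident reads, which is why it is the usual recommendation when many work-items in a group need the same data.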

              • Re: Question regarding memory access
                viscocoa

                Hi realhet,

                 

                Thank you very much for your answer. I think you are right; I was also expecting the cache to do the job. In my experiments, however, I found that when a large amount of data is read, the cache does not help much: reading the same amount of data from different locations is much faster than reading from overlapping locations. Could that be caused by bank conflicts?

                  • Re: Question regarding memory access
                    realhet

                    When using the cache you can do random memory accesses, but the accessed memory range must fit within the cache.

                    If you read a large amount of data linearly, the cache cannot eliminate the traffic, so memory bandwidth becomes the limit; the cache only helps to hide some latency by reading ahead.

                    Overlapped locations: I guess that on newer cards there is a mechanism that analyzes memory read patterns and intelligently preloads the cache from predicted locations. If you read in an unpredictable pattern, this mechanism cannot work at all, and your program has to wait out the full memory access latency more often.
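
                    To make the "predictable vs. unpredictable" distinction concrete, here is a small CPU-side C sketch (the function names, buffer size, and stride are my own illustrative choices). Both traversals do identical arithmetic over the same buffer, but the linear walk presents a trivially predictable address stream that a prefetcher can read ahead of, while the large-stride walk jumps across cache lines and defeats prediction:

```c
#include <stddef.h>

/* Two traversals of the same buffer with identical work but very
 * different access patterns.  The linear walk is predictable, so
 * read-ahead can hide memory latency; the strided walk touches a
 * distant location on every step, so prediction fails and each miss
 * pays the full latency.  (Illustrative sketch; actual timings depend
 * on the specific cache and prefetch hardware.) */

#define N      (1 << 16)
#define STRIDE 4099   /* odd, hence coprime with N: visits every index once */

static long sum_linear(const int *buf)
{
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        sum += buf[i];                 /* addresses: i, i+1, i+2, ... */
    return sum;
}

static long sum_strided(const int *buf)
{
    long sum = 0;
    size_t idx = 0;
    for (size_t i = 0; i < N; i++) {
        sum += buf[idx];               /* jumps STRIDE elements each step */
        idx = (idx + STRIDE) % N;      /* still visits every element once */
    }
    return sum;
}
```

                    Both functions return the same sum, so any speed difference you measure between them is purely the cost of the access pattern, not of the computation.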

                      • Re: Question regarding memory access
                        viscocoa

                        Hi Realhet,

                         

                        Thank you for your comments.

                         

                        I am reading a large amount of data, exceeding the capacity of the cache. Even so, I would expect the data to be cached, allowing threads in the same group to share the fetched data before it is evicted to make room for new data. Consequently, reading overlapping data should be more efficient than reading sparse data. However, my current experiments show the opposite.