In many situations, global memory bandwidth is a performance bottleneck.
If two or more threads in the same work group read the same data from global memory at the same time (or nearly the same time), will the GPU read the data only once and broadcast it to all of the threads that need it?
What if the threads in DIFFERENT work groups read the same data from global memory?
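For concreteness, here is a minimal kernel sketching the access pattern I mean (the names are just for illustration):

```c
// Hypothetical OpenCL kernel: every work item reads the SAME
// element lut[0] from global memory at (roughly) the same time.
// The question: does the hardware issue one read and broadcast
// the value, or one global read per work item?
__kernel void broadcast_read(__global const float *lut,
                             __global float *out)
{
    size_t gid = get_global_id(0);
    float shared_value = lut[0];   // identical address for all work items
    out[gid] = shared_value * (float)gid;
}
```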
Thank you in advance!
Hi,
The second read from the same (or a nearby) location will be a cache hit, so it will be much faster than the first, possibly uncached, read. This is behaviour somewhat similar to the broadcast you describe, but produced in an indirect way.
But if there is a way to avoid this redundant memory I/O in your program, you should of course do it on the software side.
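One common software-side approach is to stage the shared data in local memory (LDS) once per work group, so each global element is fetched only once per group. A rough sketch, assuming the shared table fits within the local memory limits (TABLE_SIZE is an assumed example size):

```c
#define TABLE_SIZE 256  // assumed to fit in local memory

__kernel void staged_read(__global const float *table,
                          __global float *out)
{
    __local float tile[TABLE_SIZE];
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);

    // Cooperative load: work items stride over the table together,
    // so each global element is read exactly once per work group.
    for (size_t i = lid; i < TABLE_SIZE; i += get_local_size(0))
        tile[i] = table[i];
    barrier(CLK_LOCAL_MEM_FENCE);  // wait until the copy is complete

    // All subsequent reads hit the fast local copy.
    out[gid] = tile[gid % TABLE_SIZE];
}
```

After the barrier, repeated reads of the same element come from local memory instead of competing for global bandwidth.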
This is probably a stupid question, but I want to know it. Any suggestion will be deeply appreciated.
Hi realhet,
Thank you very much for your answer. I think you are right; I was also expecting the cache to do the job. In my experiments, however, I found that when a large amount of data is read, the cache does not help much: reading the same amount of data from distinct locations is much faster than reading it from overlapping locations. Could that be caused by bank conflicts?
With the cache you can do random memory accesses, but the accessed range must fit within the cache.
If you read a large amount of data linearly, the data cannot stay resident in the cache, so raw memory bandwidth becomes the limit; the cache then only helps hide some latency by reading ahead.
Overlapped locations: I guess that on newer cards there is a mechanism that analyzes memory read patterns and intelligently prefetches the cache from predicted locations. If you read in an unpredictable pattern, this mechanism cannot work, and your program has to wait out more memory-access latencies.
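To illustrate the difference between a predictable and an unpredictable pattern, compare a linear (coalesced) read with a data-dependent gather (a sketch; `idx` stands for an arbitrary permutation supplied by the host):

```c
// Linear read: consecutive work items touch consecutive addresses,
// so the hardware can coalesce accesses and prefetch the next lines.
__kernel void linear_read(__global const float *in,
                          __global float *out)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid];
}

// Indirect gather: the address depends on loaded data, so read-ahead
// cannot predict it, and each cache miss pays the full memory latency.
__kernel void gather_read(__global const float *in,
                          __global const uint *idx,  // arbitrary permutation
                          __global float *out)
{
    size_t gid = get_global_id(0);
    out[gid] = in[idx[gid]];
}
```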
Hi Realhet,
Thank you for your comments.
I am reading a large amount of data, exceeding the capacity of the cache. Even so, I would expect all of the data to pass through the cache, letting threads in the same group share the fetched data before it is flushed and replaced with new data. Consequently, reading overlapping data should be more efficient than reading sparse data. However, my experiments show the opposite.