Hey guys, I ran some tests using the benchmarks in the AMD APP SDK v2.4. There is a benchmark called 'GlobalMemoryBandwidth', which tests global memory bandwidth using different memory access patterns. There are four access patterns in this benchmark: read linear, read linear uncached, read single, and write linear.
When using 'read linear', I found that float1 performs much better than float4, with a bandwidth of 263 GB/s vs. 155 GB/s. Undoubtedly, float1 is exploiting the cache, since its bandwidth is much higher than the theoretical peak of 153.6 GB/s. But why can't float4 exploit the cache? Can anybody tell me the reason? Thanks a lot.
I am using HD5870, and AMD APP SDK v2.4. The OS is Ubuntu 10.04.
Originally posted by: notzed I'd guess that it's just because float1s are smaller. There's only a very small amount of cache on each CU, e.g. 8 KB.
If you have 256 work items reading consecutive addresses, that is 1024 bytes for float1 and 4096 bytes for float4. And the hardware will be scheduling more than one workgroup per CU.
Even without knowing anything about the cache associativity, one could still conclude it won't be as effective with float4, simply because the working set is 4x bigger.
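The arithmetic in that reply can be checked with a quick back-of-the-envelope script. Note the 8 KB L1-per-CU figure is the assumption quoted in this thread, not a value queried from the device:

```python
# Per-workgroup read footprint for 256 work items reading consecutive
# addresses, vs. an assumed 8 KB of L1 cache per CU (figure from this thread).
WORK_ITEMS = 256
L1_BYTES = 8 * 1024  # assumption: 8 KB L1 per compute unit

for name, elem_bytes in [("float1", 4), ("float4", 16)]:
    footprint = WORK_ITEMS * elem_bytes
    groups_fitting = L1_BYTES // footprint
    print(f"{name}: {footprint} bytes per workgroup, "
          f"~{groups_fitting} workgroups' worth fits in L1")
```

So with several workgroups resident per CU, float1 footprints can still coexist in an 8 KB cache while float4 footprints start evicting each other, which is the intuition behind the reply above.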
Hmm, your explanations are reasonable, but not very persuasive. The thing is, we should first determine whether this is really a cache effect. What are the bandwidths when there is no cache effect? Is there any way to disable the cache?
The only way to avoid the cache would be to make sure every work item does not access data previously or currently accessed by another work item, so the L1 & L2 caches are not used, and also that sequential reads within a work item are at least 128 bits apart, to avoid reads from the same cache line.
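One way to picture such a pattern: give each work item its own disjoint set of cache lines, striding by the total number of work items each iteration. This is only a host-side simulation of the addressing, not GPU code, and the 64-byte cache-line size is an assumption for illustration:

```python
# Simulate a cache-avoiding address pattern: on read k, work item i touches
# cache line (k * N_ITEMS + i), so no line is ever shared between items,
# and consecutive reads by one item are N_ITEMS lines apart (>> 128 bits).
N_ITEMS = 64   # hypothetical number of work items
READS = 4      # sequential reads per work item
LINE = 64      # assumed cache-line size in bytes

accesses = {i: [(k * N_ITEMS + i) * LINE for k in range(READS)]
            for i in range(N_ITEMS)}

# Verify the lines touched by different work items are disjoint.
all_lines = [a // LINE for addrs in accesses.values() for a in addrs]
assert len(all_lines) == len(set(all_lines))
print("no cache line shared; per-item stride =", N_ITEMS * LINE, "bytes")
```

With every access landing on a fresh cache line, each read has to go to memory, so the measured bandwidth should approach the uncached figure.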