Why exactly is the MatrixMulImage example provided with the new SDK so much faster than the standard MatrixMultiplication? It is something like 2 or 3 times faster.
It looks like the only difference is the memory access, done via the new image support. Memory pinning, texture cache, etc.: what is the source of the huge performance boost from using an image instead of a normal memory buffer?
Short answer is: cache. The TU (texture unit) uses the cache; normal memory access doesn't (in theory it should use the vertex cache on the 5xxx series, but it doesn't, or not always, or something like that). Also, the example is tailored for a little more efficient cache reuse.
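To make the difference concrete, here is a minimal OpenCL C sketch of the two access styles. The kernel names, signatures, and the simplified inner loop are illustrative only, not the actual SDK samples' code; the point is just which hardware path each read takes:

```c
/* OpenCL C kernel fragments (device code, not standalone host code).
   Hypothetical, simplified versions for illustration. */

/* Buffer version: reads go to global memory directly, bypassing the
   texture cache on these chips. */
__kernel void matmul_buffer(__global const float4 *A,
                            __global const float4 *B,
                            __global float4 *C, int widthA)
{
    /* ... same loop as below, but with A[...] / B[...] buffer loads ... */
}

/* Image version: read_imagef goes through the texture unit, so every
   fetch can hit the L1 texture cache. */
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;

__kernel void matmul_image(__read_only image2d_t A,
                           __read_only image2d_t B,
                           __global float4 *C, int widthA)
{
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    float4 sum = (float4)(0.0f);
    for (int k = 0; k < widthA; k++) {
        float4 a = read_imagef(A, smp, (int2)(k, pos.y));
        float4 b = read_imagef(B, smp, (int2)(pos.x, k));
        sum += a * b; /* simplified: a real float4-packed matmul needs
                         per-component accumulation, omitted here */
    }
    C[pos.y * get_global_size(0) + pos.x] = sum;
}
```

The host-side setup (clCreateImage2D vs. clCreateBuffer) differs too, but the kernel-side read path is where the cache behavior comes from.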
Btw, what is the performance of these functions in GFLOPS? I assume it is single precision; what about double precision, now that multiplication and addition are supported?
On a 5870, single precision is ~1 TFLOPS. Double precision is probably ~200-300 GFLOPS. More on this here: http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=127963&enterthread=y .
Maybe Teraflop?
Yep. My mistake .
Thanks. lol
Originally posted by: hazeman Short answer is: cache. The TU (texture unit) uses the cache; normal memory access doesn't (in theory it should use the vertex cache on the 5xxx series, but it doesn't, or not always, or something like that). Also, the example is tailored for a little more efficient cache reuse.
Thanks, Hazeman, this explains a lot.
From the thread you linked, it looks like it is more an issue of what is stored in the cache and/or the memory layout than of having a cache enabled or not (i.e. tiled format vs. linear format).
Without the cache, memory transfer is ~150 GB/s (5870). Transfer from the cache is ~50 GB/s per SIMD (20 × 50 GB/s = ~1 TB/s total). So whether the cache is used or not is quite important. Even if you don't hit the cache much, it gives an advantage.
Now, the talk about layout was about squeezing out more performance. The MatMul example is doing ~1 TFLOPS, but with perfect optimizations it should be possible to get >2 TFLOPS (this has been achieved, but not with OpenCL). In the thread I linked, I was suggesting what factors they might have missed (memory/thread layout).
But the layout thing isn't a goal in itself. It's only a method to achieve high cache reuse, so that we get the ~1 TB/s transfer rate as often as possible. I doubt that the tiled format is optimal for matmult, but it's probably better than linear.
The standard MatrixMultiplication sample uses LDS instead of the texture cache. For 1 SIMD, the max theoretical bandwidth from LDS is 108.8 GB/s, which is exactly double that of the L1 TC (54.4 GB/s).
But the actual performance is much lower, as the maximum benchmarked bandwidth from LDS is only around 850 GB/s on a 5870 (it should be ~2 TB/s according to the theoretical max), and that is only for single-component data types; vec4 reads from LDS, which the sample uses, give much lower bandwidth.