
davibu
Journeyman III

MatrixMulImage example

Why exactly is the MatrixMulImage example provided with the new SDK so much faster than the standard MatrixMultiplication? It is something like 2 or 3 times faster.

It looks like the only difference is that memory access is done via the new image support. Memory pinning, texture cache, etc.: what is the source of the huge performance boost from using images instead of a normal memory buffer?
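
For reference, the host-side difference between the two samples comes down to how a matrix is handed to the device. A minimal sketch, assuming a square n x n float matrix with n a multiple of 4 ("ctx", "matA" and the helper name are illustrative, not taken from the SDK source):

    #include <CL/cl.h>

    /* Illustrative helper, not SDK code: creates both kinds of memory
       object for an n x n float matrix "matA" in context "ctx". */
    void create_matrix_objects(cl_context ctx, float *matA, size_t n)
    {
        cl_int err;

        /* Buffer path (MatrixMultiplication): a plain linear allocation;
           kernel reads are ordinary global memory loads. */
        cl_mem bufA = clCreateBuffer(ctx,
                                     CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     n * n * sizeof(float), matA, &err);

        /* Image path (MatrixMulImage): a 2D image in RGBA float format,
           so each texel packs 4 consecutive row elements; kernel reads go
           through the texture unit and its cache. */
        cl_image_format fmt = { CL_RGBA, CL_FLOAT };
        cl_mem imgA = clCreateImage2D(ctx,
                                      CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                      &fmt, n / 4, n, 0, matA, &err);

        clReleaseMemObject(bufA);
        clReleaseMemObject(imgA);
    }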

 

9 Replies
hazeman
Adept II

The short answer is: cache. The TU (texture unit) uses the cache; normal memory access doesn't (in theory it should use the vertex cache on 5xxx, but it doesn't, or not always, or something like that). Also, the example is tailored for slightly more efficient cache reuse.
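
In kernel terms the difference is just which path the reads take; a sketch of the two access styles (illustrative kernels, not the actual SDK ones):

    /* Illustrative kernels, not the SDK ones: the buffer read is a plain
       global load, while read_imagef goes through the texture unit (TU)
       and is backed by the L1 texture cache. */
    __constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                               CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;

    __kernel void copy_via_buffer(__global const float4 *A,
                                  __global float4 *out, int widthInVec4)
    {
        int x = get_global_id(0), y = get_global_id(1);
        out[y * widthInVec4 + x] = A[y * widthInVec4 + x]; /* no TU cache */
    }

    __kernel void copy_via_image(__read_only image2d_t A,
                                 __global float4 *out, int widthInVec4)
    {
        int x = get_global_id(0), y = get_global_id(1);
        out[y * widthInVec4 + x] = read_imagef(A, smp, (int2)(x, y)); /* TU */
    }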


Btw, what is the performance of these functions in GFLOPS? I assume it is single precision; what about double precision, now that multiplication and addition are supported?


On the 5870, single precision is ~1 TFLOPS. Double precision is probably ~200-300 GFLOPS. More on this here: http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=127963&enterthread=y
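
For anyone wanting to reproduce such figures: a square n x n matrix multiplication performs 2*n^3 floating-point operations (one multiply and one add per inner-product term), so the rate follows directly from the measured kernel time. A small sketch, assuming "elapsed_s" comes from your own timing code:

    /* flops = 2*n^3 for an n x n matrix multiply (one mul + one add per
       inner-product term); elapsed_s is your measured kernel time. */
    double matmul_gflops(size_t n, double elapsed_s)
    {
        double flops = 2.0 * (double)n * (double)n * (double)n;
        return flops / elapsed_s / 1e9;
    }

For example, a 2048 x 2048 multiply finishing in ~17 ms works out to roughly 1 TFLOPS.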


Maybe Teraflop?


Yep. My mistake.


Thanks. lol


Originally posted by: hazeman The short answer is: cache. The TU (texture unit) uses the cache; normal memory access doesn't (in theory it should use the vertex cache on 5xxx, but it doesn't, or not always, or something like that). Also, the example is tailored for slightly more efficient cache reuse.

Thanks, hazeman, this explains a lot.

 

From the thread you linked, it looks like it is more an issue of what is stored in the cache and/or the memory layout (i.e. tiled format vs. linear format) than of having a cache enabled or not.

From the thread you linked, it looks like it is more an issue of what is stored in the cache and/or the memory layout (i.e. tiled format vs. linear format) than of having a cache enabled or not.


Without the cache, memory transfer is ~150 GB/s (5870). Transfer from the cache is ~50 GB/s for 1 SIMD (20 x 50 GB/s, so ~1 TB/s total). So whether the cache is used or not is quite important; even if you don't hit the cache much, it gives an advantage.

Now, the talk about layout was about squeezing out more performance. The MatMul example does ~1 TFLOPS, but with perfect optimizations it should be possible to get >2 TFLOPS (this has been achieved, though not with OpenCL). In the thread I linked, I was suggesting which factors they might have missed (memory/thread layout).

But the layout thing isn't a goal in itself. It's only a method to achieve high cache reuse, so that we get the 1 TB/s transfer rate as often as possible. I doubt the tiled format is optimal for matrix multiplication, but it's probably better than linear.
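
To make the "layout is only a method to achieve cache reuse" point concrete: in the image version each work-item can compute a 4x4 block of C, so every texel fetched from A or B is used four times while it is still hot in the texture cache. A sketch of that blocking idea (illustrative, not the SDK kernel verbatim; assumes square n x n matrices packed as RGBA-float images with widthInVec4 = n/4 and a global size of (n/4, n/4)):

    __constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                               CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;

    __kernel void matmul_4x4(__read_only image2d_t A, __read_only image2d_t B,
                             __global float4 *C, int widthInVec4)
    {
        int gx = get_global_id(0);          /* vec4 column of the 4x4 block */
        int gy = get_global_id(1);          /* row of the 4x4 block */
        float4 c0 = (float4)(0.0f), c1 = (float4)(0.0f);
        float4 c2 = (float4)(0.0f), c3 = (float4)(0.0f);

        for (int k = 0; k < widthInVec4; ++k) {
            /* four rows of A reused against the same strip of B */
            float4 a0 = read_imagef(A, smp, (int2)(k, 4 * gy + 0));
            float4 a1 = read_imagef(A, smp, (int2)(k, 4 * gy + 1));
            float4 a2 = read_imagef(A, smp, (int2)(k, 4 * gy + 2));
            float4 a3 = read_imagef(A, smp, (int2)(k, 4 * gy + 3));

            float4 b0 = read_imagef(B, smp, (int2)(gx, 4 * k + 0));
            float4 b1 = read_imagef(B, smp, (int2)(gx, 4 * k + 1));
            float4 b2 = read_imagef(B, smp, (int2)(gx, 4 * k + 2));
            float4 b3 = read_imagef(B, smp, (int2)(gx, 4 * k + 3));

            c0 += a0.x * b0 + a0.y * b1 + a0.z * b2 + a0.w * b3;
            c1 += a1.x * b0 + a1.y * b1 + a1.z * b2 + a1.w * b3;
            c2 += a2.x * b0 + a2.y * b1 + a2.z * b2 + a2.w * b3;
            c3 += a3.x * b0 + a3.y * b1 + a3.z * b2 + a3.w * b3;
        }

        C[(4 * gy + 0) * widthInVec4 + gx] = c0;
        C[(4 * gy + 1) * widthInVec4 + gx] = c1;
        C[(4 * gy + 2) * widthInVec4 + gx] = c2;
        C[(4 * gy + 3) * widthInVec4 + gx] = c3;
    }

Neighbouring work-items also read the same B texels in the same iteration, which is where the tiled image layout can help the hit rate.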


The standard MatrixMultiplication sample uses LDS instead of the texture cache. For 1 SIMD, the maximum theoretical bandwidth from LDS is 108.8 GB/s, which is exactly double that of the L1 texture cache (54.4 GB/s).

But the actual performance is much lower, as the maximum benchmarked bandwidth from LDS is only around 850 GB/s on the 5870 (it should be ~2 TB/s according to the theoretical max), and that only for single-component data types; vec4 reads from LDS, which the sample uses, give much lower bandwidth.
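
For comparison, the LDS pattern that sample relies on looks roughly like this (a scalar-float sketch for clarity, not the SDK source; the actual sample uses float4 LDS reads, which, as noted above, benchmark even lower):

    /* Rough sketch of the LDS approach: each work-group stages TILE x TILE
       tiles of A and B into __local memory, then every staged element is
       read TILE times from LDS instead of global memory. Assumes n is a
       multiple of TILE and a TILE x TILE work-group size. */
    #define TILE 16

    __kernel void matmul_lds(__global const float *A, __global const float *B,
                             __global float *C, int n)
    {
        __local float tileA[TILE][TILE];
        __local float tileB[TILE][TILE];

        int lx = get_local_id(0), ly = get_local_id(1);
        int gx = get_global_id(0), gy = get_global_id(1);
        float acc = 0.0f;

        for (int t = 0; t < n; t += TILE) {
            /* one global load per element per tile... */
            tileA[ly][lx] = A[gy * n + (t + lx)];
            tileB[ly][lx] = B[(t + ly) * n + gx];
            barrier(CLK_LOCAL_MEM_FENCE);

            /* ...then TILE reuses of each staged element from LDS */
            for (int k = 0; k < TILE; ++k)
                acc += tileA[ly][k] * tileB[k][lx];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        C[gy * n + gx] = acc;
    }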
