
    MatrixMulImage example

    davibu

      Why exactly is the MatrixMulImage example provided with the new SDK so much faster than the standard MatrixMultiplication? It is something like 2 or 3 times faster.

      It looks like the only difference is that memory access is done via the new image support. Memory pinning, texture cache, etc.: what is the source of the huge performance boost from using an image instead of a normal memory buffer?

        • MatrixMulImage example
          hazeman

          Short answer: cache. The TU uses the cache; normal memory access doesn't (in theory it should use the vertex cache on 5xxx, but it doesn't, or not always, or something like that). The example is also tailored for slightly more efficient cache reuse.
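
          To make the difference concrete, here is a minimal sketch of the two read paths (illustrative only, not the actual SDK kernels, and it assumes A is stored as an RGBA float image in the second kernel). The image read is issued through the TU, so it goes through the texture cache; the buffer read goes straight to memory:

              __kernel void read_buffer(__global const float4* A,
                                        __global float4* out, int width)
              {
                  int x = get_global_id(0);
                  int y = get_global_id(1);
                  /* Plain buffer load: not routed through the texture cache. */
                  out[y * width + x] = A[y * width + x];
              }

              __constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                                         CLK_ADDRESS_CLAMP_TO_EDGE |
                                         CLK_FILTER_NEAREST;

              __kernel void read_image(__read_only image2d_t A,
                                       __global float4* out, int width)
              {
                  int x = get_global_id(0);
                  int y = get_global_id(1);
                  /* Image load: issued through the texture unit, so cached. */
                  out[y * width + x] = read_imagef(A, smp, (int2)(x, y));
              }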

            • MatrixMulImage example
              Lev

              Btw, what is the performance of these functions in GFLOPS? I assume it is single precision; what about double precision, now that multiplication and addition are supported?
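
              For reference, the usual convention when quoting matrix-multiply GFLOPS is 2*N^3 floating-point operations for an N x N multiply (each of the N*N output elements needs N multiplies and N adds), so a measured kernel time converts like this (a generic helper, not something from the SDK):

                  /* GFLOPS for C = A * B with N x N matrices,
                     given the measured kernel time in seconds. */
                  double matmul_gflops(int n, double seconds)
                  {
                      double flops = 2.0 * (double)n * (double)n * (double)n;
                      return flops / seconds / 1e9;
                  }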

              • MatrixMulImage example
                davibu

                Originally posted by: hazeman Short answer: cache. The TU uses the cache; normal memory access doesn't (in theory it should use the vertex cache on 5xxx, but it doesn't, or not always, or something like that). The example is also tailored for slightly more efficient cache reuse.

                Thanks, Hazeman, this explains a lot.

                From the thread you linked, it looks like the issue is more about what is stored in the cache and/or the memory layout (i.e. tiled format vs. linear format) than about whether a cache is enabled at all.

                  • MatrixMulImage example
                    hazeman

                    Originally posted by: davibu From the thread you linked, it looks like the issue is more about what is stored in the cache and/or the memory layout (i.e. tiled format vs. linear format) than about whether a cache is enabled at all.

                    Without the cache, memory transfer is ~150 GB/s (5870). Transfer from the cache is ~50 GB/s per SIMD, so 20 × 50 GB/s ≈ 1 TB/s in total. So whether the cache is used or not is quite important; even if you don't hit the cache very often, it gives an advantage.

                    Now, the talk about layout was about squeezing out more performance. The MatMul example does ~1 TFLOPS, but with perfect optimization it should be possible to get >2 TFLOPS (this has been achieved, though not with OpenCL). In the thread I linked I was suggesting which factors they might have missed (memory/thread layout).

                    But the layout isn't a goal in itself. It's only a method to achieve high cache reuse, so that we get the ~1 TB/s aggregate cache transfer rate as often as possible. I doubt that the tiled format is optimal for matrix multiplication, but it's probably better than linear.
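
                    As a toy illustration of what layout-driven cache reuse looks like (a sketch only, assuming A and B are single-channel float images; the SDK kernel is more elaborate), note how work-items in a wavefront share the same row of A and read adjacent texels of B, so most fetches hit the texture cache:

                        __constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                                                   CLK_ADDRESS_CLAMP_TO_EDGE |
                                                   CLK_FILTER_NEAREST;

                        /* One output element per work-item: neighbouring
                           work-items reuse the same row of A and adjacent
                           columns of B from the texture cache. */
                        __kernel void matmul_img(__read_only image2d_t A,
                                                 __read_only image2d_t B,
                                                 __global float* C, int n)
                        {
                            int col = get_global_id(0);
                            int row = get_global_id(1);
                            float acc = 0.0f;
                            for (int k = 0; k < n; ++k)
                                acc += read_imagef(A, smp, (int2)(k, row)).x *
                                       read_imagef(B, smp, (int2)(col, k)).x;
                            C[row * n + col] = acc;
                        }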

                      • MatrixMulImage example
                        n0thing

                        The standard MatrixMultiplication sample uses LDS instead of the texture cache. For 1 SIMD, the maximum theoretical bandwidth from LDS is 108.8 GB/s, which is exactly double that of the L1 texture cache (54.4 GB/s).

                        But the actual performance is much lower: the maximum benchmarked bandwidth from LDS is only around 850 GB/s on the 5870 (it should be 2 TB/s according to the theoretical maximum), and even that is only for single-component data types. vec4 reads from LDS, which is what the sample uses, give much lower bandwidth.
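
                        For comparison, the LDS path in the standard sample is the classic blocked scheme; roughly like this (a simplified sketch that assumes n is a multiple of the tile size, not the sample verbatim):

                            #define TILE 16

                            /* Each work-group stages a TILE x TILE tile of A and B
                               in local memory (LDS), then every work-item runs its
                               dot product out of LDS instead of global memory. */
                            __kernel void matmul_lds(__global const float* A,
                                                     __global const float* B,
                                                     __global float* C, int n)
                            {
                                __local float As[TILE][TILE];
                                __local float Bs[TILE][TILE];

                                int lx = get_local_id(0), ly = get_local_id(1);
                                int col = get_global_id(0), row = get_global_id(1);
                                float acc = 0.0f;

                                for (int t = 0; t < n; t += TILE) {
                                    /* Cooperative load: one element of each tile
                                       per work-item. */
                                    As[ly][lx] = A[row * n + (t + lx)];
                                    Bs[ly][lx] = B[(t + ly) * n + col];
                                    barrier(CLK_LOCAL_MEM_FENCE);

                                    for (int k = 0; k < TILE; ++k)
                                        acc += As[ly][k] * Bs[k][lx];
                                    barrier(CLK_LOCAL_MEM_FENCE);
                                }
                                C[row * n + col] = acc;
                            }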