8 Replies Latest reply on Aug 20, 2010 9:08 AM by cjang

    DGEMM results

    ryta1203

      Does anyone have any results for DGEMM for AMD GPUs?

      Some Fermi-based Teslas are only getting 180 GFLOPS for DGEMM in some benchmarks, which I find very surprising...

      ...I would think that AMD GPUs would perform better for this type of algorithm but I'm having a hard time finding any results for the 58xx series for DGEMM.

      Anyone?

        • DGEMM results
          rick.weber

          If alpha = 1.0 and beta = 0.0; m and n are multiples of 4; k is a multiple of 2; and m <= 16384, n <= 8192, k < 8192, you can use nnsan's IL kernel to achieve just under 500 GFLOPS.

          http://galaxy.u-aizu.ac.jp/trac/note/wiki/MatrixMultiply

          See thread:

          http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=127963&STARTPAGE=4&FTVAR_FORUMVIEWTMP=Linear
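The size and scalar constraints listed above can be checked up front before dispatching to such a kernel. The following is only a sketch: the function name `meets_il_kernel_constraints` is hypothetical, and the limits are copied from this post rather than from the kernel's own documentation.

```python
def meets_il_kernel_constraints(m, n, k, alpha, beta):
    """Return True if a DGEMM call fits the limits quoted in this post.

    Hypothetical helper; the limits are taken from the post, not verified
    against the kernel source.
    """
    scalars_ok = (alpha == 1.0 and beta == 0.0)
    # m and n must be multiples of 4, k a multiple of 2.
    alignment_ok = (m % 4 == 0 and n % 4 == 0 and k % 2 == 0)
    # Upper bounds as stated: m <= 16384, n <= 8192, k < 8192 (strict).
    bounds_ok = (0 < m <= 16384 and 0 < n <= 8192 and 0 < k < 8192)
    return scalars_ok and alignment_ok and bounds_ok


if __name__ == "__main__":
    print(meets_il_kernel_constraints(3200, 3200, 3200, 1.0, 0.0))
    print(meets_il_kernel_constraints(3202, 3200, 3200, 1.0, 0.0))
```

A caller would fall back to a more general DGEMM path whenever the check fails.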

          Unfortunately, this kernel reads A and B from images rather than global memory, so there are more constraints on their dimensions. Also, A has to be transposed. C is written to global memory. So, if your matrices are normally stored on the GPU in global memory, you'll have to write some kernels to pack and transpose them into the images. Furthermore, this kernel is row major, so you'll have to come up with an analogous kernel for column major. But if the stars align just right, you can expand on this impressive work to run circles around Fermi.
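To make the transposition and row-major/column-major points concrete, here is a small host-side sketch on flat Python lists. It is not the GPU packing kernel itself, and the function names are hypothetical. It shows the index arithmetic for transposing A, and the fact that a column-major buffer holds the same numbers in the same order as the row-major buffer of the transposed matrix.

```python
def transpose_row_major(a, rows, cols):
    """Transpose a rows x cols row-major flat list into cols x rows row-major."""
    if len(a) != rows * cols:
        raise ValueError("buffer length does not match rows * cols")
    out = [0.0] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            # Element (i, j) of the input becomes element (j, i) of the output.
            out[j * rows + i] = a[i * cols + j]
    return out


def row_major_to_column_major(a, rows, cols):
    """Column-major storage of A equals row-major storage of A transposed."""
    return transpose_row_major(a, rows, cols)


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0,
         4.0, 5.0, 6.0]  # 2 x 3, row major
    print(transpose_row_major(a, 2, 3))
```

On the GPU the same index mapping would be done by a small packing kernel that writes into the image; the loop above only illustrates the addressing.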

          • DGEMM results
            cjang

            With Stream SDK v2.2 and Catalyst 10.7b on a HD 5870, I see OpenCL DGEMM kernels using images reach 303 gigaFLOPS at M/N/K = 960 and 340 gigaFLOPS at M/N/K = 3520. If matrix A is transposed, then performance is 366 gigaFLOPS at M/N/K = 3200. The benchmarks are averages over ten trials for peak kernel variants. It's lower than ISA of course but still very respectable and follows what I've seen of SGEMM performance.
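For readers converting between kernel run times and the figures quoted here, a minimal sketch follows. It assumes the conventional DGEMM operation count of 2*M*N*K floating-point operations, and it averages per-trial rates; the post does not say which count or which averaging was actually used, so both are assumptions. All function names are hypothetical.

```python
def dgemm_flops(m, n, k):
    """Conventional operation count: one multiply and one add per inner step."""
    return 2 * m * n * k


def gflops(m, n, k, seconds):
    """Rate in gigaFLOPS for one timed run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return dgemm_flops(m, n, k) / seconds / 1e9


def mean_gflops(m, n, k, trial_seconds):
    """Average of the per-trial rates (an assumed averaging method)."""
    rates = [gflops(m, n, k, t) for t in trial_seconds]
    return sum(rates) / len(rates)


def seconds_at_rate(m, n, k, rate_gflops):
    """Run time implied by a quoted rate."""
    return dgemm_flops(m, n, k) / (rate_gflops * 1e9)


if __name__ == "__main__":
    # Run time implied by 340 gigaFLOPS at M/N/K = 3520.
    print(seconds_at_rate(3520, 3520, 3520, 340.0))
```

Under these assumptions, 340 gigaFLOPS at M/N/K = 3520 corresponds to a kernel time of about 0.26 seconds.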

              • DGEMM results
                nnsan

                cjang,

                I have updated our DGEMM kernels and benchmark results. Your results with AB are indeed very close to (and even outperform) our results. At N=3200, our kernel shows 358 GF. Great work! For A^t B, a gap still exists, though. For other comparisons, see our web page.

                By the way, do you see a performance increase moving from SDK 2.1 to SDK 2.2?

                  • DGEMM results
                    cjang

                    There is a large increase in performance from SDK v2.1 to v2.2 due to OpenCL double precision mad() support.

                    In SDK v2.1 without mad(), DGEMM performance reaches 216 gigaFLOPS at M/N/K = 960 and 242 gigaFLOPS at M/N/K = 3520. With matrix A transposed, the best is 263 gigaFLOPS at M/N/K = 3200. So there's a jump of roughly 90 to 100 gigaFLOPS (about 40%) from SDK v2.1 to v2.2 due to mad().

                    However, I believe that SDK v2.2 is actually slightly slower than v2.1 when mad() is not used. The difference is very slight, perhaps 5 to 10 gigaFLOPS. I'd have to rerun benchmarks to make sure. But I seem to recall seeing this.

                    It is difficult to make direct comparisons between v2.1 and v2.2 as there appear to have been major changes between SDK versions. I am using auto-tuning with machine generated kernels. With v2.1, many kernel variations are rejected during search because the output data is bad. This does not happen with v2.2. There was also a very large jump in performance from v2.0 to v2.1 (like 30% in some cases). So I think v2.1 has an aggressive compiler. With v2.2, the compiler may be a little less aggressive.
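The search loop described above, where variants producing bad output are rejected and the fastest survivor is kept, can be sketched in miniature on the CPU. This is not the actual auto-tuner: the variants here are hand-written Python loops rather than machine-generated OpenCL kernels, and every name is hypothetical.

```python
import time


def gemm_ijk(a, b, m, n, k):
    """Reference row-major GEMM, C = A * B, on flat lists."""
    c = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i * k + p] * b[p * n + j]
            c[i * n + j] = acc
    return c


def gemm_ikj(a, b, m, n, k):
    """Same product with a different loop order."""
    c = [0.0] * (m * n)
    for i in range(m):
        for p in range(k):
            aip = a[i * k + p]
            for j in range(n):
                c[i * n + j] += aip * b[p * n + j]
    return c


def gemm_broken(a, b, m, n, k):
    """Deliberately wrong variant: drops the last inner-product term."""
    c = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            for p in range(k - 1):
                c[i * n + j] += a[i * k + p] * b[p * n + j]
    return c


def outputs_match(x, y, tol=1e-9):
    return len(x) == len(y) and all(
        abs(u - v) <= tol * max(1.0, abs(v)) for u, v in zip(x, y))


def autotune(variants, m, n, k, trials=3):
    """Reject variants with bad output, then time the rest and pick the fastest."""
    a = [float(i % 7 + 1) for i in range(m * k)]
    b = [float(i % 5 + 1) for i in range(k * n)]
    expected = gemm_ijk(a, b, m, n, k)
    timings = {}
    for name, fn in variants.items():
        if not outputs_match(fn(a, b, m, n, k), expected):
            continue  # bad output data: drop this variant from the search
        best = float("inf")
        for _ in range(trials):
            start = time.perf_counter()
            fn(a, b, m, n, k)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    winner = min(timings, key=timings.get) if timings else None
    return winner, timings


if __name__ == "__main__":
    variants = {"ijk": gemm_ijk, "ikj": gemm_ikj, "broken": gemm_broken}
    print(autotune(variants, 16, 16, 16))
```

Because the winner is chosen per matrix size and per toolchain, the fastest variant under one SDK need not be the fastest under another, which is part of why version-to-version comparisons are hard.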

                      • DGEMM results
                        ryta1203

                        cjang

                          Could the performance difference when mad() is not used be due to register allocation? What is the register usage between 2.2 and 2.1 for the kernel(s) you are using?

                          • DGEMM results
                            cjang

                            It will take quite a bit more work to answer this. I am using auto-tuning with machine generated kernels. So any differences I saw were the result of running thousands of different kernels and casually noting any (statistically significant?) performance gap between the fastest kernels at particular matrix sizes. As the fastest kernels are likely to be different between SDK and driver versions, making direct comparisons is difficult.

                            I just tried using the GPU_DUMP_DEVICE_KERNEL environment variable. Nice. I agree white box analysis of generated shader code is very useful. It's just that for the style of optimization I am using, it is harder to do this kind of analysis. I rely on the brute force of an automated search. I am not tuning a small number of kernels by hand. I may see a high level pattern in what the machine is doing but drilling down to the details is not always easy.

                            However, this is a good line to pursue. I will look at it again soon.
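For anyone wanting to try the GPU_DUMP_DEVICE_KERNEL variable mentioned above from a benchmark script, here is a minimal sketch. The variable name comes from this thread. The value 3 (dump both IL and ISA) is an assumption based on my recollection of AMD's OpenCL programming guide and should be checked against the documentation for your SDK version. The helper name and the benchmark binary in the comment are hypothetical.

```python
import os


def env_with_kernel_dump(level=3, base=None):
    """Return a copy of the environment with GPU_DUMP_DEVICE_KERNEL set.

    The meaning of `level` is an assumption (3 is believed to dump both IL
    and ISA); verify it against the SDK documentation.
    """
    env = dict(os.environ if base is None else base)
    env["GPU_DUMP_DEVICE_KERNEL"] = str(level)
    return env


if __name__ == "__main__":
    env = env_with_kernel_dump()
    print(env["GPU_DUMP_DEVICE_KERNEL"])
    # Hypothetical usage: launch the benchmark with the variable set, e.g.
    #   subprocess.run(["./dgemm_bench"], env=env)
    # The variable has to be set before the OpenCL runtime compiles kernels.
```

Passing the modified environment to a child process keeps the setting out of the parent shell, so the dump is only produced for the runs where it is wanted.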