1 Reply Latest reply on Jul 12, 2010 7:07 AM by cjang

    SDK Sample MatrixMultiplication - bad performance on CPU?



      I was looking at the performance of matrix multiplication with the Stream OpenCL SDK in terms of GFLOPS, and I was surprised by how underwhelming the results are: I don't even reach 10 GFLOPS on a machine with a theoretical single-precision peak of 85.12 GFLOPS...


      Here's the plot of the results I got for the Stream SDK MatrixMultiplication sample and gotoBLAS sgemm:



      Can someone explain why I'm getting such poor performance?

        • SDK Sample MatrixMultiplication - bad performance on CPU?

          Kazushige Goto's BLAS consists of hand-crafted, ISA-optimized math kernels. It is likely near the maximum performance achievable in practice: it reaches about 67% utilization at 2048 on your chart.

          Demo and sample code is generally not highly optimized; it is written to be easy to understand, so performance is lower.

          Tuning and optimization are required for high performance, and this is not unique to GPUs. High-performance CPU BLAS implementations like gotoBLAS (hand-crafted ISA kernels) and ATLAS (auto-tuning) represent large investments in fitting the code to the hardware and compiler. It's not simple: high-performance kernels are not something translators like GCC or an OpenCL compiler can produce on their own. Optimizations at a higher level are required.