4 Replies Latest reply on Aug 28, 2009 4:46 PM by ilghiz

    Small Lapack & ACML (with GPU) Performance Test needed



      Our company is going to upgrade its hardware for internal use, and we are deciding what to buy. Previously, on pure CPU workloads, AMD platforms were cheaper in terms of achieved GFlop/s per dollar. Right now we have some development on an NVIDIA card, so an NVIDIA GTX 260 + quad-core AMD Opterons is cheaper for us than quad-core AMD Opterons alone; however, we have not measured an HD 4870 + 4/6-core Opterons, and we have no way to test the GPU-enabled ACML ourselves.

      Would anybody help us run a small test on a GPU such as the HD 4870?

      It is pure C: http://www.elegant-mathematics.com/software/em-acml-gpu-test.c

      Please compile it with the LAPACK and ACML libraries under Linux. It takes several minutes to run and produces one page of output.

      It would be very kind if you could run the same test with and without GPU acceleration and send me both results together with your CPU and GPU configuration.

      Thank you



      Dr. Ilgis Ibragimov, VP

      Elegant Mathematics Ltd.

        • Small Lapack & ACML (with GPU) Performance Test needed

          I am trying to understand why nobody has answered my question, either here or on other GPU forums. There seem to be two possible reasons:

          1. nobody can run the GPU-enabled ACML library, which I can hardly believe, or

          2. the test shows rather poor results compared to the CPU (I do not want to believe it, but if it is true, that would be terrible).

          Hence, I will support AMD with 150 bucks by buying a 4870 card and checking this myself to learn the real answer.


          Ilgis Ibragimov

            • Small Lapack & ACML (with GPU) Performance Test needed

              Hello Dr. Ibragimov,

              First, I would like to apologize that no one responded to your original post here. When you posted the reminder on 8/25, it was forwarded to me for a response.

              I downloaded your benchmark program, em-acml-gpu-test.c ("Last Modified 21.07.08"), and found that it contained two bugs.

              The first bug is that the program would not compile until I inserted the definition "typedef int Int;". In a couple of places the type "Int", with an uppercase I, is used without ever having been defined.

              The second bug was more subtle. As I originally downloaded it, the tests of SGESDD (single-precision singular value decomposition) would fail. After some debugging, I found the problem was a typo near line 227. See the code block below.

              As originally written, the data buffer is filled with IEEE double-precision random values, and that array is then passed to SGESDD as a matrix of single-precision values. When the double-precision values' bit patterns are reinterpreted as single-precision floating-point values, many are out of range -- that is, they have a magnitude greater than 2**126. Computing the SVD on such input produces floating-point overflows and exceptions. When the logic in SGESDD detects this input condition, it fails and prints an error message.

              After those two bugs were fixed, I was able to run this benchmark.

              Company policy does not permit me to post the results here (for various legal reasons, but mainly because neither the benchmark nor the computers I ran it on have been vetted by AMD's Performance Center of Excellence). I will, however, send the output to you by private email.

              Most of the routines you are benchmarking have not been accelerated in ACML-GPU release 1.0. We hope to accelerate more routines in future releases, but some of the subroutines in your benchmark -- SAXPY and DAXPY, for example -- cannot benefit from GPU acceleration at all. The reason is that their numeric intensity -- defined as the ratio of arithmetic operations to memory operations -- is too low, so these algorithms are inherently bound by the speed of memory access. Even if the GPU were infinitely fast at the arithmetic, the time required to transfer the data from the CPU's main memory to the GPU and back would still be longer than the time the CPU needs to perform the calculation itself.

              I hope this answers your questions,


              The ACML team







              Edit: Fixed formatting issue


                • Small Lapack & ACML (with GPU) Performance Test needed

                  Dear Jim,

                  Thank you for your kind help with the tests. You are right: we normally build this test with a makefile in which Int is typedef'd as int or long, depending on the computer architecture. For simplicity I omitted the makefile and forgot to correct the source accordingly.

                  You are also definitely right about the float/double mistake; we had never tested the single-precision SVD before and added that part only for future use.

                  I have PMed you my e-mail address. Thank you for your kind work!



                  • Small Lapack & ACML (with GPU) Performance Test needed

                    Thank you for your kind results; I received them, and the DGEMM numbers are impressive!

                    Regarding your comment about the slow performance of SAXPY/DAXPY:

                    if ACML gave the user the possibility to work with GPU memory pointers directly, AXPY could be made about 50 times faster, as has already been done on the CUDA/CUBLAS platform. Many applications, such as iterative solvers, achieve very low GFlop/s but require massive memory access, so GPUs with fast memory can help a lot!