I am trying to understand why nobody answered my question here or on other GPU forums. It seems there could be two reasons:
1. nobody can run the GPU-enabled ACML library, which I can hardly believe,
2. this test may show results that are not very good compared to the CPU (I do not want to believe it, but if it is true, then it is terrible).
Hence, I will support AMD with 150 bucks by buying one 4870 card and checking this myself to learn the real answer.
Hello Dr. Ibragimov,
First, I would like to apologize that no one responded to your original post here. When you posted the reminder on 8/25, that was forwarded to me for a response.
I downloaded your benchmark program, em-acml-gpu-test.c "Last Modified 21.07.08", and found it contained two bugs.
The first bug is that the program would not compile until I inserted the definition "typedef int Int;". In a couple of places, the type "Int" (with an uppercase I) was used without having been defined anywhere.
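For readers hitting the same compile error, here is a minimal sketch of one way to make the missing type selectable at build time. The macro name EM_USE_LONG_INT and the helper sum_first are hypothetical, chosen only for illustration; they do not come from the benchmark source.

```c
#include <assert.h>

/* EM_USE_LONG_INT is a hypothetical macro: define it on the compiler
 * command line (for example -DEM_USE_LONG_INT) to get a long Int. */
#ifdef EM_USE_LONG_INT
typedef long Int;
#else
typedef int Int;   /* the definition that made the benchmark compile */
#endif

/* Hypothetical helper that uses the Int type: sum of 1..n. */
static Int sum_first(Int n)
{
    assert(n >= 0);
    Int s = 0;
    for (Int i = 1; i <= n; i++)
        s += i;
    return s;
}
```

With no macro defined, Int is a plain int, which matches the one-line fix described above.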
The second bug was more subtle. As I originally downloaded it, the tests of SGESDD (single-precision singular value decomposition) would fail. After some debugging, I found the problem was a typo near line # 227. See the code block below.
As originally written, the data buffer is filled with IEEE double-precision random values, and then that array is passed to SGESDD as a matrix of single-precision values. When the bit patterns of the double-precision values are reinterpreted as single-precision floating-point values, many are out of range -- that is, they have a magnitude greater than 2**126. Computing the SVD on such input would produce floating-point overflows and exceptions. When the logic in SGESDD detects this input condition, it fails and prints an error message.
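The effect can be reproduced in isolation. The following is only an illustrative sketch, not the benchmark's code: it views each 8-byte double as two 4-byte floats, the way a single-precision routine would see a double buffer, and counts the values that are not finite or exceed 2**126 in magnitude. All function names are made up for this example, and it assumes IEEE-754 float and double.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if x is NaN, infinite, or larger than 2^126 in magnitude.
 * The negated comparison is also true for NaN. */
static int float_out_of_range(float x)
{
    const float limit = 0x1p126f;
    return !(x >= -limit && x <= limit);
}

/* View n doubles as 2*n floats and count the out-of-range ones. */
static size_t count_bad_when_viewed_as_float(const double *d, size_t n)
{
    assert(sizeof(double) == 2 * sizeof(float));
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        float halves[2];
        memcpy(halves, &d[i], sizeof halves);  /* reinterpret the bits */
        bad += (size_t)float_out_of_range(halves[0]);
        bad += (size_t)float_out_of_range(halves[1]);
    }
    return bad;
}

/* Build a double from its raw 64-bit pattern. */
static double double_from_bits(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

For example, double_from_bits(0x3FE000007F800000ULL) is a perfectly ordinary double just above 0.5, yet its low 32 bits are the single-precision pattern for infinity, so the counter reports one bad value for it. The double 0.5 itself has all-zero low bits and reports none.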
After those two bugs were fixed, I was able to run this benchmark.
Company policy does not permit me to post the results here (for various legal reasons, but mainly because neither the benchmark nor the computers I ran it on have been vetted by AMD's Performance Center of Excellence). I will, however, send the output to you by private email.
Most of the routines you are benchmarking have not been accelerated in ACML-GPU release 1.0. We hope to have acceleration for more routines in future releases, but some of the subroutines in your benchmark -- SAXPY and DAXPY, for example -- cannot benefit from GPU acceleration. The reason is that the numeric intensity -- defined as the ratio of arithmetic operations to memory operations -- is too low, so these algorithms are inherently bound by the speed of access to memory. Even if the GPU were infinitely fast at performing the arithmetic operations, the time required to transfer the data from the CPU's main memory to the GPU and back would still be longer than the time needed for the CPU to perform the calculation.
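To put rough numbers on the numeric intensity argument, here is a small sketch that counts arithmetic and memory operations for AXPY and for a square matrix multiply. The operation counts are the usual textbook estimates (each element read or written once) and the function names are invented for this example; they are not measurements or part of ACML.

```c
#include <assert.h>
#include <stddef.h>

/* y = a*x + y on n elements: one multiply and one add per element,
 * with x[i] and y[i] read and y[i] written. */
static double axpy_intensity(size_t n)
{
    assert(n > 0);
    double flops   = 2.0 * (double)n;
    double mem_ops = 3.0 * (double)n;
    return flops / mem_ops;
}

/* C = A*B + C with square n-by-n matrices: 2*n^3 arithmetic operations,
 * assuming A, B and C are each read once and C is written once. */
static double gemm_intensity(size_t n)
{
    assert(n > 0);
    double nn      = (double)n;
    double flops   = 2.0 * nn * nn * nn;
    double mem_ops = 4.0 * nn * nn;
    return flops / mem_ops;
}
```

Under these assumptions AXPY stays at 2/3 of an arithmetic operation per memory operation no matter how large n is, while the matrix multiply grows as n/2, which is why a DGEMM-style routine can amortize the transfer to the GPU and AXPY cannot.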
I hope this answers your questions,
The ACML team
Thank you for your kind help with the tests. It is true: we normally build this test with a makefile in which Int is typedef'ed as int or long depending on the computer architecture. For simplicity I skipped the makefile and forgot to correct the source appropriately.
You are definitely right about the float/double mistake; we had never tested single-precision SVD before, and appended this part only for future use.
I PMed you my e-mail. Thank you for your kind work!
Thank you for the results. I received them, and they are impressive for DGEMM!
Regarding your comment about the slow performance of SAXPY/DAXPY:
if ACML gave the user the possibility to work with GPU memory pointers, you could increase AXPY speed 50 times, as has already been done on the CUDA/CUBLAS platform. Many applications, such as iterative solvers, achieve very low GFlop/s but require massive memory access, so GPUs with fast memory access can help a lot!