Re: OpenCL 8 GPU DGEMM (5.1 TFlop/s double precision). Heterogeneous HPL (High Performance Linpack from Top500).

Hi Anton,

I got your program compiled with OpenBLAS.

But unfortunately, that system did not have a double-precision-capable GPU 😞

So, I will find another machine, move it out and test this.

That will most likely be this Friday...

Thanks for the detailed information, it was very useful..


Is there a way to run only single precision tests?


Best Regards,


Journeyman III

Re: OpenCL 8 GPU DGEMM (5.1 TFlop/s double precision). Heterogeneous HPL (High Performance Linpack from Top500).

Is there somewhere I could download this build of High-Performance LINPACK from? I'm very interested in running it on my cluster of four 7970s to see how it compares.

Nathan Moos

Adept II

Re: OpenCL 8 GPU DGEMM (5.1 TFlop/s double precision). Heterogeneous HPL (High Performance Linpack from Top500).

We say hello to our dear heterogeneous computing friends! Today we will discuss the recent news from the battlefield, and unfortunately it is not cheerful news.

For almost two years we have been obtaining all our results with a dozen old trusty AMD Radeon 7970 GHz Ed. cards. New scientific plans led us to choose a hardware platform for the next 1-1.5 years. In addition, the new HAWAII cards seemed inspiring. Let's run the benchmarks, do some calculations and draw conclusions!

The soldiers of Applied Science are interested in at least two major properties of a device: peak performance in double precision and global memory bandwidth.

We have tested the following accelerators (+prices in Moscow, written in $$$):

- Radeon 7970 GHz Ed (1010MHz) (~350$)

- Radeon 7990 Reference (~650$)

- Radeon R9 290X Reference (~700$)

- Geforce TITAN Reference (~1000$)

All AMD cards were tested with the 12.6 and 14.1b drivers; the TITAN used 331.38.

Here are the theoretical peaks of these devices and the best real performance (we used the synthetic OpenCL kernels from the clpeak project):


After that, we launched a mini stress test: 100 iterations with 10 kernel launches in each iteration. The resulting performance of an iteration is the average of its 10 launches. This is what the performance degradation looks like. Isn't it beautiful?
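The measurement protocol described above can be sketched roughly as follows. This is only an illustration, not the code we actually ran: `stress_test`, `degradation` and the `launch_seconds` callable are hypothetical names, and the sketch assumes that each launch reports its own elapsed time in seconds (for example from OpenCL event profiling) and that the per-iteration figure is the mean of the per-launch rates.

```python
def stress_test(launch_seconds, flops_per_launch, iterations=100, launches_per_iter=10):
    """Return one GFlop/s figure per iteration, averaged over that iteration's launches.

    launch_seconds: hypothetical callable that runs one kernel launch to
    completion and returns its elapsed time in seconds.
    """
    results = []
    for _ in range(iterations):
        rates = []
        for _ in range(launches_per_iter):
            elapsed = launch_seconds()
            rates.append(flops_per_launch / elapsed / 1e9)  # GFlop/s of this launch
        results.append(sum(rates) / len(rates))             # average of the launches
    return results


def degradation(results):
    """Relative drop from the first to the last iteration (0.0 means no drop)."""
    return 1.0 - results[-1] / results[0]
```

Plotting the returned list against the iteration index gives a degradation curve of the kind discussed here; `degradation(results)` condenses it to a single number.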


After looking at these diagrams, several questions arise.

1. Why does one half of a 7990 reach 733 GFlop/s while a 7970 reaches 1007 GFlop/s? As far as we know, these chips are identical, with the same frequencies and number of SPUs.

2. Why does the kernel performance decrease on Tahiti as time goes by? This issue keeps appearing with all drivers after 12.6, and driver developers seem to do nothing about that.

3. Why is the HAWAII card so slow? The fact that it has DP/SP = 1/8 (compare with 1/4 on Tahiti) frustrated us. In addition, we notice that our non-optimized DGEMM kernels get 650 out of 704 GFlop/s, which is 92%, an unrealistic number. So we hypothesize that the chip has the full 1.4 TFlop/s performance, which is artificially (by software?) limited.
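The arithmetic behind question 3 can be checked with a short sketch. The 704 GFlop/s peak, the 650 GFlop/s DGEMM result and the 1/8 and 1/4 ratios come from the post; the shader count (2816) and the 1 GHz clock are public R9 290X reference figures that are not stated here, so treat them as assumptions. The helper name `peak_gflops` is made up for illustration.

```python
def peak_gflops(shaders, clock_ghz, dp_ratio=1.0):
    """Theoretical peak: 2 flops (one FMA) per shader per cycle, scaled by the DP/SP ratio."""
    return shaders * 2 * clock_ghz * dp_ratio


HAWAII_SHADERS = 2816    # assumption: public R9 290X spec, not given in the post
HAWAII_CLOCK_GHZ = 1.0   # assumption: reference "up to 1 GHz" engine clock
DGEMM_GFLOPS = 650       # measured value quoted in the post

dp_limited = peak_gflops(HAWAII_SHADERS, HAWAII_CLOCK_GHZ, 1 / 8)  # shipping DP/SP ratio
dp_quarter = peak_gflops(HAWAII_SHADERS, HAWAII_CLOCK_GHZ, 1 / 4)  # Tahiti-like ratio
efficiency = DGEMM_GFLOPS / dp_limited

print(f"DP peak at 1/8 rate: {dp_limited:.0f} GFlop/s")
print(f"DP peak at 1/4 rate: {dp_quarter:.0f} GFlop/s")
print(f"DGEMM efficiency against the 1/8 peak: {efficiency:.1%}")
print(f"DGEMM efficiency against the 1/4 peak: {DGEMM_GFLOPS / dp_quarter:.1%}")
```

Under these assumptions the 1/8 peak comes out at 704 GFlop/s and the 1/4 peak at about 1.4 TFlop/s, which is where the two numbers in the hypothesis come from. Against the 1/4 peak, the same non-optimized kernel would sit at under half of peak, which looks more plausible than 92%.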

The next chart contains theoretical peaks of global memory bandwidth:
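For reference, theoretical bandwidth peaks of this kind are usually derived as bus width times effective per-pin data rate. A minimal sketch follows; the bus widths and data rates are public reference specifications that do not appear in this post, so treat them as assumptions, and `peak_bandwidth_gbs` is a made-up helper name.

```python
def peak_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Theoretical peak in GB/s: bytes per transfer times effective per-pin data rate."""
    return bus_width_bits / 8 * data_rate_gbps


# Assumed reference specs (bus width in bits, effective GDDR5 rate in Gbit/s per pin).
CARDS = {
    "Radeon 7970 GHz Ed": (384, 6.0),
    "Radeon R9 290X": (512, 5.0),
    "GeForce TITAN": (384, 6.008),
}

for name, (bus_bits, rate) in CARDS.items():
    print(f"{name}: {peak_bandwidth_gbs(bus_bits, rate):.1f} GB/s")
```

Dividing a measured bandwidth by the value from this formula gives the memory efficiency of a card.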


The real bandwidth of all devices was measured with GlobalMemoryBandwidth and MemoryOptimization tests from AMD APP SDK:



Key points:

1. In some aspects AMD drivers keep getting better, and these improvements greatly affect the overall performance.

2. All AMD GPUs show excellent memory subsystem performance.

Unfortunately, we have to state that HAWAII is less suitable for scientific computations than Tahiti, and the time of cheap GPGPUs for scientists has ended.

In our view, up to this point AMD hardware was much better than hardware from NVidia. On the other hand, these advantages were neutralized by unstable drivers. The AMD drivers have since become better, but some old bugs remain (like the performance degradation) and there are some new ones, so some codes which worked with 12.6 don't work on the new drivers.

We have also tested the FirePro card (based on Tahiti architecture), and the situation was the same: unstable drivers and a lot of difficulties with multi-GPU systems.

So, the question is: what does AMD plan to do in the computational sector? Previously we had cheap and fast hardware, which could be used despite the awful drivers. At the moment the drivers are still not fully operational, and soon the FirePro cards will become very expensive (who will buy them?..).

On the other hand, in some performance metrics NVidia devices are not as good as AMD ones, but they have brilliant software support. We have tested a lot of different NVidia GPUs (GeForces and Teslas) and never encountered serious problems.

Almost all computational problems we investigate are memory-bound, and in this respect all these GPUs are approximately equal. In real runs AMD 7970s are usually better than TITANs, so we decided to stay with the 7970. But we expect that NVidia will start selling something faster in a year or so.

There is an opinion (not only ours) that if AMD does not change its dismissive attitude toward the GPGPU sphere, a lot of researchers will start choosing NVidia.

From Russia with love,



Adept III

Re: OpenCL 8 GPU DGEMM (5.1 TFlop/s double precision). Heterogeneous HPL (High Performance Linpack from Top500).

Thanks Anton, I just wanted to show my appreciation for your work and for reporting your findings.

My OpenCL application is heavily dependent on double precision performance.

A lot of people out there have the attitude that you only need double precision for a few scientific problems.

The fact that every major computer language uses double precision for calculations and float is just a storage format eludes them. Ditto Excel using 16 digits of precision.

It would be great if someone could write a paper as to why double precision is so important for all uses.

Someone needs to invent a bitcoin that depends on double precision (or higher..) performance. (finding locations of a certain pattern in a Mandelbrot set?)

Perhaps this would raise the desire for high DP performance in the general user base.