
Marix
Adept II

High Performance Linpack and DGEMM for Cypress GPUs

Since questions about Linpack have come up in the forums before, I want to point out that we just released the Linpack code that was run on LOEWE-CSC to put it at #22 in the November 2010 Top 500. We also published the DGEMM implementation for Cypress-type GPUs that we used, along with some documentation. Note that it is written in CAL, not OpenCL. The DGEMM reaches about 623 Gflops on 2 Magny-Cours CPUs + 1 AMD 5870.

You can grab everything from http://code.compeng.uni-frankfurt.de.

Have fun!

0 Likes
8 Replies
d_a_a_
Adept II

What an impressive achievement! But why did you choose CAL instead of OpenCL? Are there plans to write an OpenCL version?

0 Likes
Lev
Journeyman III

 

What is the performance of the 5870 by itself? As I understand it, the figure above is the combined performance of CPU + GPU. Am I right?

0 Likes

The reason for using CAL was that the first really fast kernel was written in CAL. With all the lessons learned, it would probably also be possible to write an OpenCL kernel of equivalent speed. Actually, I have heard rumors that somebody is working on it.

The 5870 on its own can do 497 Gflops DGEMM.
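To put the two figures from this thread side by side, here is a small sketch of the arithmetic. The 623 and 497 Gflops values are the ones quoted above; the 544 Gflops double-precision peak for the HD 5870 is an assumption taken from the card's published specifications, not from this thread. The difference between the combined and GPU-only numbers is only a rough indication of what the CPUs add, since the two runs are not directly comparable.

```python
# Figures quoted in this thread (Gflops).
COMBINED_DGEMM = 623.0   # 2 Magny-Cours CPUs + 1 HD 5870
GPU_ONLY_DGEMM = 497.0   # HD 5870 on its own

# Assumption, not from this thread: theoretical double-precision peak of the HD 5870.
GPU_PEAK_DP = 544.0


def share(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Rough difference between the combined and the GPU-only result.
cpu_contribution = COMBINED_DGEMM - GPU_ONLY_DGEMM

print(f"Combined minus GPU-only: {cpu_contribution:.0f} Gflops")
print(f"GPU share of combined result: {share(GPU_ONLY_DGEMM, COMBINED_DGEMM):.1f}%")
print(f"GPU result vs. assumed peak: {share(GPU_ONLY_DGEMM, GPU_PEAK_DP):.1f}%")
```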

 

0 Likes
Lev
Journeyman III

Can you please tell me the precise end-to-end performance of multiplying two matrices with a few thousand elements per dimension, including loading and storing data to and from the GPU and the kernel launch?

0 Likes

Those are some really impressive numbers. What kind of workloads will this machine be running primarily? Do you think these HPL benchmark numbers will translate well to these workloads?

Cheers,

Dominic

0 Likes

Originally posted by Lev: Can you please tell me the precise end-to-end performance of multiplying two matrices with a few thousand elements per dimension, including loading and storing data to and from the GPU and the kernel launch?


I assume you mean including transfers to and from the GPU. In that case you will get > 460 Gflops. For more numbers, check the technical report at http://code.compeng.uni-frankfurt.de/projects/caldgemm/documents.
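For readers who want to relate such a rate to wall-clock time, here is a minimal sketch using the conventional 2·m·n·k flop count for a matrix multiply. The helper names are hypothetical and are not part of caldgemm; the matrix size 4096 is just an illustrative choice, and the time it prints is only what the quoted 460 Gflops would imply, not a measurement.

```python
def gemm_flops(m, n, k):
    """Flop count for C = A*B with A (m x k) and B (k x n): one multiply and one add per term."""
    return 2 * m * n * k


def gemm_gflops(m, n, k, seconds):
    """Achieved rate in Gflops when the whole operation takes `seconds`."""
    return gemm_flops(m, n, k) / seconds / 1e9


def transfer_bytes(m, n, k, bytes_per_element=8):
    """Bytes moved if A and B are uploaded once and C is downloaded once (8 bytes per double)."""
    return bytes_per_element * (m * k + k * n + m * n)


# Illustrative size only; the rate is the lower bound quoted in this thread.
n = 4096
rate_gflops = 460.0

implied_seconds = gemm_flops(n, n, n) / (rate_gflops * 1e9)
print(f"Implied time for n={n} at {rate_gflops:.0f} Gflops: {implied_seconds:.3f} s")
print(f"Data moved to/from the GPU: {transfer_bytes(n, n, n) / 2**20:.0f} MiB")
```

To include transfers in a measured rate, `seconds` should cover upload, kernel execution, and download together.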

Originally posted by dmeiser: Those are some really impressive numbers. What kind of workloads will this machine be running primarily? Do you think these HPL benchmark numbers will translate well to these workloads?


The machine will be running a wide variety of workloads, especially from the natural sciences at the University. Not all of those have GPU codes, and it is not a GPU-centric machine, as we only have 1 GPU for 24 cores. We expect to have significant GPU usage, though. I myself am working on GPU code for one of the projects. However, we cannot really tell until it has been in use for six months or so.

0 Likes

We did implement DGEMM in OpenCL on ATI GPUs and the speed is good, but not as good as CAL.

www.netlib.org/lapack/lawnspdf/lawn228.pdf

We get just over 300 GFlop/s for DGEMM and 1.4 TFlop/s for SGEMM.

0 Likes
Lev
Journeyman III

Do you require Linux? By the way, could you just implement the ACML-GPU interface? I assume your DGEMM is a bit faster.

0 Likes