Hi,
I've run a benchmark that does dense matrix-matrix multiplication (the BLAS3 dgemm operation) in double precision on a Radeon 7970 GPU. When I use the dgemm function provided in clAmdBlas I measure about 150 GFLOP/s. When I run the same benchmark using ViennaCL I get about 220 GFLOP/s, i.e. it's significantly faster. Could this be an issue of clAmdBlas not being tuned for Tahiti yet? As far as I can tell the kernels in clAmdBlas are precompiled into the .so file. Is it possible that the compiler wasn't as well tuned when the .so file was generated (December, based on the clAmdBlas release date) as it is now? ViennaCL compiles its kernels from source.
Another reason why I suspect this could be a compiler optimization issue is that in single precision I get slightly better performance on a 6870 than I do on the 7970.
Shouldn't I expect a significant fraction of the theoretical peak flops in both double precision and single precision for this ALU bound computation?
Thanks a lot in advance.
Dominic
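For anyone who wants to put the numbers above in context, here is a minimal sketch of how a GEMM timing turns into GFLOP/s and a fraction of peak. It uses the usual 2*M*N*K flop count. The peak figure is an assumption, not something stated in this thread: it is derived from the nominal 7970 specs (2048 ALUs, 925 MHz, double precision at 1/4 of the single-precision rate). The function names are made up for illustration.

```python
def gemm_gflops(m, n, k, seconds):
    """GFLOP/s for C = alpha*A*B + beta*C, counting 2*m*n*k flops."""
    return 2.0 * m * n * k / seconds / 1e9


def fraction_of_peak(measured_gflops, peak_gflops):
    """Measured throughput as a fraction of the theoretical peak."""
    return measured_gflops / peak_gflops


# ASSUMPTION (nominal specs, not from the thread):
# 2048 ALUs * 2 flops per FMA * 0.925 GHz, divided by 4 for double precision.
PEAK_DP_7970 = 2048 * 2 * 0.925 / 4

# The two double-precision results reported in the post above.
for name, gflops in [("clAmdBlas", 150.0), ("ViennaCL", 220.0)]:
    pct = 100 * fraction_of_peak(gflops, PEAK_DP_7970)
    print(f"{name}: {pct:.0f}% of assumed peak")
```

Under that assumed peak, both results come out well below a quarter of what the card could do in theory, which is why the question about tuning seems reasonable.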
Thanks for your post, Dominic. One of our BLAS engineers has taken a look at your post. We'll respond soon.
Bragadeesh,
Thanks for your response. In case it helps, I have also tried a Radeon 5970, which gives very similar single-precision performance to the other two cards (7970 and 6870).
Looking forward to an answer from the BLAS engineers.
Cheers,
Dominic
Are you talking about the TN variant?
Dear sarnath,
I'm talking about the NN kernel. I did try all other combinations (TN, NT, TT) and got similar results.
Cheers,
Dominic
Are you talking about DGEMM kernel performance, or performance measured on the CPU side including data
transfer between CPU and GPU? I assume you mean DGEMM kernel performance.
The DGEMM kernel doesn't perform well with the current driver (8.921), which is the one available
on AMD's public website. AMD will release a new driver soon. DGEMM runs much faster (2.5X)
with this new driver.
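To illustrate why the kernel vs. CPU-side distinction matters, here is a rough sketch of how host-device copies eat into the end-to-end figure. The PCIe bandwidth (6 GB/s) and the assumption of four matrix copies (A, B and C to the device, C back) are placeholders for illustration, not measurements from this thread, and the function name is made up.

```python
def effective_gflops(n, kernel_gflops, pcie_gb_per_s):
    """End-to-end GFLOP/s for an n x n DGEMM, counting host<->device copies."""
    flops = 2.0 * n ** 3
    kernel_s = flops / (kernel_gflops * 1e9)
    # ASSUMPTION: A, B and C are uploaded and C is read back,
    # i.e. 4 copies of an n x n matrix of 8-byte doubles.
    transfer_s = 4 * n * n * 8 / (pcie_gb_per_s * 1e9)
    return flops / (kernel_s + transfer_s) / 1e9


# Kernel rate taken from the ViennaCL figure in the thread; bandwidth is assumed.
for n in (512, 2048, 8192):
    print(n, round(effective_gflops(n, kernel_gflops=220.0, pcie_gb_per_s=6.0), 1))
```

Since the copies grow as n^2 and the arithmetic as n^3, the gap between the two numbers shrinks for large matrices but can be substantial for small ones.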
Dear solver,
Thanks a lot for your answer.
I was comparing just kernel performance. All buffers are transferred to the GPU before the kernel is launched. Once the new driver is available, will I have to update clAmdBlas, or should dgemm just be faster automatically with the clAmdBlas library I'm currently using? Any estimate of when this new driver is going to be available?
Thanks again.
Dominic
The new driver will probably be available on Feb. 29.
You may be able to improve DGEMM performance on the current driver by running
clAmdBlasTune, which is under the directory src/tools/tune.
Assuming you are using bash, first type the following command in the terminal:
export AMD_CLBLAS_STORAGE_PATH=your_kdb_directory
Then type 'clAmdBlasTune --gemm --double --store-kernel' in the terminal.
clAmdBlas will tune the parameters and dump the data to a file named Tahiti.kdb located
in the directory 'your_kdb_directory'.
Can you explain this to me briefly? I'm new here.