Archives Discussions

jbreitbart · ‎02-18-2010

Hi all,

I have just been playing around with the MatrixMultiplication SDK sample and I get a strange result (see below). Could someone please run the sample with the same options and tell me whether this is a bug in the SDK sample or a local problem?

I am using the 2.01 SDK and a Radeon 5870 / Core i7.

Here is my output:

jbreitbart@nv280 16:37:44 /usr/local/ati-stream/samples/opencl/bin/x86_64

$ ./MatrixMultiplication --device cpu -t -x 128 -y 192 -z 256 -e -q

Executing kernel for 1 iterations

-------------------------------------------

KernelTime (ms) : 14.752

GFlops achieved : 0.852965

Failed

jbreitbart@nv280 16:37:51 /usr/local/ati-stream/samples/opencl/bin/x86_64

$ ./MatrixMultiplication --device gpu -t -x 128 -y 192 -z 256 -e -q

Executing kernel for 1 iterations

-------------------------------------------

KernelTime (ms) : 0.081736

GFlops achieved : 153.946

Failed
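For readers unfamiliar with the sample: a "Failed" line like this usually means the device result did not match a host-side reference within some tolerance. The sketch below illustrates that kind of check; it is not the SDK's code, and the function names and the tolerance are hypothetical. Non-square sizes such as the ones in the commands above make a good test, because a mixed-up operand or dimension can go unnoticed when all three sizes are equal.

```python
def matmul_reference(a, b):
    """Naive host-side product of a (m x k) and b (k x n), given as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def matches(result, reference, tol=1e-3):
    """True if shapes agree and every element is within a relative tolerance.

    The tolerance is an illustrative choice, not the sample's actual threshold.
    """
    if len(result) != len(reference):
        return False
    for row_r, row_ref in zip(result, reference):
        if len(row_r) != len(row_ref):
            return False
        for r, ref in zip(row_r, row_ref):
            if abs(r - ref) > tol * max(1.0, abs(ref)):
                return False
    return True


# Deliberately non-square: 2 x 3 times 3 x 2.
a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
reference = matmul_reference(a, b)

# Stand-in for a device result; here it is simply the reference itself.
device_result = [row[:] for row in reference]
print("Passed" if matches(device_result, reference) else "Failed")
```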

n0thing · ‎02-18-2010

In MatrixMultiplication.cpp, change 'input0' to 'input1' on line 429.

n0thing · ‎02-18-2010

And I get around 173 GFlops with a Radeon 5770 using the above sizes.

jbreitbart · ‎02-19-2010

Thanks, that seems to help.

What performance do you get with larger matrix sizes, say

./MatrixMultiplication --device gpu -t -x 3200 -y 3200 -z 3200 -q

My result looks like:

Executing kernel for 1 iterations

-------------------------------------------

KernelTime (ms) : 174.31

GFlops achieved : 375.974

MatrixA MatrixB Time(sec) KernelTime(sec)

3200x3200 3200x3200 1.305 0.228
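The GFlops figures in this thread are consistent with counting 2*x*y*z floating-point operations (one multiply and one add per inner-product term) and dividing by the kernel time. That the sample computes it exactly this way is an assumption; the short sketch below only checks that the formula reproduces the reported numbers, using a hypothetical helper named `gflops`.

```python
def gflops(x, y, z, kernel_ms):
    """GFlops for an x*y*z matrix product, assuming 2*x*y*z operations in total."""
    flops = 2.0 * x * y * z
    seconds = kernel_ms * 1e-3
    return flops / seconds / 1e9


# (x, y, z, kernel time in ms, GFlops reported in the thread)
runs = [
    (128, 192, 256, 14.752, 0.852965),    # CPU run
    (128, 192, 256, 0.081736, 153.946),   # GPU run
    (3200, 3200, 3200, 174.31, 375.974),  # GPU run, large matrices
]

for x, y, z, ms, reported in runs:
    print(f"{x}x{y}x{z}: computed {gflops(x, y, z, ms):.4f}, reported {reported}")
```

Because the small case finishes in well under a millisecond on the GPU, its GFlops value is very sensitive to timing noise, which is one reason the larger sizes are more informative for comparisons.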

n0thing · ‎02-19-2010

I only get 35 GFlops at that size (3200 x 3200).

Can you try with a block size of 8 (-b 8)? That seems to give me more performance than the default of 16.

jbreitbart · ‎02-19-2010

Using a block size of 8 for the small matrix size actually decreases performance to about 147 GFlops. With the large matrix size, performance increases to about 475 GFlops.

Do you mind posting the specification of the rest of your hardware? Are you using Windows or Linux?

vignyan · ‎02-19-2010

Hi,

On a Mobility Radeon HD4650 / Core 2 Duo P7350 running Ubuntu 9.10, I got the following results. How can the HD5770 (more processors, higher frequency) perform worse?

./MatrixMultiplication --device gpu -t -x 3200 -y 3200 -z 3200 -q -b 8

Executing kernel for 1 iterations
-------------------------------------------
KernelTime (ms) : 1587.01
GFlops achieved : 41.2954

MatrixA MatrixB Time(sec) KernelTime(sec)
3200x3200 3200x3200 2.954 1.916

n0thing · ‎02-21-2010

I think I have a driver problem, because when I run the same sample on another system with a 5770 I get around 220 GFlops for the 3200x3200 size. Time to clean up and reinstall. (:

Btw, I tried using an 8x4 register block so that each thread writes 32 float values (currently it is only 16), but I got lower performance because the register pressure was too high. I also tried loading matrix B into local memory, but performance was low again. I can't see any other way to improve the performance.
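To make the register-block idea concrete, here is a plain Python sketch, not the sample's OpenCL kernel. Each iteration of the two outer loops plays the role of one thread that keeps a small tile of outputs in local accumulators (standing in for registers) and writes them once at the end. The 4x4 default is an assumption inferred from the "16 values" mentioned above; `blocked_matmul` and its parameters are hypothetical names.

```python
def blocked_matmul(a, b, tile_rows=4, tile_cols=4):
    """Multiply a (m x k) by b (k x n), computing one output tile per 'thread'."""
    m, k, n = len(a), len(b), len(b[0])
    if m % tile_rows or n % tile_cols:
        raise ValueError("matrix size must be a multiple of the tile size")
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile_rows):      # each (i0, j0) pair is one "thread"
        for j0 in range(0, n, tile_cols):
            # tile_rows * tile_cols accumulators: 16 for 4x4, 32 for 8x4.
            acc = [[0.0] * tile_cols for _ in range(tile_rows)]
            for p in range(k):
                for di in range(tile_rows):
                    a_val = a[i0 + di][p]  # loaded once, reused across the tile row
                    for dj in range(tile_cols):
                        acc[di][dj] += a_val * b[p][j0 + dj]
            # Write the finished tile back to the output matrix.
            for di in range(tile_rows):
                for dj in range(tile_cols):
                    c[i0 + di][j0 + dj] = acc[di][dj]
    return c


a = [[float(i + j) for j in range(8)] for i in range(8)]
b = [[float((i * j) % 5) for j in range(8)] for i in range(8)]
c_4x4 = blocked_matmul(a, b, 4, 4)   # 16 accumulators per "thread"
c_8x4 = blocked_matmul(a, b, 8, 4)   # 32 accumulators per "thread"
print(c_4x4 == c_8x4)
```

A larger tile reuses each loaded value more often but needs more accumulators per thread, which is the register-pressure trade-off described above.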

jbreitbart · ‎02-22-2010

There is a long thread at B3D about matrix multiplication on ATI hardware:

http://forum.beyond3d.com/showthread.php?t=54842

It is not about OpenCL, but maybe you can get some ideas there.

Matrixmultiplication SDK Sample - Not correct for all input sizes?