jbreitbart
Journeyman III

MatrixMultiplication SDK sample - not correct for all input sizes?

Hi all,

 

I have just been playing around with the SDK sample and I get a strange result (see below). Could someone please run the sample with the same options and tell me whether this is a bug in the SDK sample or a local problem?

I am using the 2.01 SDK and a Radeon 5870 / Core i7.

Here is my output:

jbreitbart@nv280 16:37:44 /usr/local/ati-stream/samples/opencl/bin/x86_64

$ ./MatrixMultiplication --device cpu -t -x 128 -y 192 -z 256 -e -q

Executing kernel for 1 iterations

-------------------------------------------

KernelTime (ms) : 14.752

GFlops achieved : 0.852965

 

Failed

 

jbreitbart@nv280 16:37:51 /usr/local/ati-stream/samples/opencl/bin/x86_64

$ ./MatrixMultiplication --device gpu -t -x 128 -y 192 -z 256 -e -q

Executing kernel for 1 iterations

-------------------------------------------

KernelTime (ms) : 0.081736

GFlops achieved : 153.946

 

Failed



8 Replies
n0thing
Journeyman III

In the MatrixMultiplication.cpp file, change 'input0' to 'input1' on line 429.
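
For anyone who wants to check results independently of the sample's -e option, a CPU reference multiply for rectangular matrices might look like the sketch below. It is not the sample's own verification code: the function names, the row-major layout, and the tolerance are all assumptions made for illustration. Rectangular test sizes are worth using because a mix-up between two inputs or two dimensions can go unnoticed when everything is square.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not taken from the SDK sample.
// Row-major reference product: C (m x n) = A (m x k) * B (k x n).
std::vector<float> reference_multiply(const std::vector<float>& a,
                                      const std::vector<float>& b,
                                      std::size_t m, std::size_t k,
                                      std::size_t n) {
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t p = 0; p < k; ++p) {
            const float a_ip = a[i * k + p];
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += a_ip * b[p * n + j];
            }
        }
    }
    return c;
}

// Hypothetical helper: true when both results have the same size and every
// element agrees within a relative tolerance (the 1e-4 default is an assumption).
bool results_match(const std::vector<float>& expected,
                   const std::vector<float>& actual,
                   float rel_tol = 1e-4f) {
    if (expected.size() != actual.size()) return false;
    for (std::size_t i = 0; i < expected.size(); ++i) {
        const float scale = std::max(1.0f, std::fabs(expected[i]));
        if (std::fabs(expected[i] - actual[i]) > rel_tol * scale) return false;
    }
    return true;
}
```

The device result would be copied back to the host and passed as `actual`, with `expected` coming from `reference_multiply` on the same inputs.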

 

 

0 Likes

And I get around 173 GFlops with a Radeon 5770 using the above sizes.

0 Likes

Thanks, that seems to help.

What performance do you get with larger matrix sizes? For example:

./MatrixMultiplication --device gpu -t -x 3200 -y 3200 -z 3200 -q

My result looks like:

Executing kernel for 1 iterations

-------------------------------------------

KernelTime (ms) : 174.31

GFlops achieved : 375.974

 

MatrixA                  MatrixB                  Time(sec)                KernelTime(sec)          

3200x3200                3200x3200                1.305                    0.228               
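
The GFlops figures quoted in this thread are consistent with counting 2 * x * y * z floating-point operations (one multiply and one add per inner-product term) and dividing by the reported kernel time. That is inferred from the posted numbers, not read from the sample source, and the helper name `gflops` below is hypothetical.

```cpp
#include <cassert>

// Hypothetical helper, not part of the SDK sample.
// An (m x k) by (k x n) product performs m*n*k multiply-add pairs,
// conventionally counted as 2*m*n*k floating-point operations.
// Which of -x/-y/-z maps to m, n, k does not matter here: only the product does.
double gflops(double m, double n, double k, double kernel_time_ms) {
    assert(kernel_time_ms > 0.0);  // avoid dividing by zero
    const double flops = 2.0 * m * n * k;
    const double seconds = kernel_time_ms / 1000.0;
    return flops / seconds / 1e9;
}
```

For the run above, `gflops(3200, 3200, 3200, 174.31)` comes out at roughly 376, in line with the 375.974 the sample printed.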



0 Likes

I only get 35 GFlops at that size (3200 x 3200).

Can you try a block size of 8 (-b 8)? That seems to give me better performance than the default of 16.

0 Likes

Using a block size of 8 for the small matrix size actually decreases performance to about 147 GFlops. The performance with the large matrix size increases to about 475 GFlops.

Do you mind posting the specification of the rest of your hardware? Are you using Windows or Linux?

0 Likes

Hi,

On a Mobility Radeon HD4650 / Core 2 Duo P7350 running Ubuntu 9.10, I got the following results. How can the HD5770 (more processors, higher frequency) perform worse?

./MatrixMultiplication --device gpu -t -x 3200 -y 3200 -z 3200 -q -b 8

Executing kernel for 1 iterations
-------------------------------------------
KernelTime (ms) : 1587.01
GFlops achieved : 41.2954

MatrixA                  MatrixB                  Time(sec)                KernelTime(sec)         
3200x3200                3200x3200                2.954                    1.916

 

0 Likes

I think I have a driver problem, because when I run the same sample on another system with a 5770 I get around 220 GFlops for the 3200x3200 size. Time to clean up and reinstall. (:

Btw, I tried using an 8x4 register block so that each thread writes 32 float values (currently it is only 16), but I got lower performance because the register pressure was too high. I also tried loading matrix B into local memory, but performance was low again. I can't see any other way to improve the performance.
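
To make the register-block idea concrete, here is a plain C++ sketch in which each "work item" accumulates a small tile of the output in local variables before writing it back. It is a CPU illustration only, not the sample's OpenCL kernel: the 4x4 tile shape (16 outputs), the function names, and the row-major layout are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Assumed tile shape: 4 x 4 = 16 outputs per work item.
constexpr std::size_t TILE = 4;

// Hypothetical helper: computes the TILE x TILE block of C whose top-left
// corner is (row0, col0). Assumes row-major storage and that the output
// dimensions are multiples of TILE.
void compute_tile(const std::vector<float>& a, const std::vector<float>& b,
                  std::vector<float>& c, std::size_t k, std::size_t n,
                  std::size_t row0, std::size_t col0) {
    // Accumulators that a GPU compiler would try to keep in registers.
    float acc[TILE][TILE] = {};
    for (std::size_t p = 0; p < k; ++p) {
        for (std::size_t r = 0; r < TILE; ++r) {
            const float a_val = a[(row0 + r) * k + p];
            for (std::size_t col = 0; col < TILE; ++col) {
                acc[r][col] += a_val * b[p * n + col0 + col];
            }
        }
    }
    // Each output element is written exactly once, after accumulation.
    for (std::size_t r = 0; r < TILE; ++r) {
        for (std::size_t col = 0; col < TILE; ++col) {
            c[(row0 + r) * n + col0 + col] = acc[r][col];
        }
    }
}

// Stand-in for the kernel launch: one compute_tile call per output tile.
std::vector<float> tiled_multiply(const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  std::size_t m, std::size_t k,
                                  std::size_t n) {
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t row0 = 0; row0 < m; row0 += TILE) {
        for (std::size_t col0 = 0; col0 < n; col0 += TILE) {
            compute_tile(a, b, c, k, n, row0, col0);
        }
    }
    return c;
}
```

Growing the tile to 8x4 doubles the accumulator count from 16 to 32 per work item, which is where the register pressure described above comes from.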
0 Likes

There is a long thread at B3D about matrix multiplication on ATI hardware:

http://forum.beyond3d.com/showthread.php?t=54842

It is not about OpenCL, but maybe you can get some ideas there.