Hi all,
I have just been playing around with the SDK sample and I get a strange result (see below). Could someone please run the sample with the same options and tell me whether this is a bug in the SDK sample or a local problem?
I am using the 2.01 SDK and a Radeon 5870 / Core i7.
Here is my output:
jbreitbart@nv280 16:37:44 /usr/local/ati-stream/samples/opencl/bin/x86_64
$ ./MatrixMultiplication --device cpu -t -x 128 -y 192 -z 256 -e -q
Executing kernel for 1 iterations
-------------------------------------------
KernelTime (ms) : 14.752
GFlops achieved : 0.852965
Failed
jbreitbart@nv280 16:37:51 /usr/local/ati-stream/samples/opencl/bin/x86_64
$ ./MatrixMultiplication --device gpu -t -x 128 -y 192 -z 256 -e -q
Executing kernel for 1 iterations
-------------------------------------------
KernelTime (ms) : 0.081736
GFlops achieved : 153.946
Failed
In MatrixMultiplication.cpp, change 'input0' to 'input1' on line 429.
And I get around 173 GFlops with a Radeon 5770 using the above sizes.
Thanks, that seems to help.
What performance do you get with larger matrix sizes, say
./MatrixMultiplication --device gpu -t -x 3200 -y 3200 -z 3200 -q
My result looks like:
Executing kernel for 1 iterations
-------------------------------------------
KernelTime (ms) : 174.31
GFlops achieved : 375.974
MatrixA MatrixB Time(sec) KernelTime(sec)
3200x3200 3200x3200 1.305 0.228
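For anyone comparing numbers: the "GFlops achieved" values posted in this thread are consistent with counting 2*x*y*z floating-point operations (one multiply and one add per inner-product term) and dividing by the reported kernel time. This is inferred from the posted output, not taken from the sample's source, so treat it as an assumption. The helper name `gflops` below is made up for illustration.

```python
def gflops(x, y, z, kernel_time_ms):
    """Estimate GFlops for a matrix multiply, assuming 2*x*y*z operations.

    The 2*x*y*z operation count is an assumption inferred from the
    figures posted in this thread, not from MatrixMultiplication.cpp.
    """
    flops = 2.0 * x * y * z              # one multiply + one add per term
    seconds = kernel_time_ms / 1000.0    # the sample reports milliseconds
    return flops / seconds / 1e9


# Sizes and kernel times (ms) copied from the runs posted in this thread.
runs = [
    ("cpu, 128/192/256", 128, 192, 256, 14.752),
    ("gpu, 128/192/256", 128, 192, 256, 0.081736),
    ("gpu, 3200/3200/3200", 3200, 3200, 3200, 174.31),
]

for label, x, y, z, ms in runs:
    print(f"{label}: {gflops(x, y, z, ms):.3f} GFlops")
```

Plugging in your own -x/-y/-z values and KernelTime (ms) should reproduce the GFlops line of your output to within rounding, which makes it easier to check whether two posted results are really comparable.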
I only get 35 GFlops at that size (3200 x 3200).
Can you try with block-size 8 (-b 8), as that seems to give me more performance than the default 16.
Using a block size of 8 for the small matrix size actually decreases performance to about 147 GFlops. With the large matrix size, performance increases to about 475 GFlops.
Do you mind posting the specification of the rest of your hardware? Are you using Windows or Linux?
Hi,
On a Mobility Radeon HD4650 / Core2 Duo P7350 running Ubuntu 9.10, I got the following results. How can the HD5770 (more processors, higher frequency) perform worse?
./MatrixMultiplication --device gpu -t -x 3200 -y 3200 -z 3200 -q -b 8
Executing kernel for 1 iterations
-------------------------------------------
KernelTime (ms) : 1587.01
GFlops achieved : 41.2954
MatrixA MatrixB Time(sec) KernelTime(sec)
3200x3200 3200x3200 2.954 1.916
There is a long thread at B3D about matrix multiplication on ATI hardware:
http://forum.beyond3d.com/showthread.php?t=54842
It is not about OpenCL, but maybe you can get some ideas there.