Hi,
I was looking at the performance of matrix multiplication with Stream OpenCL, measured in GFLOPS, and I was surprised by how underwhelming the results are: I don't even reach 10 GFLOPS on a machine with a theoretical single-precision peak of 85.12 GFLOPS.
Here's the plot of the results I got for the Stream SDK matrix multiplication sample and for gotoBLAS sgemm:
http://img638.imageshack.us/img638/1225/gflopsvsmatrixorder.png
Can someone explain to me why I get such poor performance?
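
For context, here is a minimal sketch of how a GFLOPS figure for an N x N matrix multiplication is usually derived. It assumes the conventional count of 2*N^3 floating-point operations (one multiply and one add per inner-loop step); the SDK sample's own timing code may count or time things differently. The function names and the example timing are hypothetical, not measurements from the plot. Only the 85.12 GFLOPS peak comes from the post.

```python
# Theoretical single-precision peak quoted in the post.
PEAK_GFLOPS = 85.12


def matmul_gflops(n: int, seconds: float) -> float:
    """GFLOPS for an n x n by n x n multiply, assuming 2*n^3 operations."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    flops = 2.0 * n ** 3          # n^3 multiplies plus n^3 adds
    return flops / seconds / 1e9  # operations per second, scaled to giga


def fraction_of_peak(gflops: float, peak: float = PEAK_GFLOPS) -> float:
    """Share of the theoretical peak actually achieved (0.0 to 1.0)."""
    return gflops / peak


if __name__ == "__main__":
    # Illustrative numbers only: n = 1024 and 0.25 s are made up.
    achieved = matmul_gflops(1024, 0.25)
    print(f"{achieved:.2f} GFLOPS, {fraction_of_peak(achieved):.1%} of peak")
```

Whether the measured time includes host-to-device transfers and kernel compilation changes the result considerably, so it is worth stating which one a given plot uses.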