Does anyone know of a fast matrix multiplication algorithm or code for the GPU that works with arbitrary matrix sizes?

The matrix multiplication sample from the SDK seems to work only when the input matrix dimensions are multiples of 16. For example, if the input matrix is 127x127, it returns wrong results.

You could try the AMD OpenCL BLAS library:

http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-math-libraries/

You might want to check out the gemm family of functions in that library (sgemm, dgemm, etc.)
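
If it helps, here is a rough CPU reference for what an sgemm call computes in the plain, non-transposed case: C = alpha * A * B + beta * C, with A being M x K, B being K x N and C being M x N. The name sgemm_ref and the row-major, densely packed layout are my own assumptions for illustration; this is not the library's API, just a sketch you could use to check GPU results for odd sizes such as 127x127.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU reference (not part of any GPU library):
 *   C = alpha * A * B + beta * C
 * A is M x K, B is K x N, C is M x N, all row-major and densely packed.
 * M, N and K can be any sizes; nothing here needs multiples of 16. */
void sgemm_ref(size_t M, size_t N, size_t K,
               float alpha, const float *A, const float *B,
               float beta, float *C)
{
    assert(A != NULL && B != NULL && C != NULL);

    for (size_t i = 0; i < M; ++i) {
        for (size_t j = 0; j < N; ++j) {
            /* Dot product of row i of A with column j of B. */
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];
            }
            /* C is read here, so it must be initialized even when beta is 0. */
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
}
```

Calling it with M = N = K = 127, alpha = 1 and beta = 0 gives a plain matrix product that you can compare element by element (within a small floating-point tolerance) against whatever the GPU returns.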