You could try the AMD OpenCL BLAS library.
You might want to check out the gemm family of functions in that library (sgemm, dgemm, etc.).
Yeah, right. Check out the gemm routine in the clAmdBlas library.
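As a suggestion, I would not recommend doing a 127x127 matrix multiplication on a GPU as is. It may be better to increase the size of the matrices by padding them with zeros, so that each dimension becomes a multiple of 2 (for example 128x128, which is also a power of 2). Zero padding leaves the top-left 127x127 block of the product unchanged. The work distribution can be quite suboptimal for odd-sized matrices.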
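To illustrate the padding idea, here is a minimal sketch in plain C that copies a row-major matrix into a larger zero-filled buffer before it is handed to a gemm routine. The helper names `round_up` and `pad_matrix` are hypothetical (they are not part of clAmdBlas), and a tile size of 16 is only an assumption; the best value depends on the device and the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: round n up to the next multiple of tile. */
static size_t round_up(size_t n, size_t tile)
{
    assert(tile > 0);
    return ((n + tile - 1) / tile) * tile;
}

/*
 * Hypothetical helper: copy a row-major rows x cols matrix into a newly
 * allocated, zero-filled row-major buffer of padded_rows x padded_cols.
 * Returns NULL if the padded size is too small or allocation fails.
 * The caller owns the result and must free() it.
 */
static float *pad_matrix(const float *src, size_t rows, size_t cols,
                         size_t padded_rows, size_t padded_cols)
{
    if (padded_rows < rows || padded_cols < cols)
        return NULL;

    /* calloc zero-fills, so the padding region is already 0.0f. */
    float *dst = calloc(padded_rows * padded_cols, sizeof *dst);
    if (dst == NULL)
        return NULL;

    /* Copy one source row at a time; the row stride is now padded_cols. */
    for (size_t r = 0; r < rows; ++r)
        memcpy(dst + r * padded_cols, src + r * cols, cols * sizeof *dst);

    return dst;
}
```

For the 127x127 case, something like `size_t n = round_up(127, 16);` followed by `pad_matrix(a, 127, 127, n, n)` would give a 128x128 buffer. The leading dimension passed to gemm would then be the padded width (128) rather than 127, and only the top-left 127x127 block of the result needs to be read back.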