Yeah, right. Check out the gemm routine in the clAmdBlas library.
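As a suggestion, I would not recommend doing a 127x127 matrix multiplication on a GPU as-is. It may be better to increase the size of the matrices by adding some zero padding, so that each dimension becomes a power of 2 (128x128 here), or at least a multiple of the work-group tile size. The work distribution can be quite suboptimal for odd-sized matrices, because the tiles along the edges end up only partly filled.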
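To show what the padding amounts to, here is a rough host-side sketch in plain Python. It is not clAmdBlas code: `round_up`, `pad_matrix`, `matmul` and `crop` are hypothetical helpers made up for this example, `matmul` is a plain CPU loop standing in for the GPU gemm call, and the tile size of 16 is only an assumed value. The point is that the extra rows and columns are zeros, so they add nothing to the product and the top-left block of the padded result is the answer you wanted.

```python
def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def pad_matrix(a, rows, cols):
    """Return a rows x cols copy of `a`, zero-filled outside the original extent."""
    padded = [[0.0] * cols for _ in range(rows)]
    for i, row in enumerate(a):
        padded[i][:len(row)] = row
    return padded


def matmul(a, b):
    """Plain triple-loop product; a stand-in for the real GPU gemm call."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def crop(c, rows, cols):
    """Keep only the top-left rows x cols block of `c`."""
    return [row[:cols] for row in c[:rows]]


if __name__ == "__main__":
    TILE = 16                      # assumed work-group tile size
    n = 127
    n_padded = round_up(n, TILE)   # padded dimension handed to the gemm call
    print(n, "->", n_padded)
```

In real code you would allocate the padded buffers once, pass the padded dimensions to gemm, and read back only the original 127x127 block of the result.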