It takes 3x longer than what? The CPU version?
For starters, 64x64 is far too small a problem for a GPU. Try 8192x8192 and see whether the GPU version still takes 3x longer than the CPU version, or even just try 2048x2048 or 1024x1024.
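To put rough numbers on that: a naive n x n matrix multiply does about 2n^3 floating-point operations (n multiplies and n adds for each of the n*n outputs). Here is a minimal sketch of that arithmetic; `matmul_flops` is just a name made up for illustration, and it assumes the plain triple-loop algorithm rather than whatever your kernel actually does.

```c
#include <stdio.h>

/* Approximate floating-point operation count for a naive n x n
   matrix multiply: n*n output elements, each needing n multiplies
   and n adds. Assumes the plain triple-loop algorithm. */
static double matmul_flops(double n) {
    return 2.0 * n * n * n;
}

/* Print the operation count for one matrix size. */
static void print_matmul_flops(int n) {
    printf("%d x %d: %.3e flops\n", n, n, matmul_flops((double)n));
}
```

At 64x64 that is 524288 operations in total, while 8192x8192 is roughly 1.1e12, about two million times more work. With so little arithmetic at 64x64, fixed costs such as kernel launch and data transfer can easily dominate the measurement.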
Also, as a side note, use float4s instead of floats.
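A rough idea of what that change looks like, as OpenCL C. This is only a sketch: the kernel names and the simple scaling operation are hypothetical stand-ins, not your matrix multiply, and it assumes the element count is a multiple of 4. Whether it actually helps depends on the device.

```c
// Hypothetical kernels for illustration only; names are not from the
// original post.

// Scalar version: each work-item handles one float.
__kernel void scale_scalar(__global const float *in,
                           __global float *out,
                           const float k)
{
    const size_t i = get_global_id(0);
    out[i] = in[i] * k;
}

// Vectorized version: each work-item handles one float4 (four floats).
// Assumes the element count is a multiple of 4; launch with a quarter
// as many work-items as the scalar version.
__kernel void scale_vec4(__global const float4 *in,
                         __global float4 *out,
                         const float k)
{
    const size_t i = get_global_id(0);
    out[i] = in[i] * k;  // the scalar k is applied to all four components
}
```

The host side can keep the same buffers; only the global work size changes, since each work-item now covers four elements.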
If I do:
__local float junk;
inside my OpenCL kernel, the kernel runs about 3x slower than when that declaration is absent. This is true even if I never touch junk at all. It is also true across all sizes of the matrices being multiplied, so the additional time is not some constant overhead.
If you can provide a test case, we can see if it can be fixed in time for our next release.