I want to write an OpenCL routine for sgemm and optimize it.
I have read the following threads related to this:
To optimise their code people have adopted:
2. Using texture cache instead of LDS
3. Using register files
They used CAL, Brook+ and IL to program the kernel. However, CAL is soon going to be deprecated in favour of OpenCL.
My question is:
How do I optimize my matrix multiply code using OpenCL on Cayman?
I have already implemented tiling and do the computation in local storage, but the results are very bad.
There is a MatrixMultiplication kernel in the SDK samples which might be helpful to you. Also, sgemm and dgemm are already available in the clAmdBlas library, so you can use that directly :)
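
When comparing kernel variants, it can help to have a host-side reference that follows the same tiling scheme mentioned in the question, so each variant's output can be checked before timing it. Below is a minimal sketch in plain C (not OpenCL, and not the SDK sample or the clAmdBlas implementation). The function name `sgemm_tiled_ref`, the `TILE` size of 16 and the row-major layout are assumptions made for illustration. The two tile arrays stand in for what a kernel would keep in `__local` memory, and `acc` stands in for the per-work-item accumulators held in registers.

```c
#include <stddef.h>
#include <string.h>

#define TILE 16  /* hypothetical tile edge, e.g. a 16x16 work-group */

/* Computes C = alpha * A * B + beta * C with all matrices row-major.
   A is M x K, B is K x N, C is M x N. Edge tiles are zero-padded. */
void sgemm_tiled_ref(size_t M, size_t N, size_t K, float alpha,
                     const float *A, const float *B,
                     float beta, float *C)
{
    float tileA[TILE][TILE];  /* stands in for a __local tile of A */
    float tileB[TILE][TILE];  /* stands in for a __local tile of B */
    float acc[TILE][TILE];    /* one accumulator per emulated work-item */

    for (size_t i0 = 0; i0 < M; i0 += TILE) {
        for (size_t j0 = 0; j0 < N; j0 += TILE) {
            memset(acc, 0, sizeof acc);

            for (size_t k0 = 0; k0 < K; k0 += TILE) {
                /* Load phase: copy one tile of A and one of B,
                   padding with zeros outside the matrix bounds. */
                for (size_t r = 0; r < TILE; ++r) {
                    for (size_t c = 0; c < TILE; ++c) {
                        tileA[r][c] = (i0 + r < M && k0 + c < K)
                                    ? A[(i0 + r) * K + (k0 + c)] : 0.0f;
                        tileB[r][c] = (k0 + r < K && j0 + c < N)
                                    ? B[(k0 + r) * N + (j0 + c)] : 0.0f;
                    }
                }
                /* Compute phase: each emulated work-item (r, c)
                   accumulates a partial dot product from the tiles. */
                for (size_t r = 0; r < TILE; ++r) {
                    for (size_t c = 0; c < TILE; ++c) {
                        float sum = acc[r][c];
                        for (size_t k = 0; k < TILE; ++k)
                            sum += tileA[r][k] * tileB[k][c];
                        acc[r][c] = sum;
                    }
                }
            }

            /* Write-back phase: skip padded rows and columns.
               C is not read when beta is zero. */
            for (size_t r = 0; r < TILE && i0 + r < M; ++r) {
                for (size_t c = 0; c < TILE && j0 + c < N; ++c) {
                    size_t idx = (i0 + r) * N + (j0 + c);
                    float old = (beta == 0.0f) ? 0.0f : beta * C[idx];
                    C[idx] = alpha * acc[r][c] + old;
                }
            }
        }
    }
}
```

In an actual kernel the load and compute phases would be separated by `barrier(CLK_LOCAL_MEM_FENCE)`, and the three inner loops over `r` and `c` would be replaced by the work-item IDs. Comparing the device output against this reference with a small tolerance is one way to confirm that a faster variant is still correct.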