I want to write an OpenCL routine for SGEMM and optimize it.
I have read the following threads related to this:
To optimise their code, people in those threads have adopted:
2. Using texture cache instead of LDS
3. Using register files
They have used CAL, Brook+ and IL to program the kernels. However, CAL is soon going to be deprecated in favour of OpenCL.
My question is:
How do I optimize my matrix multiply code using OpenCL on Cayman?
I have already implemented tiling, with the computation working out of local storage (LDS), but the performance is very poor.
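
For context, a tiled kernel that stages blocks in local memory generally looks like the sketch below. It is an illustrative baseline, not the exact kernel behind the results mentioned above. The kernel name `sgemm_tiled`, the tile size `TILE`, the row-major layout and the requirement that N and K be multiples of `TILE` are all assumptions made for the example.

```c
// Hypothetical tile size; the work-group must be launched as TILE x TILE.
#define TILE 16

// C = A * B, row-major. A is M x K, B is K x N, C is M x N.
// M is implied by the global size: launch with global = (N, M), local = (TILE, TILE).
// Assumes M, N and K are multiples of TILE (no edge handling in this sketch).
__kernel void sgemm_tiled(const int N, const int K,
                          __global const float *A,
                          __global const float *B,
                          __global float *C)
{
    // One tile of A and one tile of B, shared by the whole work-group.
    __local float Asub[TILE][TILE];
    __local float Bsub[TILE][TILE];

    const int lcol = get_local_id(0);   // column inside the tile
    const int lrow = get_local_id(1);   // row inside the tile
    const int col  = get_global_id(0);  // column of C computed by this work-item
    const int row  = get_global_id(1);  // row of C computed by this work-item

    float acc = 0.0f;

    // Walk along the K dimension one tile at a time.
    for (int t = 0; t < K; t += TILE) {
        // Each work-item loads one element of each tile from global memory.
        Asub[lrow][lcol] = A[row * K + (t + lcol)];
        Bsub[lrow][lcol] = B[(t + lrow) * N + col];

        // Wait until both tiles are fully loaded.
        barrier(CLK_LOCAL_MEM_FENCE);

        // Partial dot product using only local memory.
        for (int k = 0; k < TILE; ++k)
            acc += Asub[lrow][k] * Bsub[k][lcol];

        // Wait before the tiles are overwritten in the next iteration.
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    C[row * N + col] = acc;
}
```

Each work-item here produces a single element of C using scalar loads, so it does not yet use the texture-cache or register-file techniques listed above.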