You have to divide the problem into subproblems and run a separate kernel for each subproblem.
Please have a look at the Matrix Multiplication sample in the AMD APP SDK samples.
Suppose that you have to calculate C(m x k) = A(m x n) * B(n x k).
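Divide the C matrix into four blocks of size m/2 x k/2, the A matrix into two parts of size m/2 x n (top half A1 and bottom half A2), and the B matrix into two parts of size n x k/2 (left half B1 and right half B2). Each block of C then depends on only one part of A and one part of B: C11 = A1 * B1, C12 = A1 * B2, C21 = A2 * B1, C22 = A2 * B2. So you have to calculate four independent matrix multiplications to get the resultant C, and each one can be run as its own kernel.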
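
To make the split concrete, here is a minimal host-side sketch in plain Python (not OpenCL, and not taken from the AMD sample). The function names matmul and blocked_matmul are made up for illustration, matmul only stands in for one kernel launch, and the sketch assumes that m and k are even.

```python
def matmul(a, b):
    """Plain product of two list-of-lists matrices; stands in for one kernel launch."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for row in a]


def blocked_matmul(a, b):
    """Compute C = A * B as four independent products of size m/2 x k/2."""
    m, k = len(a), len(b[0])
    assert m % 2 == 0 and k % 2 == 0, "this sketch assumes even m and k"
    hm, hk = m // 2, k // 2

    # A is split by rows into two parts of size m/2 x n.
    a_parts = [a[:hm], a[hm:]]
    # B is split by columns into two parts of size n x k/2.
    b_parts = [[row[:hk] for row in b], [row[hk:] for row in b]]

    c = [[0] * k for _ in range(m)]
    for i, a_blk in enumerate(a_parts):
        for j, b_blk in enumerate(b_parts):
            # Each (i, j) pair is one independent subproblem.
            c_blk = matmul(a_blk, b_blk)
            # Copy the m/2 x k/2 result into its place in C.
            for r in range(hm):
                c[i * hm + r][j * hk:(j + 1) * hk] = c_blk[r]
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]  # 4 x 3
b = [[1, 2], [3, 4], [5, 6]]                          # 3 x 2
print(blocked_matmul(a, b) == matmul(a, b))
```

Because the four products share no output elements, they can be issued in any order or at the same time; only the final copy into C needs to know which block each result belongs to.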