How best to map NDRange to the problem at hand?

I have been reading about NDRange and am wondering how best to map it to the problem at hand.  An NDRange is a 1-, 2-, or 3-dimensional index space in which each point corresponds to one kernel instance (a work-item).  It appears to be designed to map onto the architectural layout of the GPU.

If I have two 10K by 10K matrices and wish to multiply them, I would undoubtedly choose a 2D NDRange.  Should it be as large as possible?  And since these matrices exceed the capacity of the GPU, how should I map the A and B matrices onto the 2D NDRange that is available?


What you are describing is an out-of-core matrix multiplication problem.

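The answer is that you will have to divide the matrices into blocks (say, divide the A matrix into blocks of rows and the B matrix into blocks of columns). Then send these blocks to the GPU one pair at a time and multiply them; each pair of blocks produces one block of the result.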
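
To make the blocking concrete, here is a rough host-side sketch in plain Python rather than OpenCL. The names blocked_matmul, multiply_block and block are made up for illustration, and multiply_block only stands in for a kernel launch: on a real device it would be a 2D NDRange with one work-item per element of the output tile. It assumes that a panel of full rows of A, a panel of full columns of B and the output tile all fit in device memory together.

```python
def multiply_block(a_rows, b_cols):
    """Hypothetical stand-in for one kernel launch.

    On a GPU this would be a 2D NDRange of len(a_rows) x len(b_cols)
    work-items, each computing one dot product.
    """
    # b_cols holds columns of B, so each entry pairs up with a row of A.
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a_rows]


def blocked_matmul(a, b, block):
    """Multiply a (n x k) by b (k x m), `block` rows/columns at a time."""
    n, m = len(a), len(b[0])
    # Transpose B once so that a panel of columns is a simple slice.
    b_columns = [list(col) for col in zip(*b)]
    c = [[0] * m for _ in range(n)]
    for i in range(0, n, block):
        a_panel = a[i:i + block]              # would be copied to the device
        for j in range(0, m, block):
            b_panel = b_columns[j:j + block]  # would be copied to the device
            tile = multiply_block(a_panel, b_panel)
            # Copy the finished tile back into the host-side result.
            for r, tile_row in enumerate(tile):
                c[i + r][j:j + len(tile_row)] = tile_row
    return c
```

With this layout the NDRange size follows the tile size, not the full 10K by 10K problem: you pick the largest block for which the two panels and the output tile fit on the device. If even one panel of full rows or columns is too large, the shared dimension would also have to be split and the partial products accumulated, which this sketch does not do.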