I've been reading about NDRange and am wondering how best to map it to the problem at hand. An NDRange is a 1-, 2-, or 3-dimensional index space in which each element corresponds to one kernel instance (a work-item). The NDRange appears to map most naturally onto the architectural layout of the GPU.
If I have two 10K by 10K matrices and wish to multiply them, I would undoubtedly choose a 2D NDRange. As large as possible? But since these matrices exceed the capacity of the GPU, how should I best map the A and B matrices onto the 2D NDRange that is available?
What you are describing is an out-of-core matrix multiplication problem.
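The answer is that you will have to divide the matrices into blocks (say, divide the A matrix by rows and the B matrix by columns). Then send these blocks to the GPU one pair at a time and multiply them. Each product of a row block of A with a column block of B gives one block of the result matrix.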
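To make the blocking concrete, here is a rough sketch of the host-side loop in plain Python. It is only an illustration under some assumptions: `multiply_block` is a hypothetical stand-in for copying one pair of blocks to the device and launching the kernel, `out_of_core_matmul` and the `block` parameter are names made up for this example, and the block size would in practice be chosen so that both blocks plus the output block fit in device memory. With this scheme the 2D NDRange only needs to cover one output block (one work-item per output element), not the full 10K by 10K result.

```python
def multiply_block(a_rows, b_cols):
    """Hypothetical stand-in for the device kernel.

    a_rows: a block of rows of A; b_cols: a block of columns of B.
    Each (i, j) entry plays the role of one work-item in a 2D NDRange.
    """
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a_rows]


def out_of_core_matmul(A, B, block):
    """Multiply A (n x k) by B (k x m) one block pair at a time."""
    n, m = len(A), len(B[0])
    # Keep B as a list of columns so a column block is a simple slice.
    b_columns = [list(col) for col in zip(*B)]
    C = [[0] * m for _ in range(n)]

    for r0 in range(0, n, block):          # row blocks of A
        a_panel = A[r0:r0 + block]
        for c0 in range(0, m, block):      # column blocks of B
            b_panel = b_columns[c0:c0 + block]
            # On a real device: upload both panels, run the kernel,
            # then read the output block back.
            c_block = multiply_block(a_panel, b_panel)
            for i, row in enumerate(c_block):
                C[r0 + i][c0:c0 + len(row)] = row
    return C
```

For example, `out_of_core_matmul(A, B, 2)` on 3 by 3 inputs runs four block multiplications (the last block in each direction is smaller) and returns the same result as an ordinary full multiplication. Reusing the same row block of A across all column blocks of B, as the loop order above does, means each block of A only has to be transferred once.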