Most, if not all, off-the-shelf matrix-matrix multiplication routines are tuned
for large matrices. The problems I am trying to run on a GPU (280X) involve a large
number (typically 200-300K) of relatively small (~40x40) matrices, which come in
batches of 2-3K. All calculations must be done in fp64. I have written my own
kernels for this, using LDS and VGPRs in various combinations, but
I still cannot beat a 6-core CPU with OpenMP.
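To make the comparison concrete, here is a minimal sketch of the kind of batched CPU loop a GPU kernel would be measured against. It is an illustration only, not the actual code: the function name batched_dgemm_small, the row-major layout, and the assumption that the matrices of a batch are stored back to back in memory are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper (not from any library): computes C[b] = A[b] * B[b]
 * for each of `batch` square n x n fp64 matrices. Assumes row-major storage
 * with the matrices of a batch packed contiguously. */
void batched_dgemm_small(size_t batch, size_t n,
                         const double *A, const double *B, double *C)
{
    /* One matrix product per iteration, spread across CPU threads.
     * Without OpenMP enabled the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for schedule(static)
    for (long b = 0; b < (long)batch; ++b) {
        const double *a = A + (size_t)b * n * n;
        const double *bm = B + (size_t)b * n * n;
        double *c = C + (size_t)b * n * n;

        /* Clear the output matrix before accumulating into it. */
        for (size_t i = 0; i < n * n; ++i)
            c[i] = 0.0;

        /* i-k-j loop order keeps the inner loop unit-stride in bm and c. */
        for (size_t i = 0; i < n; ++i) {
            for (size_t k = 0; k < n; ++k) {
                const double aik = a[i * n + k];
                for (size_t j = 0; j < n; ++j)
                    c[i * n + j] += aik * bm[k * n + j];
            }
        }
    }
}
```

For the sizes described above, a call would use batch around 2000-3000 and n = 40, compiled with OpenMP enabled (for example -fopenmp with GCC).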
Does anyone have any information or suggestions for handling this type of problem
on a Tahiti GPU?