Small Matrix Multiplication

Question asked by dns.on.gpu on Apr 16, 2015
Most, if not all, off-the-shelf routines for matrix-matrix multiplication are tuned for large matrices. The problems I am trying to run on a GPU (280X) involve a large number, typically 200-300K, of relatively small (~40x40) matrices, and they come in batches of 2-3K (all calculations must be done in fp64). I have written my own kernels for this using LDS and VGPRs in various combinations, but I still cannot beat a 6-core CPU with OpenMP.


I was wondering if anyone has any info or suggestions for doing this type of problem on a Tahiti GPU.
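
To make the workload concrete, the CPU side of such a comparison can be sketched roughly as follows. This is only an illustration, not the code that was actually benchmarked: the function name `batched_matmul`, the row-major back-to-back storage layout, and the assumption of square n x n matrices are all hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post):
// computes C[b] = A[b] * B[b] for `batch` independent n x n fp64 matrices.
// Assumed layout: row-major matrices stored back to back in one flat array.
void batched_matmul(const std::vector<double>& A,
                    const std::vector<double>& B,
                    std::vector<double>& C,
                    std::size_t batch, std::size_t n) {
    const std::size_t stride = n * n;  // elements per matrix

    // One independent matrix product per iteration. The pragma is simply
    // ignored when the compiler is not run with OpenMP enabled.
    #pragma omp parallel for schedule(static)
    for (long long b = 0; b < static_cast<long long>(batch); ++b) {
        const double* a  = A.data() + b * stride;
        const double* bm = B.data() + b * stride;
        double*       c  = C.data() + b * stride;

        for (std::size_t i = 0; i < n; ++i) {
            // Clear row i of the output before accumulating into it.
            for (std::size_t j = 0; j < n; ++j) c[i * n + j] = 0.0;

            // i-k-j loop order keeps the innermost loop unit-stride
            // over both bm and c.
            for (std::size_t k = 0; k < n; ++k) {
                const double aik = a[i * n + k];
                for (std::size_t j = 0; j < n; ++j)
                    c[i * n + j] += aik * bm[k * n + j];
            }
        }
    }
}
```

For a sense of scale, one 40x40 product is 2 * 40^3 = 128,000 floating-point operations, and the three fp64 matrices involved occupy 3 * 1600 * 8 = 38,400 bytes, so each individual problem is small and the available parallelism comes mostly from the batch dimension.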