Hi Micah,
The numbers I mentioned were the ALU peaks for the card; I can reproduce your numbers as the best the texture units can do
assuming an 8x4 submatrix. If you double the submatrix dimensions to 16x8, those figures double too, and for a sufficiently large submatrix you should be able to approach the ALU peak, which is of course a hard upper bound.
My calculation is as follows. An MxN submatrix of the product is obtained by multiplying an Mxk strip by a kxN strip, so it needs k(M+N) elements read in via texture. There are (m/M)*(n/N) such submatrices, so the total number of elements that need to be read in is (m/M)*(n/N)*k*(M+N) = (2mnk)*0.5*(1/M+1/N), i.e. the number of FLOPs (2mnk) times 0.5*(1/M+1/N).
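As a quick numeric sanity check of that counting identity (the sizes m, n, k, M, N here are illustrative choices of mine, picked so that M divides m and N divides n):

```python
# Check that (m/M)*(n/N)*k*(M+N) equals 2mnk * 0.5*(1/M + 1/N).
# All sizes below are example values, not figures from the discussion.
m, n, k = 1024, 1024, 1024   # overall matrix dimensions
M, N = 8, 4                   # submatrix (subblock) dimensions

total_reads = (m // M) * (n // N) * k * (M + N)   # elements fetched via texture
flops = 2 * m * n * k                             # multiply-add FLOP count

assert total_reads == flops * 0.5 * (1 / M + 1 / N)
```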
So the texture time is FLOPs*0.5*(1/M+1/N)*Bpp / (B/s), where Bpp is the number of bytes per element and B/s is the texturing rate in bytes per second.
Taking the texture time as a lower bound on the total time, we arrive at:
time > FLOPs*0.5*(1/M+1/N)*Bpp / (B/s), or
GFLOP/s < (GB/s) / (0.5*(1/M+1/N)*Bpp)
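This bound is easy to evaluate; a minimal sketch (the function names are my own, and the 480 GB/s figure below is the 4870 peak discussed later):

```python
# Texture-bandwidth bound on matrix-multiply throughput, per the derivation above.

def bytes_per_flop(M, N, bpp):
    # Each FLOP requires 0.5*(1/M + 1/N) element reads of bpp bytes each.
    return 0.5 * (1.0 / M + 1.0 / N) * bpp

def gflops_bound(M, N, bpp, bw_gbs):
    # GFLOP/s < (GB/s) / (bytes per FLOP)
    return bw_gbs / bytes_per_flop(M, N, bpp)

print(gflops_bound(8, 4, 4, 480))    # 8x4 floats at 480 GB/s -> 640.0
print(gflops_bound(8, 16, 4, 480))   # 8x16 floats at 480 GB/s -> 1280.0
```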
Now I have found it hard to know quite what the theoretical rate should be. For example, I read that the 3870 can do 64-bit textures at "full speed" whereas the 4870 does them at "half speed", and that the 3870 can do unfiltered and filtered reads at the same time whereas the 4870 can't. But I don't know what this means for CAL, particularly whether it makes a difference if you're sampling float2s, float4s, or double2s, and thus to, say, the formulae presented in the pdf you mention. From a "Terascale Graphics Engine" pdf by Hartog, though, I think the 4870 can do 480 GB/s peak, and, subject to the comments above, the 3870 about 200 GB/s peak.
So for floats, and for an 8x4 subblock, we indeed get the 240/640 numbers you mention. However, for an 8x16 subblock, we already get 480/1280, i.e. around the ALU limit.
So, excepting possible latency issues, I don't see why your cards shouldn't really be going at full ALU speed in either single precision or double precision! Do you agree?
To really hit this one does of course need all the data in the L1 cache at the right time, i.e. it needs some sharing between wavefronts, which is hard for me to reason about without knowing the details. I tried bigger subblocks even on a 3870 (see "matmult" and "bothstripedmmm"
in here), using a pixel shader that reads in partitioned matrices, as a guess, and outputs to global memory (thus avoiding the 8x4 limit), but it didn't show much advantage.
The number of wavefronts needed to avoid texture waits is a problem; hence a) my concern about inefficient register usage and b) my attempted use of LDS to store some of the data. I don't know whether LDS access is faster than, or independent of, the regular texturing path.
Perhaps our whole discussion is an example of where comprehensive hardware information, particularly about the memory system and about how wavefronts are actually scheduled and run, would really help developers implementing algorithms, especially given how long it takes to program in IL! I'm sure AMD would benefit from any risk taken in releasing such information, as it would let people show that they really can take full advantage of the hardware. I think the latter is essential for HPC: each GPU really has to do much better than a CPU to make using a significant GPU cluster worthwhile. Given the points you've already made, I think I can improve performance non-trivially with a few simple tweaks!
Best wishes,
Steven.