Those slides refer to the pixel shader on RV670 and do not map exactly to the compute shader programming model.
1) Timing information can be found here: http://coachk.cs.ucf.edu/cours.../s08/PerfModeling.pdf
2) Those were assumptions I made for the cases where I get 0% of peak bandwidth and 75% of peak bandwidth (i.e. 0% cache hits and 75% cache hits).
4) 120 cycles is the number of cycles it takes to process a Tex instruction/L1 cache hit on an RV670 chip. I want to calculate the total amount of time it takes to execute all wavefronts, not the amount of latency that needs to be covered up.
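As a sketch of what "total execution time" means here (as opposed to latency that must be hidden), a simple serial model can be written down. Only the 120-cycle figure comes from the discussion; the wavefront count, Tex instructions per wavefront, and engine clock below are illustrative placeholders, and overlap between SIMDs is ignored:

```python
# Rough serial model of total Tex execution time across all wavefronts.
# Only TEX_CYCLES comes from the discussion above; the other numbers
# (wavefront count, Tex instructions per wavefront, engine clock) are
# hypothetical, and overlap between SIMDs is ignored.
TEX_CYCLES = 120          # cycles per Tex instruction / L1 hit on RV670
ENGINE_CLOCK_HZ = 775e6   # assumed engine clock, for illustration only

def total_tex_time_s(num_wavefronts, tex_instr_per_wavefront):
    cycles = num_wavefronts * tex_instr_per_wavefront * TEX_CYCLES
    return cycles / ENGINE_CLOCK_HZ

# e.g. 4096 wavefronts, 4 Tex instructions each:
print(total_tex_time_s(4096, 4))
```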
5) If you map out all the data reads for a single wavefront, A and B^T both hit 4 cache lines and B hits 2; this is because it is a 4x8 block versus an 8x4 block.
6) This animation shows how data fits into the cache and how it overflows the cache because too many wavefronts are bringing in data at once.
Thank you first of all. Please help me with the following problems:
1. Using the formula in the doc you gave, I can get results close to yours for ALU time and Tex time, but I still have a problem with the bandwidth under the 0% and 75% cache hit conditions.
There are 4 outputs, each in float4 format, right? Then according to the formula, Mem time = (262144 * 4 * 128 bit) / (256 bit * 1150 MHz * 2), which is quite different from yours. What is wrong?
And I don't quite understand how the cache hit ratio affects the output time; IMO, the cache hit ratio only matters for reading data.
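For reference, here is the arithmetic of the Mem-time formula as quoted in question 1, worked out numerically. All numbers come from the question itself (262144 threads, 4 float4 outputs each, a 256-bit bus at 1150 MHz DDR); whether these match the doc's figures is exactly what is being asked:

```python
# Worked version of the Mem-time formula quoted in question 1.
# All inputs are taken from the question: 262144 threads, 4 float4
# outputs each (128 bits = 16 bytes), a 256-bit bus at 1150 MHz DDR.
threads = 262144
bytes_written = threads * 4 * 16          # 4 outputs * 16 B per float4
peak_bw = (256 // 8) * 1150e6 * 2         # bytes/s: 32 B * 1150 MHz * DDR
mem_time_us = bytes_written / peak_bw * 1e6
print(mem_time_us)   # roughly 228 microseconds
```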
2. Can you explain how to calculate the active wavefront count? I think it should be floor(256 / GPRs per thread), not floor(160 / GPRs per thread). And where can I find the statement that "in CAL, if register count > 12, 160 physical registers are made available"?
3. About the cache line count: in a wavefront, each thread reads 4 float4 values from 4 different parts of A in a loop, and each cache line is a 4 * 2 block of data, i.e. 4 * 2 * 16 B = 128 B. How did you get that A hits 4 cache lines?
And in the animation for kernel 1,
WF 0, A: 4 * 16B (16CL)
WF 0, B: 16 * 16B (16CL)
WF 1, A: 16 * 16B (16CL)
WF 1, B: 16 * 16B (16CL)
Can you explain why the amount of data for A is different from that for B in wavefront 0?
How does "16 cache lines" come about?
And why does the amount of data for A become 16 * 16B in wavefront 1?
Sorry for so many tedious questions; I hope they don't discourage you.
1) That time is for going all the way out to memory to read data, not for writing output.
2) This information is not public beyond that statement. It applies only to pixel shader mode, not compute shader mode. In compute shader mode, all registers except for a few are made available; in pixel shader mode, some registers are allocated to other stages of the graphics pipeline, which puts the limit at 160 registers.
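Under that 160-register limit, the active wavefront count discussed in question 2 works out as the floor formula below (a sketch; the example GPR counts are hypothetical):

```python
from math import floor

# Pixel shader mode limit when register count > 12, per the answer above.
PHYSICAL_REGS = 160

def active_wavefronts(gprs_per_thread):
    # Wavefronts that can be resident at once, using floor(160 / GPRs)
    # rather than floor(256 / GPRs).
    return floor(PHYSICAL_REGS / gprs_per_thread)

print(active_wavefronts(16))  # -> 10
print(active_wavefronts(13))  # -> 12
```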
3) The wavefront is allocated as an 8x8 block, so 8 rows of data are being read at the same time. Since a cache line is 2 rows high, that is 8/2 = 4 cache lines per input buffer; with 4 input buffers, 4 cache lines * 4 buffers = 16 cache lines. The difference between A and B in the animation is a typo; it should be the same for all buffers.
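The cache-line count in answer 3 is just this arithmetic (the wavefront block shape, cache-line height, and buffer count are all as stated above):

```python
WAVEFRONT_ROWS = 8     # wavefront allocated as an 8x8 block -> 8 rows read
CACHELINE_ROWS = 2     # each cache line covers a 4x2 block, i.e. 2 rows
INPUT_BUFFERS  = 4     # 4 input buffers, per answer 3

lines_per_buffer = WAVEFRONT_ROWS // CACHELINE_ROWS   # 8 / 2 = 4
total_lines = lines_per_buffer * INPUT_BUFFERS        # 4 * 4 = 16
print(total_lines)  # -> 16
```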