The following questions are about this lecture:
1. On page 7, how do I calculate the Theoretical Peak (GFlops) of my kernel?
2. Also on page 7, look at the perf workbook:
Perf Workbook(1k*1k):
IMO, active wavefront count (per SIMD) = floor(256 / GPRs per thread), right?
But according to this lecture, it seems not.
4. On page 16, it mentions:
Assuming 100% cache hit rate
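To make question 2 concrete, here is a minimal Python sketch of the formula as I understand it. The register pool size is a parameter: 256 is the figure I assumed, and 160 is the alternative figure that comes up later in this thread. The function name and the GPR count of 20 are hypothetical and only for illustration; they are not taken from the perf workbook.

```python
def active_wavefronts(gprs_per_thread, register_pool=256):
    """Active wavefronts per SIMD, assuming the count is limited only by
    registers: floor(register_pool / GPRs per thread)."""
    return register_pool // gprs_per_thread


# Hypothetical kernel using 20 GPRs per thread (illustrative value only).
gprs = 20
print(active_wavefronts(gprs, 256))  # pool size I assumed
print(active_wavefronts(gprs, 160))  # alternative pool size from the thread
```

With these illustrative inputs the two pool sizes give different wavefront counts, which is the gap I am asking about.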
Micah,
Thank you first of all. Please help me with the following problems:
1. Using the formulas in the doc you gave, I can get results close to yours for ALU time and Tex time, but I still have a problem with the bandwidth under the 0% and 75% cache hit conditions.
There are 4 outputs, each in float4 format, right? Then according to the formula, Mem time = (262144 * 4 * 128 bit) / (256 * 1150 MHz * 2), which is quite different from yours. What is wrong?
And I don't quite understand how the cache hit ratio affects the output time. IMO, the cache hit ratio only matters for reading data.
2. Can you explain how to calculate the active wavefront count? I think it should be floor(256 / GPRs per thread), not floor(160 / GPRs per thread). And where can I find the information that "In CAL if register count > 12, 160 physical registers are made available"?
3. About the cache line count. In a wavefront, each thread reads 4 float4 from 4 different parts of A in a loop, and each cache line is a 4 * 2 block of data, which is 4 * 2 * 16B = 128B. How did you get that A hits 4 cache lines?
And in the animation for kernel 1,
WF 0: 4 * 16B (16CL)
WF 0: 16 * 16B (16CL)
WF 1: 16 * 16B (16CL)
WF 1: 16 * 16B (16CL)
Can you explain why the amount of data of A is different from B in wavefront 0?
Where does "16 cache lines" come from?
In wavefront 1, the amount of data of A becomes 16 * 16B. Why?
Sorry for so many boring questions; I hope they won't discourage you.
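To make the arithmetic in question 1 easy to check, here is the Mem time expression written out as a minimal Python sketch. The numbers are exactly the ones quoted in question 1. The parameter names (bus width, memory clock, data-rate factor) are only my reading of what each factor means, so treat them as assumptions rather than something the doc confirms.

```python
def mem_time_seconds(count=262144, outputs=4, bits_per_output=128,
                     bus_width_bits=256, mem_clock_hz=1150e6, data_rate=2):
    """Mem time = (count * outputs * bits) / (bus width * clock * data rate).

    All default values are the figures quoted in question 1; the names
    attached to them are assumptions.
    """
    total_bits = count * outputs * bits_per_output
    bits_per_second = bus_width_bits * mem_clock_hz * data_rate
    return total_bits / bits_per_second


print(f"Mem time = {mem_time_seconds() * 1e3:.3f} ms")
```

This comes out at roughly 0.228 ms, which is the value that does not match yours.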